A family of smol models for coding agents
Preview: Fantail-pro approaches Sonnet 4.5 performance at ~100× lower cost per task. [7]
Autohand Team • October 29, 2025 • 7 min read
Ashburton, Aotearoa New Zealand. Meet Fantail, our family of small language models for AI coding agents. Named after the pīwakawaka, it’s built to be fast, light, and steady for local‑first workflows with reliable accuracy. Open weights will be released later this week under CC BY 4.0.
Photo: pied fantail, Waituna Lagoon, Southland. © Glenda Rees
Fantail focuses on what small models should do best: short-turn reasoning, retrieval-aware chat, code assistance, tagging and routing, and low-latency tools. It fits where you need it: on servers, on laptops, and at the edge. It starts fast, streams quickly, and keeps costs predictable.
Three sizes
- Fantail-mini - 0.5B params - fastest and smallest
- Fantail-base - 1.3B params - balanced for RAG and chat
- Fantail-pro - 3B params - strongest for code and planning
Context windows: 8K or 32K. Quantization: Q4, Q5, Q8, or FP16. Deployment: local via Ollama or an inference server, or hosted API.
Why this matters
Teams do not always need the biggest model. They need the right model in the right place with the right latency. Fantail gives you that choice and keeps the door open for strong reasoning on a tight budget. It reduces round trips, makes tool calls feel instant, and keeps data private by running close to where work happens.
Performance at a glance
Fantail aims to set a solid baseline for the 0.5B to 3B class. We focus on coding-oriented agentic benchmarks, primarily Terminal-Bench. The numbers below are preview runs and will be refreshed after full reruns. See the footnotes for setup details and notes on eval hygiene.
Agentic benchmarks
| Benchmark | Fantail-mini | Fantail-base | Fantail-pro |
| --- | --- | --- | --- |
| Terminal-Bench (pass rate) | 31.4% | 38.6% | 42.0% |
Preview metrics only, subject to change after full reruns. Values are pass rates in percent. We evaluate with fresh seeds and report the mean over three runs for stability. [6]
Options
Pick the size that matches your workload. Mix and match context, quantization, and deployment. You can change any option later with zero code changes in Commander or your existing orchestration.
- Sizes: 0.5B, 1.3B, 3B
- Context: 8K or 32K
- Quantization: Q4, Q5, Q8, FP16
- Deployment: local via Ollama or inference server, or hosted API
- Output: text, JSON mode, or BNF-constrained format
- Safety: off, standard, or strict policies
- Fantail-mini - routing, tagging, short answers, on-device agents
- Fantail-base - RAG chat, data extraction, basic code edits
- Fantail-pro - coding help, step-by-step tasks, small plans
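As a rough sketch of how that mix and match can look in practice: if the released tags follow the size-context-quantization naming used in the Ollama examples later in this post (an assumption until the weights ship), switching configuration is just a tag change.

```sh
# Hypothetical variant tags following the size-context-quantization naming
# used later in this post; the exact tags may differ at release.
ollama run fantail-mini-8k-q4    # smallest and fastest: routing, tagging, on-device agents
ollama run fantail-base-32k-q5   # balanced: RAG chat and data extraction with longer context
ollama run fantail-pro-32k-q8    # strongest: code assistance and small plans
```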
PRFAQ
What is Fantail?
Fantail is a small language model line from Autohand. It comes in three sizes and is tuned for fast, reliable responses. It is designed to be useful in tight loops where latency matters and privacy is important.
Why is this important?
Most AI work needs speed more than breadth. A small model that starts instantly can answer, route, or call a tool before a large model finishes spinning up. That saves time and cost while keeping data closer to home.
How does Fantail compare to other small models?
Fantail targets the practical middle: solid accuracy, predictable behavior, and steady throughput. On open benchmarks like MMLU and HumanEval, Fantail-pro performs competitively for the 3B class. See charts and notes. [1] [2]
How fast is it?
On a modern laptop GPU, Fantail-mini streams at high tokens per second. Fantail-pro stays responsive and supports JSON mode without a drop in throughput. See the latency chart for a typical setup. [3]
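As a rough illustration of the low-latency path, here is what streaming looks like against a local Ollama endpoint on its default port; Ollama returns newline-delimited JSON chunks, so the first tokens arrive almost immediately. The model tag is a placeholder until the weights are published.

```sh
# Streaming sketch against a local Ollama endpoint (default port 11434).
# curl -N disables output buffering so tokens print as they arrive.
# The fantail-mini tag is a placeholder until the weights are released.
curl -N http://localhost:11434/api/generate -d '{
  "model": "fantail-mini",
  "prompt": "In one sentence, what is a fantail?",
  "stream": true
}'
```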
Can I run it on-device?
Yes. Fantail-mini and Fantail-base run comfortably on consumer GPUs and recent Apple Silicon. They also run in Docker on standard cloud instances. You can switch between local and hosted endpoints with a single config change.
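For the Docker path, a minimal sketch using Ollama's published server image looks like this; the Fantail tag is a placeholder until the weights are on Hugging Face.

```sh
# Start Ollama's published server image on a standard cloud instance.
# With the NVIDIA container toolkit installed, add --gpus=all to use the GPU.
docker run -d --name ollama -p 11434:11434 -v ollama:/root/.ollama ollama/ollama

# Pull and chat with a Fantail variant once the weights are published
# (placeholder tag until release).
docker exec -it ollama ollama run fantail-base-8k-q5
```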
Does it support structured output?
Yes. Fantail supports JSON mode and BNF-style constrained decoding. This keeps tool calls and data extraction predictable.
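As one concrete sketch, assuming a local Ollama deployment (whose generate endpoint supports a `format: "json"` option), a JSON-mode extraction call could look like the following. The model tag is a placeholder, and BNF-style constraints would go through whatever grammar interface your serving stack exposes.

```sh
# JSON-mode sketch against a local Ollama endpoint (default port 11434).
# The fantail-base tag is a placeholder until the weights are released.
curl -s http://localhost:11434/api/generate -d '{
  "model": "fantail-base",
  "prompt": "Extract the city and the date from: Meet in Ashburton on 29 October 2025. Reply as JSON.",
  "format": "json",
  "stream": false
}'
```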
What context window options are available?
Each size ships with 8K or 32K context. Pick what you need for your retrieval or code tasks. Longer contexts cost more memory. The 8K variants are ideal for local-first workflows.
How was it trained?
Fantail uses a mix of permissively licensed and synthetic data. We apply staged training: base pretraining, safety and format instruction, and targeted task tuning. We do not train on private customer data. [4]
What about safety and privacy?
Fantail supports safety policies that you can tune per deployment. For privacy, local and VPC deployment options keep tokens on infrastructure you control. We publish guidance for safe prompts and evals. [5]
How is it priced?
Fantail will ship as open weights under CC BY 4.0. You can download and run locally (for example, via Ollama) at no cost beyond your own compute, provided you include attribution as required by the license. If we offer a managed endpoint later, we’ll announce usage-based pricing and support tiers. For enterprise support and SLAs, contact us.
When can I try it?
We are releasing open weights later this week. You can test locally via Ollama by pulling the models from Hugging Face once they are published. Commander integrates with Fantail out of the box, so you can swap sizes without code changes.
Try with Ollama (later this week)
- Models: huggingface.co/autohandai (weights appear here on release)
- Ollama (Modelfile): create a Modelfile with `FROM hf://autohandai/fantail-base-8k-q5` and `PARAMETER temperature 0.2`, then run `ollama create fantail-base -f Modelfile` followed by `ollama run fantail-base` (full sketch after this list)
- Commander: set model: "fantail-pro-32k-q8" in your workflow once weights are available
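Putting the Modelfile bullet above end to end, a minimal local setup could look like this once the weights are published; the tag and the hf:// source are taken from that bullet and will only resolve after release.

```sh
# Write a Modelfile pointing at the published weights, then register and run it.
cat > Modelfile <<'EOF'
FROM hf://autohandai/fantail-base-8k-q5
PARAMETER temperature 0.2
EOF

ollama create fantail-base -f Modelfile   # register the model locally
ollama run fantail-base                   # start an interactive session
```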
Footnotes
1. MMLU: 5-shot, standard test split, prompt format fixed per size. Numbers shown are representative runs, not best-of sweeps.
2. HumanEval: pass@1 with exact match, Python only, no external tools. Temperature and top-p fixed across sizes.
3. Latency: measured on an M2 Max laptop and a single T4 in Docker. Tokenization included. Batch size 1. Streaming on.
4. Training data: permissive licenses, curated web, and synthetic tasks. We filter and deduplicate aggressively.
5. Safety: default policies align with common platform guidelines. You can tune or disable within your VPC.
6. Terminal-Bench: agentic terminal coding benchmark; harness settings are described under Methodology and Notes below.
7. Cost estimate: preview calculation comparing local Fantail inference (quantized, batch size 1) against published API list prices for a frontier model. Assumes identical tasks and response quality thresholds on Terminal-Bench harness settings; excludes data egress and storage. Subject to revision after full reruns.
Methodology and Notes
Access policy. Customers in regulated industries (for example, cybersecurity and biological research) can work with their account teams to join our early-access allowlist for private endpoints and VPC deployment.
Terminal-Bench. All scores use the default agent framework (Terminus 2) with the XML parser enabled. We average multiple runs on different days to smooth sensitivity to inference infrastructure. Each run uses batch size 1, streaming enabled, no external browsing, and JSON/BNF-constrained tool arguments when applicable. Seeds vary per run and the harness commit is recorded. We report the mean pass rate and include error bars in the paper version.
Evaluation settings. Unless noted, Fantail runs use JSON mode or BNF-constrained decoding for tool calls, temperature tuned per task family, and a fixed top-p. Local runs include M2 Max and single T4 configurations; cloud runs use a Docker image with CUDA-enabled backends. Logs include timestamps, seeds, and command-line flags. See the paper for exact commands and harness versions.
Baselines. Baseline numbers for non-Fantail models are sourced from public posts and leaderboards: vendor model pages, Terminal-Bench leaderboard (Terminus 2), and public tool-use leaderboards where available. Where a range exists, we choose the cited configuration most comparable to ours and note deviations in the paper.