A family of smol models for coding agents
Preview: Fantail-pro approaches Sonnet 4.5 performance at ~100× lower cost per task. [7]
Autohand Team • October 29, 2025 • 7 min read
Ashburton, Aotearoa New Zealand. Meet Fantail, our family of small language models for AI coding agents. Named after the pīwakawaka, it’s built to be fast, light, and steady for local‑first workflows with reliable accuracy. Open weights will be released later this week under CC BY 4.0.
Photo: pied fantail, Waituna Lagoon, Southland. © Glenda Rees
Fantail focuses on what small models should do best: short-turn reasoning, retrieval-aware chat, code assistance, tagging and routing, and low-latency tools. It fits where you need it: on servers, on laptops, and at the edge. It starts fast, streams quickly, and keeps costs predictable.
Three sizes
- Fantail-mini - 0.5B params - fastest and smallest
- Fantail-base - 1.3B params - balanced for RAG and chat
- Fantail-pro - 3B params - strongest for code and planning
Context windows: 8K or 32K. Quantization: Q4, Q5, Q8, or FP16. Deployment: local via Ollama or an inference server, or hosted API.
Why this matters
Teams do not always need the biggest model. They need the right model in the right place with the right latency. Fantail gives you that choice and keeps the door open for strong reasoning on a tight budget. It reduces round trips, makes tool calls feel instant, and keeps data private by running close to where work happens.
Performance at a glance
Fantail aims to set a solid baseline for the 0.5B to 3B class. We focus on coding-oriented agentic benchmarks, primarily Terminal-Bench. The numbers below are preview runs and will be refreshed after full reruns. See the footnotes for setup details and notes on eval hygiene.
Agentic benchmarks
| Benchmark | Fantail-mini | Fantail-base | Fantail-pro |
| --- | --- | --- | --- |
| Terminal-Bench (pass rate) | 31.4% | 38.6% | 42.0% |
Preview metrics only, subject to change after full reruns. Values are pass rates in percent. We evaluate with fresh seeds and report the mean over three runs for stability. [6]
Options
Pick the size that matches your workload. Mix and match context, quantization, and deployment. You can change any option later with zero code changes in Commander or your existing orchestration.
- Sizes: 0.5B, 1.3B, 3B
- Context: 8K or 32K
- Quantization: Q4, Q5, Q8, FP16
- Deployment: local via Ollama or inference server, or hosted API
- Output: text, JSON mode, or BNF-constrained format
- Safety: off, standard, or strict policies
- Fantail-mini - routing, tagging, short answers, on-device agents
- Fantail-base - RAG chat, data extraction, basic code edits
- Fantail-pro - coding help, step-by-step tasks, small plans
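As a rough sketch of how that mix and match can look in practice: if the released tags follow the size-context-quantization naming used in the Ollama examples later in this post (an assumption until the weights ship), switching configuration is just a tag change.

```sh
# Hypothetical variant tags following the size-context-quantization naming
# used later in this post; the exact tags may differ at release.
ollama run fantail-mini-8k-q4    # smallest and fastest: routing, tagging, on-device agents
ollama run fantail-base-32k-q5   # balanced: RAG chat and data extraction with longer context
ollama run fantail-pro-32k-q8    # strongest: code assistance and small plans
```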
PRFAQ
What is Fantail?
Fantail is a small language model line from Autohand. It comes in three sizes and is tuned for fast, reliable responses. It is designed to be useful in tight loops where latency matters and privacy is important.
Why is this important?
Most AI work needs speed more than breadth. A small model that starts instantly can answer, route, or call a tool before a large model finishes spinning up. That saves time and cost while keeping data closer to home.
How does Fantail compare to other small models?
Fantail targets the practical middle: solid accuracy, predictable behavior, and steady throughput. On open benchmarks like MMLU and HumanEval, Fantail-pro performs competitively for the 3B class. See charts and notes. [1] [2]
How fast is it?
On a modern laptop GPU, Fantail-mini streams at high tokens per second. Fantail-pro stays responsive and supports JSON mode without a drop in throughput. See the latency chart for a typical setup. [3]
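As a rough illustration of the low-latency path, here is what streaming looks like against a local Ollama endpoint on its default port; Ollama returns newline-delimited JSON chunks, so the first tokens arrive almost immediately. The model tag is a placeholder until the weights are published.

```sh
# Streaming sketch against a local Ollama endpoint (default port 11434).
# curl -N disables output buffering so tokens print as they arrive.
# The fantail-mini tag is a placeholder until the weights are released.
curl -N http://localhost:11434/api/generate -d '{
  "model": "fantail-mini",
  "prompt": "In one sentence, what is a fantail?",
  "stream": true
}'
```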
Can I run it on-device?
Yes. Fantail-mini and Fantail-base run comfortably on consumer GPUs and recent Apple Silicon. They also run in Docker on standard cloud instances. You can switch between local and hosted endpoints with a single config change.
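For the Docker path, a minimal sketch using Ollama's published server image looks like this; the Fantail tag is a placeholder until the weights are on Hugging Face.

```sh
# Start Ollama's published server image on a standard cloud instance.
# With the NVIDIA container toolkit installed, add --gpus=all to use the GPU.
docker run -d --name ollama -p 11434:11434 -v ollama:/root/.ollama ollama/ollama

# Pull and chat with a Fantail variant once the weights are published
# (placeholder tag until release).
docker exec -it ollama ollama run fantail-base-8k-q5
```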
Does it support structured output?
Yes. Fantail supports JSON mode and BNF-style constrained decoding. This keeps tool calls and data extraction predictable.
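As one concrete sketch, assuming a local Ollama deployment (whose generate endpoint supports a `format: "json"` option), a JSON-mode extraction call could look like the following. The model tag is a placeholder, and BNF-style constraints would go through whatever grammar interface your serving stack exposes.

```sh
# JSON-mode sketch against a local Ollama endpoint (default port 11434).
# The fantail-base tag is a placeholder until the weights are released.
curl -s http://localhost:11434/api/generate -d '{
  "model": "fantail-base",
  "prompt": "Extract the city and the date from: Meet in Ashburton on 29 October 2025. Reply as JSON.",
  "format": "json",
  "stream": false
}'
```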
What context window options are available?
Each size ships with 8K or 32K context. Pick what you need for your retrieval or code tasks. Longer contexts cost more memory. The 8K variants are ideal for local-first workflows.
How was it trained?
Fantail uses a mix of permissively licensed and synthetic data. We apply staged training: base pretraining, safety and format instruction, and targeted task tuning. We do not train on private customer data. [4]
What about safety and privacy?
Fantail supports safety policies that you can tune per deployment. For privacy, local and VPC deployment options keep tokens on infrastructure you control. We publish guidance for safe prompts and evals. [5]
How is it priced?
Fantail will ship as open weights under CC BY 4.0. You can download and run locally (for example, via Ollama) at no cost beyond your own compute, provided you include attribution as required by the license. If we offer a managed endpoint later, we’ll announce usage-based pricing and support tiers. For enterprise support and SLAs, contact us.
When can I try it?
We are releasing open weights later this week. You can test locally via Ollama by pulling the models from Hugging Face once they are published. Commander integrates with Fantail out of the box, so you can swap sizes without code changes.
Try with Ollama (later this week)
- Models: huggingface.co/autohandai (weights appear here on release)
- Ollama (Modelfile): create a Modelfile with `FROM hf://autohandai/fantail-base-8k-q5` and `PARAMETER temperature 0.2`, then run `ollama create fantail-base -f Modelfile` followed by `ollama run fantail-base` (full sketch after this list)
- Commander: set model: "fantail-pro-32k-q8" in your workflow once weights are available
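Putting the Modelfile bullet above end to end, a minimal local setup could look like this once the weights are published; the tag and the hf:// source are taken from that bullet and will only resolve after release.

```sh
# Write a Modelfile pointing at the published weights, then register and run it.
cat > Modelfile <<'EOF'
FROM hf://autohandai/fantail-base-8k-q5
PARAMETER temperature 0.2
EOF

ollama create fantail-base -f Modelfile   # register the model locally
ollama run fantail-base                   # start an interactive session
```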
Footnotes
1. MMLU: 5-shot, standard test split, prompt format fixed per size. Numbers shown are representative runs, not best-of sweeps.
2. HumanEval: pass@1 with exact match, Python only, no external tools. Temperature and top-p fixed across sizes.
3. Latency: measured on an M2 Max laptop and a single T4 in Docker. Tokenization included. Batch size 1. Streaming on.
4. Training data: permissive licenses, curated web, and synthetic tasks. We filter and deduplicate aggressively.
5. Safety: default policies align with common platform guidelines. You can tune or disable within your VPC.
6. Terminal-Bench: agentic terminal coding benchmark; harness settings are described under Methodology and Notes below.
7. Cost estimate: preview calculation comparing local Fantail inference (quantized, batch size 1) against published API list prices for a frontier model. Assumes identical tasks and response quality thresholds on Terminal-Bench harness settings; excludes data egress and storage. Subject to revision after full reruns.
Methodology and Notes
Access policy. Customers in regulated industries (for example, cybersecurity and biological research) can work with their account teams to join our early-access allowlist for private endpoints and VPC deployment.
Terminal-Bench. All scores use the default agent framework (Terminus 2) with the XML parser enabled. We average multiple runs on different days to smooth sensitivity to inference infrastructure. Each run uses batch size 1, streaming enabled, no external browsing, and JSON/BNF-constrained tool arguments when applicable. Seeds vary per run and the harness commit is recorded. We report the mean pass rate and include error bars in the paper version.
Evaluation settings. Unless noted, Fantail runs use JSON mode or BNF-constrained decoding for tool calls, temperature tuned per task family, and a fixed top-p. Local runs include M2 Max and single T4 configurations; cloud runs use a Docker image with CUDA-enabled backends. Logs include timestamps, seeds, and command-line flags. See the paper for exact commands and harness versions.
Baselines. Baseline numbers for non-Fantail models are sourced from public posts and leaderboards: vendor model pages, Terminal-Bench leaderboard (Terminus 2), and public tool-use leaderboards where available. Where a range exists, we choose the cited configuration most comparable to ours and note deviations in the paper.