LLMs Are Moving Local – So Why Are We Still Paying for Tokens?

Is anyone still using LLM APIs?

Open models like SmolLM3 (~3B) and Qwen2-1.5B are getting surprisingly capable - and they run fine on laptops or even phones. With Apple rolling out on-device LLMs in iOS 18, it feels like we’re entering a real local-first phase.

Small models already handle focused jobs: lightweight copilots, captioning, inspection.

And not just text - PaliGemma 2 (built on Gemma 2) and Qwen2-VL 2B can caption and reason about images locally.
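
To make that concrete, here's a minimal local captioning sketch with Qwen2-VL-2B through Hugging Face transformers. It assumes you have `transformers`, `torch`, and the `qwen-vl-utils` helper package installed, enough local memory for the 2B weights (roughly 4-5 GB at fp16), and the image path is a placeholder.

```python
# Minimal local image-captioning sketch with Qwen2-VL-2B (no cloud API involved).
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/photo.jpg"},  # placeholder local file
        {"type": "text", "text": "Caption this image in one sentence."},
    ],
}]

# Build the chat prompt and extract the image tensors the model expects.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, not the prompt.
caption = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(caption)
```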

Hardware’s there too: Apple’s Neural Engine hits ~38 TOPS on the M4, and consumer GPUs chew through 4-8B models.

Tooling’s catching up fast:

* Ollama for local runtimes (GGUF, simple CLI) - quick sketch below
* Cactus / RunLocal for mobile
* ExecuTorch / LiteRT for on-device inference
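
As a taste of the Ollama side, here's a small sketch that hits the local HTTP API instead of a paid endpoint. It assumes `ollama serve` is running on its default port (11434) and that a small model tag such as `qwen2:1.5b` has already been pulled with `ollama pull qwen2:1.5b`; the tag is just an example.

```python
# Minimal sketch: call a locally running Ollama server instead of a cloud API.
import requests

def local_generate(prompt: str, model: str = "qwen2:1.5b") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    # Everything stays on the machine: no tokens billed, nothing leaves the laptop.
    print(local_generate("Summarize why on-device LLMs matter, in two sentences."))
```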

Still some pain: iOS memory limits, packaging overhead, distillation quirks. Quantization helps, but 4-bit isn’t magic.
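
The "4-bit isn’t magic" point is mostly about memory headroom. Here's a back-of-envelope sketch (my own illustrative numbers, not from the post) of weight memory for a ~3B-parameter model at different precisions, compared against the few GiB of RAM a mobile app typically gets.

```python
# Back-of-envelope weight-memory estimate for a ~3B-parameter model.
# Illustrative only: real runtimes add KV cache, activations, and framework
# overhead on top, which is why 4-bit alone doesn't solve tight mobile limits.
PARAMS = 3.0e9  # ~3B parameters (e.g. a SmolLM3-class model)

bits_per_weight = {"fp16": 16, "int8": 8, "int4": 4}

for name, bits in bits_per_weight.items():
    gib = PARAMS * bits / 8 / 2**30
    print(f"{name:>5}: ~{gib:.1f} GiB of weights")

# fp16 lands around 5.6 GiB, int8 around 2.8 GiB, int4 around 1.4 GiB.
# Against an assumed mobile budget of a few GiB, int4 fits but leaves little
# headroom once the KV cache and the rest of the app are counted.
```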

The upside’s clear: privacy by default, offline by design, no network round-trips, no token bills.

The cloud won’t die, but local compute finally feels fun again.

What’s keeping small models from going fully on-device?
