
Just a few years ago, in the early days of Large Language Models (LLMs), I tried running them locally. Even with a high-end gaming GPU, the results were underwhelming – responses were slow and barely coherent.
Things have changed. Thanks to a process called quantization1 and a lightweight wrapper called Ollama,2 you can now get genuinely useful results on a single laptop.3 If you happen to own a couple of gaming GPUs, you can legitimately run ChatGPT-class models at home. Let’s give it a go!
But what is quantization? Think of it like image compression. It converts the model’s weights from high-precision (32-bit float) values to lower-precision ones (like 8-bit or even 4-bit integers). This dramatically shrinks the model’s size and speeds up calculations, making it possible to run powerful models on consumer hardware, often with a minimal and acceptable loss in accuracy.1
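To make that concrete, here's a quick back-of-the-envelope calculation (weights only; the KV cache and runtime overhead add a few more GB on top):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory needed just to hold the model weights, in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A 7B-parameter model at different precisions:
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
# 32-bit: ~28 GB, 16-bit: ~14 GB, 8-bit: ~7 GB, 4-bit: ~3.5 GB
```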
Why bother? Three reasons:
- Privacy. Your prompts, your data — no third-party logs.
- No surprise limits. Run what you want without an unexpected bill.
- Freedom. Use uncensored or niche models, fine-tune your own, or keep everything entirely offline.
I like to tinker, but local tooling has real utility: offline autocompletion, doc summaries, and answers without cloud fees.
Ollama Hardware Guide
You’ll need a modern GPU with NVIDIA drivers (or an Apple Silicon Mac). Check your VRAM against the table below to see what’s realistic.
| VRAM (example GPU) | Model size | Quantization | Typical use | Notes |
|---|---|---|---|---|
| 8 GB (RTX 3060 Ti) | 7B – 13B | 4_K_M / 5_K_M | Chat, basic tasks | 7B is smooth; 13B often needs lower quantization. |
| 12 GB (RTX 3060, RTX 4070) | 13B | 5_K_M | Chat + small embeddings | Llama 3 8B is a great fit; also runs Llama 2 13B. |
| 16 GB (RTX 3080 16 GB) | 34B | 4_K_M | Chat + embeddings | Room for 30B-class models such as Code Llama 34B. |
| 24 GB (RTX 3090) | 70B | 2_K / 3_K_M | Full capability | Handles Mixtral or Llama 3 70B with heavy quantization. |
| 48 GB (2 × RTX 3090) | 70B | 4_K_M / 5_K_M | Full capability | An excellent setup for 70B models.4 |
Browse models at https://ollama.com/library
A quick note on quantization names: the leading number (e.g., the 4 in 4_K_M) is the number of bits per weight. A higher number means less compression and a larger, higher-quality model; a lower number means a smaller file but potentially more accuracy loss. The K indicates the newer “k-quant” scheme, and the trailing letter (S, M, or L) selects a small, medium, or large variant within that bit level.1
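Most models in the Ollama library also publish explicitly quantized builds as tags, so you can pick a quantization to match the table above. The tag below is just an example; check the Tags page of whichever model you’re pulling:

```bash
# pull a specific quantized build instead of the default tag (exact tag names vary per model)
ollama pull llama3:8b-instruct-q4_K_M
```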
Install guide
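On Linux, the official one-line installer is the quickest route (macOS and Windows users get a regular download from ollama.com instead):

```bash
# official install script from ollama.com; sets up the ollama binary and background service
curl -fsSL https://ollama.com/install.sh | sh
```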
Security tip (Ubuntu): the install script doesn’t verify the tarball it downloads against a checksum. To check it yourself, download the release manually from the GitHub releases page and run something like:
```bash
# grab the tarball from the GitHub releases page, then check it against the published SHA-256
curl -LO https://github.com/ollama/ollama/releases/latest/download/ollama-linux-amd64.tgz
HASH=<sha256-from-release>
printf '%s  ollama-linux-amd64.tgz\n' "$HASH" | sha256sum -c -
```
(swap the filename if you downloaded the arm64 / ROCm bundle; an “OK” means the file is intact.)
Ollama in sixty seconds
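If the install went smoothly, the whole “sixty seconds” really is just one command:

```bash
ollama run mistral   # downloads Mistral 7B (roughly 4 GB) on first run, then drops you into a chat
# type your question at the >>> prompt; /bye ends the session
ollama list          # see which models are stored locally
```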

Mistral downplaying the HAL 9000 incident.
Ollama stores models in ~/.ollama/models. With Docker, mount that path (-v ~/.ollama:/root/.ollama) so downloads persist between restarts.
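For reference, a Docker invocation along those lines might look like this (the --gpus=all flag assumes the NVIDIA Container Toolkit is installed; drop it to run CPU-only):

```bash
# official ollama/ollama image with persistent model storage and the API exposed on port 11434
docker run -d --gpus=all \
  -v ~/.ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```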
Mistral 7B is a pretty small model, and won’t perform up to the standards you might be used to from paid LLM providers. But, once you have it up and running, it’s fairly easy to swap it out with more powerful models as your hardware allows.
Everyday workflows
Simon Willison’s llm CLI5
Simon Willison, a well-known developer and prompt-injection researcher, created the llm CLI. Because it composes naturally with shell pipelines, you can:
- Summarize long logs or docs
- Explain code in odd languages
- Draft quick templates
- Glue together ad-hoc automations
Example one-liners:
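These assume the llm-ollama plugin, which lets llm talk to your local Ollama models; model names follow whatever `ollama list` reports (e.g., mistral or mistral:latest):

```bash
llm install llm-ollama    # one-time: plugin that exposes local Ollama models to llm
cat server.log | llm -m mistral -s "Summarize the errors in this log"
cat strange.f90 | llm -m mistral -s "Explain what this code does, briefly"
git diff | llm -m mistral -s "Write a terse commit message for this change"
```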
VS Code
Developers increasingly rely on AI coding tools, and a couple of VS Code extensions work well with Ollama as a local backend.
So far, Continue is my favorite. It goes beyond simple autocomplete to provide a deeply integrated AI assistant that can generate and refactor code across your project. It keeps your data local, considers only files you add to context, and lets you disable telemetry. To make it truly useful, you’ll want a powerful local model – but with enough VRAM, anything is possible!

Mistral giving me a taste of my own medicine.
Home Assistant
Home Assistant has an official Ollama integration8 that adds a local conversation agent. Setup is straightforward:
- Go to Settings > Devices & Services.
- Click Add Integration and search for “Ollama”.
- Enter your Ollama server URL and choose a model.
The integration provides a conversation agent with automatic model downloads. You can also enable experimental smart-home control. For best results:
- Use a model that supports Tools (e.g., llama3.1:8b or any larger Tools-enabled model).
- Expose fewer than 25 entities.
- Consider running two instances: one for general chat and a separate one for home control, as smaller models can struggle to do both reliably.
Python snippet
Want to script Ollama? The official client makes it trivial:
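A minimal sketch with the ollama package (pip install ollama), assuming the Mistral model from earlier is still pulled and the server is running on the default port:

```python
import ollama  # pip install ollama

# one-shot chat completion against the local Ollama server (http://localhost:11434 by default)
response = ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content": "Open the pod bay doors, please, HAL."}],
)
print(response["message"]["content"])
```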
I’ve chained this into tabletop-exercise “analysts” and even a mini text-adventure game.
Compliant models
Community releases sometimes remove the safety filters, which can be handy for red-team testing; search the Ollama library for “uncensored” variants of popular models.
And more!
Ollama’s plugin ecosystem is expanding fast: Vim, Emacs, Obsidian, Word, and more.9
Wrap-up
With just a bit of setup, you can run modern LLMs on your own hardware. Start small, experiment freely, and build the local AI assistant you’ve always wanted!