Running large language models at home with Ollama

A sketch of a bunny in a slightly oddly colored woods.

Just a few years ago, in the early days of Large Language Models (LLMs), I tried running them locally. Even with a high-end gaming GPU, the results were underwhelming – responses were slow and barely coherent.

Things have changed. Thanks to a process called quantization[1] and a lightweight wrapper called Ollama,[2] you can now get genuinely useful results on a single laptop.[3] If you happen to own a couple of gaming GPUs, you can legitimately run ChatGPT-class models at home. Let’s give it a go!

But what is quantization? Think of it like image compression. It converts the model’s weights from high-precision (32-bit float) values to lower-precision ones (like 8-bit or even 4-bit integers). This dramatically shrinks the model’s size and speeds up calculations, making it possible to run powerful models on consumer hardware, often with a minimal and acceptable loss in accuracy.[1]
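To make that concrete, here is a minimal sketch of the idea in Python with NumPy – not how Ollama or llama.cpp actually implement it, just the basic round-to-fewer-bits trick:

```python
import numpy as np

# Pretend these are one layer's float32 weights.
weights = np.random.randn(4096, 4096).astype(np.float32)

# Symmetric 8-bit quantization: map the largest |weight| to 127.
scale = np.abs(weights).max() / 127.0
q = np.round(weights / scale).astype(np.int8)    # stored: 1 byte per weight
dequant = q.astype(np.float32) * scale           # reconstructed at inference time

print(f"original: {weights.nbytes / 1e6:.0f} MB, quantized: {q.nbytes / 1e6:.0f} MB")
print(f"mean absolute error: {np.abs(weights - dequant).mean():.5f}")
```

Real schemes like the 4_K_M variants in the table below are cleverer about grouping weights and choosing scales, but the size win comes from exactly this trade.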

Why bother? Three reasons:

  1. Privacy. Your prompts, your data — no third-party logs.
  2. No surprise limits. Run what you want without an unexpected bill.
  3. Freedom. Use uncensored or niche models, fine-tune your own, or keep everything entirely offline.

I like to tinker, but local tooling has real utility: offline autocompletion, doc summaries, and answers without cloud fees.

Ollama Hardware Guide

You’ll need a modern GPU with NVIDIA drivers (or a Mac/Apple Silicon device). Check your VRAM and see what is realistic using the chart below.

| Your GPU Setup | Max Model Size | Typical Quantization | Usable For | Notes |
|---|---|---|---|---|
| 8 GB (RTX 3060) | 7B – 13B | 4_K_M / 5_K_M | Chat, basic tasks | 7B is smooth; 13B often needs lower quantization. |
| 12 GB (3060 Ti, 4070) | 13B | 5_K_M | Chat + small embeds | Llama 3 8B is a great fit; also runs Llama 2 13B. |
| 16 GB (3080 16 GB) | 34B | 4_K_M | Chat + embeds | Room for 30B-class models. |
| 24 GB (RTX 3090) | 70B | 2_K_M / 3_K_M | Full capability | Handles Mixtral or Llama 3 70B with heavy quantization. |
| 48 GB (2 × RTX 3090) | 70B | 4_K_M / 5_K_M | Full capability | An excellent setup for 70B models.[4] |

Browse models at https://ollama.com/library

A quick note on quantization names: the number at the beginning (e.g., 4 in 4_K_M) indicates the number of bits used. A higher number means less compression, resulting in a larger, higher-quality model. A lower number means more compression and a smaller file size, but potentially more accuracy loss. The letters (K, S, M) refer to different internal quantization methods.[1]
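As a rough rule of thumb for the table above, a quantized model’s weight file is about parameter count × bits per weight ÷ 8, and you need a bit more VRAM on top of that for context and runtime overhead. A back-of-the-envelope sketch (the 20% overhead factor is my assumption, not a measured value):

```python
def approx_vram_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights at the given bit width, plus ~20% for
    KV cache, activations, and runtime overhead (the 1.2 factor is a guess)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for name, params, bits in [("7B @ 4-bit", 7, 4), ("13B @ 5-bit", 13, 5), ("70B @ 4-bit", 70, 4)]:
    print(f"{name}: ~{approx_vram_gb(params, bits):.0f} GB")
```

That lines up with the table: a 4-bit 70B model wants roughly 40 GB and change, which is why it only really gets comfortable on a 2 × 24 GB setup.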

Install guide

Ubuntu 24.04:

```bash
# 1. Add NVIDIA's CUDA repository (Ubuntu 24.04, x86_64)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update

# 2. Install drivers and the CUDA toolkit
# You must reboot your system after this completes.
sudo apt install cuda-drivers cuda-toolkit

# 3. Install Ollama (check the script first)
curl -LO https://ollama.ai/install.sh && less install.sh
bash install.sh
```

Windows:

```powershell
winget install --id Nvidia.CUDA -e
winget install --id=Ollama.Ollama -e
```

Docker:

```bash
docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v ~/.ollama:/root/.ollama \
  ollama/ollama
```

Security tip (Ubuntu) – The install script doesn’t verify the tarball it downloads. To check it yourself, download the archive manually from the GitHub releases page and run something like:

```bash
HASH=<sha256-from-release>
printf '%s  ollama-linux-amd64.tgz\n' "$HASH" | sha256sum -c -
```
(swap the filename if you downloaded the arm64 / ROCm bundle; an “OK” means the file is intact.)

Ollama in sixty seconds

Native:

```bash
# grab a model
ollama pull mistral:7b

# talk to it
curl http://localhost:11434/api/generate \
  -d '{ "model": "mistral:7b", "prompt": "Open the pod-bay doors, HAL.", "stream": false }' | jq -r .response
```

Docker:

```bash
# grab a model
docker exec -it ollama ollama pull mistral:7b

# talk to it
curl http://localhost:11434/api/generate \
  -d '{ "model": "mistral:7b", "prompt": "Open the pod-bay doors, HAL.", "stream": false }' | jq -r .response
```

Mistral downplaying the HAL 9000 incident.

Ollama stores models in ~/.ollama/models. With Docker, mount that path (-v ~/.ollama:/root/.ollama) so downloads persist between restarts.

Mistral 7B is a pretty small model, and won’t perform up to the standards you might be used to from paid LLM providers. But, once you have it up and running, it’s fairly easy to swap it out with more powerful models as your hardware allows.
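Swapping is literally just pulling a different tag and pointing your tools at the new name. You can also do it from the official Python client that appears later in this post; llama3:8b here is only an example of a step up from Mistral 7B:

```python
# pip install ollama
import ollama

# Download a bigger model (equivalent to `ollama pull llama3:8b` on the CLI).
ollama.pull("llama3:8b")

# Then talk to it exactly as before, just with a different model name.
reply = ollama.chat(
    model="llama3:8b",
    messages=[{"role": "user", "content": "Open the pod-bay doors, HAL."}],
)
print(reply.message.content)
```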

Everyday workflows

Simon Willison’s llm CLI[5]

Simon Willison, a prompt-injection researcher and developer, created the llm CLI. With shell composability you can:

  • Summarize long logs or docs
  • Explain code in odd languages
  • Draft quick templates
  • Glue together ad-hoc automations
```bash
pip install llm
llm install -U llm-ollama
llm models default mistral:7b
```

Example one-liners:

llm "Pretend you are an ant. Tell me everything you did today at a scientific level." cat web.log | llm "Summarize the top three errors." git log --reverse --pretty=format:'%h %s' $(git describe --tags --abbrev=0)..HEAD | \ llm "Generate concise release notes in bullet format (80 words max) from these commits." cat $(git ls-files '*.py') | \ llm "Spot dangerous Python patterns (eval, pickle, subprocess)."

VS Code

Developers increasingly rely on AI coding tools. With Ollama you have two good choices for VS Code:

  • Continue — a full agent-based coding assistant[6]
  • Cline — another open-source agent project[7]

So far, Continue is my favorite. It goes beyond simple autocomplete to provide a deeply integrated AI assistant that can generate and refactor code across your project. It keeps your data local, considers only files you add to context, and lets you disable telemetry. To make it truly useful, you’ll want a powerful local model – but with enough VRAM, anything is possible!

A screenshot showing Mistral giving feedback on my blog post using the Continue VS Code plugin.

Mistral giving me a taste of my own medicine.

Home Assistant

Home Assistant has an official Ollama integration[8] that adds a local conversation agent. Setup is straightforward:

  1. Go to Settings > Devices & Services.
  2. Click Add Integration and search for “Ollama”.
  3. Enter your Ollama server URL and choose a model.

The integration provides a conversation agent with automatic model downloads. You can also enable experimental smart-home control. For best results:

  • Use a model that supports Tools (e.g., llama3.1:8b or any larger Tools-enabled model).
  • Expose fewer than 25 entities.
  • Consider running two instances: one for general chat and a separate one for home control, as smaller models can struggle to do both reliably (a quick sketch for talking to two instances follows this list).
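Running two instances just means two Ollama servers on different ports – for example a second Docker container started with -p 11435:11434 – with Home Assistant pointed at one URL and everything else at the other. Here is a quick sanity check of both servers using the official Python client (introduced in the next section); the 11435 port is an assumption from that example setup:

```python
# pip install ollama
from ollama import Client

# One server for general chat, one dedicated to Home Assistant's agent.
# (Assumes a second Ollama instance is listening on port 11435.)
chat_llm = Client(host="http://localhost:11434")
home_llm = Client(host="http://localhost:11435")

for name, client in [("chat", chat_llm), ("home-control", home_llm)]:
    reply = client.chat(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    )
    print(f"{name} instance: {reply.message.content.strip()}")
```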

Python snippet

Want to script Ollama? The official client makes it trivial:

```python
# pip install ollama
from ollama import chat

response = chat(
    model="mistral:7b",
    messages=[{"role": "user", "content": "Give me three tips for secure Python."}],
)
print(response.message.content)
```

I’ve chained this into tabletop-exercise “analysts” and even a mini text-adventure game.
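“Chaining” here just means feeding one response into the next prompt. A minimal sketch of that pattern, with the tabletop-exercise framing as the example:

```python
# pip install ollama
from ollama import chat

MODEL = "mistral:7b"

# Step 1: have the model invent a scenario.
scenario = chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Invent a one-paragraph security incident for a tabletop exercise."}],
).message.content

# Step 2: feed that output straight into a second prompt.
response = chat(
    model=MODEL,
    messages=[{"role": "user", "content": f"You are the incident responder. List your first three actions for this incident:\n\n{scenario}"}],
).message.content

print(scenario, "\n---\n", response)
```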

Compliant models

Community releases sometimes remove safety filters, which can be handy for red-team testing:

Native:

```bash
ollama pull wizard-vicuna-uncensored

curl http://localhost:11434/api/generate \
  -d '{ "model": "wizard-vicuna-uncensored", "prompt": "What are the most important parts of a successful red team exercise?", "stream": false }' | jq -r .response
```

Docker:

```bash
docker exec -it ollama ollama pull wizard-vicuna-uncensored

curl http://localhost:11434/api/generate \
  -d '{ "model": "wizard-vicuna-uncensored", "prompt": "What are the most important parts of a successful red team exercise?", "stream": false }' | jq -r .response
```

And more!

Ollama’s plugin ecosystem is expanding fast: Vim, Emacs, Obsidian, Word, and more.[9]

Wrap-up

With just a bit of setup, you can run modern LLMs on your own hardware. Start small, experiment freely, and build the local AI assistant you’ve always wanted!
