Stop paying per-token fees. Here's how to run powerful LLMs on your own hardware in under 10 minutes, with the workflows that actually matter once you're up and running.

Running Local LLMs Made Easy: A Practical Ollama Setup Guide

Let's be honest — calling an API for every little test is a pain. Latency, costs, rate limits, and the nagging feeling that you're shipping your prompts somewhere you shouldn't. Ollama fixes that.

It drops a full LLM runtime on your machine, no GPU rack required (though it helps), and gets you from zero to running in about as long as it takes to grab a coffee.

What You'll Need

A Mac, Linux box, or Windows PC (native installer or WSL2)
8GB RAM minimum (16GB recommended for 7B models, 32GB for 13B+)
Optionally: a GPU (NVIDIA with CUDA is the common case; Apple Silicon and supported AMD cards are also accelerated)
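
One way to sanity-check RAM figures like these is back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight. The sketch below is an approximation under that assumption only. It ignores the KV cache and runtime overhead, and the function name is illustrative rather than part of any Ollama tooling.

```python
def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the model weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: ignores the KV cache, activations, and runtime overhead,
    so real memory use will be higher than this figure.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare 4-bit quantized weights against 16-bit weights for two model sizes.
for params, bits in [(7, 4), (7, 16), (13, 4), (13, 16)]:
    print(f"{params}B at {bits}-bit: ~{approx_weight_gb(params, bits):.1f} GB of weights")
```

Leave headroom on top of these numbers for the operating system, the context window, and whatever else is running.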

Step 1: Install It

bash

# macOS
brew install ollama
# Linux (one-liner)
curl -fsSL https://ollama.com/install.sh | sh
# Verify
ollama --version

That's it. No Docker, no Python environment juggling, no config files to hand-craft.
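
If you'd rather verify the install from a script, a minimal check might look like the sketch below. It assumes the server listens on the default localhost:11434 address used later in this guide; the helper names are made up for illustration. Depending on how Ollama was installed, the server may need to be started first with `ollama serve`.

```python
import shutil
import urllib.error
import urllib.request


def ollama_on_path() -> bool:
    """True if an `ollama` executable is found on PATH."""
    return shutil.which("ollama") is not None


def server_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """True if something answers HTTP 200 at base_url, False on any connection error."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


print("binary on PATH:", ollama_on_path())
print("server reachable:", server_is_up())
```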

Step 2: Pull a Model

bash

# The lightweight workhorse — runs on most hardware
ollama pull llama3.2
# Want something beefier? llama3.2 only ships in 1B and 3B sizes, so step up to llama3.1
ollama pull llama3.1:8b
# Mistral if you prefer that vibe
ollama pull mistral
# Check what's on disk
ollama list

Model files land in ~/.ollama/models/ by default, and each one runs to several gigabytes, so plan your storage accordingly.
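
To see how much space your models take, something like the following sketch can help. It assumes the default location mentioned above, or the `OLLAMA_MODELS` environment variable if you've set one. Some Linux installs that run Ollama as a system service may keep models elsewhere, so treat the path as an assumption.

```python
import os
from pathlib import Path


def models_dir() -> Path:
    """Model directory: OLLAMA_MODELS if set, else the default ~/.ollama/models."""
    return Path(os.environ.get("OLLAMA_MODELS", Path.home() / ".ollama" / "models"))


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root does not exist)."""
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


size_gb = dir_size_bytes(models_dir()) / 1e9
print(f"{models_dir()}: {size_gb:.2f} GB")
```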

Step 3: Run It

bash

ollama run llama3.2

You're dropped into an interactive prompt. Query it, paste code, ask questions. Hit Ctrl+D to exit.

For one-off calls without the REPL:

bash

ollama run llama3.2 "Explain async/await in three sentences"
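
The same one-off call can be scripted by shelling out to the CLI. This is a minimal sketch, assuming `ollama` is on your PATH and the model is already pulled; `build_run_command` and `run_once` are hypothetical helper names.

```python
import subprocess


def build_run_command(model: str, prompt: str) -> list[str]:
    """Argument list for a one-off, non-interactive `ollama run` call."""
    return ["ollama", "run", model, prompt]


def run_once(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Run the CLI once and return its stdout; raises if the command fails."""
    result = subprocess.run(
        build_run_command(model, prompt),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout.strip()
```

Calling `run_once("llama3.2", "Explain async/await in three sentences")` would return the model's answer as a string. Passing the prompt as a list element rather than a shell string avoids quoting problems.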

Step 4: Wire It Into Your Workflow

The real value isn't the CLI — it's using it from code. Here's a Python example:

python

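import ollama  # official client: pip install ollama

# Requires a running Ollama server and a pulled llama3.2 model
response = ollama.chat(
    model='llama3.2',
    messages=[
        {"role": "system", "content": "You are a terse code reviewer."},
        {"role": "user", "content": "Why is this function slow?"},
    ],
)
# The reply text lives under message -> content
print(response['message']['content'])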

Or via the REST API that Ollama spins up automatically:

bash

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "What is retrieval-augmented generation?"}]
}'
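
Worth knowing: unless the request sets "stream" to false, this endpoint answers with a stream of newline-delimited JSON objects, each carrying a fragment of the reply. Here is a rough standard-library sketch of the same call that stitches those fragments together. The helper names are illustrative, and error handling is left out for brevity.

```python
import json
import urllib.request
from typing import Iterable

OLLAMA_URL = "http://localhost:11434/api/chat"  # same endpoint as the curl call


def build_chat_request(model: str, messages: list[dict], stream: bool = True) -> urllib.request.Request:
    """Build a POST request carrying the same JSON body as the curl example."""
    body = json.dumps({"model": model, "messages": messages, "stream": stream})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate message content from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk is marked done
    return "".join(parts)


def chat(model: str, messages: list[dict]) -> str:
    """Send a streaming chat request and return the full reply text."""
    with urllib.request.urlopen(build_chat_request(model, messages)) as resp:
        return join_stream(resp)  # an HTTP response iterates line by line
```

With a server running, `chat("llama3.2", [{"role": "user", "content": "What is retrieval-augmented generation?"}])` returns the assembled answer.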

This turns your local LLM into a backend service other tools can hit. That's where it gets interesting.

What Actually Matters Once You're Running

Context windows vary by model. llama3.2 supports 128K context, but Ollama runs models with a much smaller default window unless you raise num_ctx. Longer context also means more memory, and on consumer hardware performance tends to degrade past 32-48K in practice.
Quantization matters more than model size. A well-quantized 7B model often beats a bloated 13B for daily use. Start with the smaller one.
Modelfiles let you lock in prompts. Think of them as a committed system prompt. ollama create review-gpt -f Modelfile builds a reusable configuration.
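
To make the Modelfile idea concrete, here is a small sketch that renders one from Python. The FROM, PARAMETER, and SYSTEM instructions are standard Modelfile syntax; the helper function and the parameter values are assumptions chosen for illustration, not recommendations.

```python
def render_modelfile(base_model: str, system_prompt: str, parameters: dict[str, object]) -> str:
    """Render a Modelfile with FROM, PARAMETER, and SYSTEM instructions."""
    lines = [f"FROM {base_model}"]
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines) + "\n"


modelfile = render_modelfile(
    base_model="llama3.2",
    system_prompt="You are a terse code reviewer.",
    parameters={"num_ctx": 8192, "temperature": 0.2},  # illustrative values
)
print(modelfile)
```

Save the output as a file named Modelfile, then build it with `ollama create review-gpt -f Modelfile`. Setting num_ctx here is also one way to raise the context window for that configuration.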

When to Use This vs. an API

Ollama shines for iteration, local experiments, data you don't want leaving your machine, and anything where latency matters. When you need guaranteed uptime, the latest model, or multimodal capabilities, an API still wins.

But for the everyday build-test-tweak loop? Your own GPU is faster and cheaper than it sounds.

Hardware requirements scale with model size. Start small, benchmark your workload, and size up only when you have evidence the bigger model earns its compute.
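
For the benchmarking part, Ollama's final API response includes `eval_count` (tokens generated) and `eval_duration` (in nanoseconds), which is enough for a rough tokens-per-second figure. A minimal sketch follows; the sample numbers are made up to show the arithmetic and are not a measurement.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from a final Ollama response.

    Uses eval_count (tokens generated) and eval_duration (nanoseconds).
    Returns 0.0 if the duration is missing or zero.
    """
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) / (duration_ns / 1e9)


# Made-up numbers for illustration, not a measurement.
sample = {"eval_count": 120, "eval_duration": 4_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Run the same prompts against each candidate model and compare the figures before deciding whether the larger one is worth it.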