Tutorial · 2026-05-27

Running Local LLMs Made Easy: A Practical Ollama Setup Guide

Stop paying per-token fees. Here's how to run powerful LLMs on your own hardware in under 10 minutes, with the workflows that actually matter once you're up and running.

Let's be honest — calling an API for every little test is a pain. Latency, costs, rate limits, and the nagging feeling that you're shipping your prompts somewhere you shouldn't. Ollama fixes that.

It drops a full LLM runtime on your machine, no GPU rack required (though it helps), and gets you from zero to running in about as long as it takes to grab a coffee.

What You'll Need

  • A Mac, Linux box, or Windows PC (there is a native Windows installer; WSL2 also works)
  • 8GB RAM minimum (16GB recommended for 7B models, 32GB for 13B+)
  • Optionally: a GPU. NVIDIA cards with CUDA are the common case; Apple Silicon and supported AMD cards are also accelerated
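
To sanity-check those RAM figures against a specific model, a back-of-envelope estimate helps: weight memory is roughly parameter count times bits per weight. The sketch below is only a rule of thumb. The function name and the 20% overhead factor for the KV cache and runtime buffers are assumptions made for illustration, not numbers published by Ollama.

```python
def estimate_model_memory_gb(params_billion: float,
                             bits_per_weight: float = 4.0,
                             overhead: float = 1.2) -> float:
    """Rule-of-thumb memory estimate in GB (1 GB = 1e9 bytes).

    overhead=1.2 is an assumed fudge factor for KV cache and buffers,
    not a measured value.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9


# Compare a few model sizes at 4-bit quantization.
for size in (3, 7, 13):
    print(f"{size}B at 4-bit: ~{estimate_model_memory_gb(size):.1f} GB")
```

Real usage also depends on context length and on what else is running, so treat the result as a lower bound and leave headroom.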

Step 1: Install It

macOS

brew install ollama

Linux (one-liner)

curl -fsSL https://ollama.com/install.sh | sh

Verify

ollama --version

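That's it. No Docker, no Python environment juggling, no config files to hand-craft. If a later command can't connect to the server, start it yourself with `ollama serve`; the Linux install script and the desktop apps normally start it for you.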

Step 2: Pull a Model

The lightweight workhorse — runs on most hardware

ollama pull llama3.2

Want something beefier? Step up to the 8B Llama 3.1

ollama pull llama3.1:8b

Mistral if you prefer that vibe

ollama pull mistral

Check what's on disk

ollama list

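Model files land in `~/.ollama/models/` by default (the Linux service install typically uses `/usr/share/ollama/.ollama/models`), and the `OLLAMA_MODELS` environment variable moves them elsewhere. A single model can take several gigabytes, so plan your storage accordingly.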
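
To see how much space pulled models are actually taking, you can total the file sizes under the models directory. This is a minimal sketch using only the standard library; `dir_size_bytes` is a hypothetical helper name, and the path assumes the default per-user location mentioned above.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root is missing)."""
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Assumed default location; adjust if you set OLLAMA_MODELS.
    models_dir = Path.home() / ".ollama" / "models"
    print(f"{models_dir}: {dir_size_bytes(models_dir) / 1e9:.2f} GB")
```

`ollama list` reports per-model sizes; this just gives you the total on disk in one number.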

Step 3: Run It

ollama run llama3.2

You're dropped into an interactive prompt. Query it, paste code, ask questions. Hit `Ctrl+D` or type `/bye` to exit.

For one-off calls without the REPL:

ollama run llama3.2 "Explain async/await in three sentences"
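
If you want to script those one-off calls, one option is to shell out to the CLI. The sketch below assumes `ollama` is on your PATH and the server is running; `build_run_command` and `run_once` are hypothetical helper names, and the 120-second timeout is an arbitrary choice.

```python
import subprocess


def build_run_command(model: str, prompt: str) -> list[str]:
    """Argument list for a one-off `ollama run MODEL PROMPT` call."""
    return ["ollama", "run", model, prompt]


def run_once(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Run the CLI once and return its stdout. Needs Ollama installed and running."""
    result = subprocess.run(
        build_run_command(model, prompt),
        capture_output=True,
        text=True,
        check=True,       # raise CalledProcessError on a non-zero exit code
        timeout=timeout,
    )
    return result.stdout.strip()


# With Ollama running:
# print(run_once("llama3.2", "Explain async/await in three sentences"))
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains quotes or newlines.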

Step 4: Wire It Into Your Workflow

The real value isn't the CLI; it's calling the model from code. Here's an example using the official `ollama` Python package (`pip install ollama`):

import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[
        {"role": "system", "content": "You are a terse code reviewer."},
        {"role": "user", "content": "Why is this function slow?"},
    ]
)

print(response['message']['content'])
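
For longer answers you usually don't want to wait for the whole response. The `ollama` package can stream chunks when you pass `stream=True`. The sketch below shows one way to print chunks as they arrive and still keep the full text; `collect_stream` and `stream_chat` are hypothetical helper names, and the import is deferred so the helper can be used without the package installed.

```python
from typing import Iterable, Mapping


def collect_stream(chunks: Iterable[Mapping]) -> str:
    """Print each partial message as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)


def stream_chat(model: str, prompt: str) -> str:
    """Stream a reply from a running Ollama server."""
    import ollama  # third-party: pip install ollama

    stream = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return collect_stream(stream)


# With the Ollama server running:
# stream_chat("llama3.2", "Why is this function slow?")
```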

Or via the REST API that Ollama spins up automatically:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "What is retrieval-augmented generation?"}]
}'
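
The same request works from Python with nothing but the standard library, which is handy when you don't want an extra dependency. This is a sketch under a few assumptions: the server is on the default `localhost:11434`, `"stream": false` is set so the reply arrives as a single JSON object, and `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local address


def build_chat_request(model: str, prompt: str,
                       url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a POST request for a single-turn, non-streaming chat."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str) -> str:
    """Send the request and return the reply text. Needs a running server."""
    with urllib.request.urlopen(build_chat_request(model, prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["message"]["content"]


# With the Ollama server running:
# print(chat("llama3.2", "What is retrieval-augmented generation?"))
```

Without `"stream": false`, the endpoint sends a sequence of newline-delimited JSON chunks, so you would read and parse the response line by line instead.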

This turns your local LLM into a backend service other tools can hit. That's where it gets interesting.

What Actually Matters Once You're Running

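  • **Context windows vary by model.** llama3.2 supports up to 128K tokens of context, but Ollama loads models with a much smaller default window unless you raise `num_ctx`. Longer context means more memory, and on consumer hardware performance tends to degrade past roughly 32K to 48K tokens.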
  • **Quantization matters more than model size.** A well-quantized 7B model often beats a bloated 13B for daily use. Start with the smaller one.
  • **Modelfiles let you lock in prompts.** Think of them as a committed system prompt. `ollama create review-gpt -f Modelfile` builds a reusable configuration.
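
To make the Modelfile point concrete, here is a small sketch that generates one pinning a base model, a context size, and a system prompt. `render_modelfile` is a hypothetical helper, and the `num_ctx` value of 8192 is an arbitrary example rather than a recommendation; `FROM`, `PARAMETER`, and `SYSTEM` are standard Modelfile instructions.

```python
def render_modelfile(base_model: str, system_prompt: str, num_ctx: int = 8192) -> str:
    """Return Modelfile text that pins a base model, context size, and system prompt."""
    return (
        f"FROM {base_model}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
        f'SYSTEM """{system_prompt}"""\n'
    )


# Print the file contents; redirect to a file named Modelfile to use them.
print(render_modelfile("llama3.2", "You are a terse code reviewer."))
```

Save the output as `Modelfile`, run `ollama create review-gpt -f Modelfile`, and then `ollama run review-gpt` starts with that prompt and context size already applied.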

When to Use This vs. an API

Ollama shines for iteration, local experiments, data you don't want leaving your machine, and anything where network round-trip latency matters. When you need guaranteed uptime, the latest frontier model, or the strongest multimodal capabilities, an API still wins.

But for the everyday build-test-tweak loop? Your own GPU is faster and cheaper than it sounds.

*Hardware requirements scale with model size. Start small, benchmark your workload, and size up only when you have evidence the bigger model earns its compute.*
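
One simple way to benchmark is to compute generation speed from the timing fields Ollama returns with each completed response: `eval_count` (tokens generated) and `eval_duration` (nanoseconds). The helper below is a minimal sketch with a hypothetical name; it works on the dict-style response from either the Python package or the REST API.

```python
from typing import Mapping


def tokens_per_second(response: Mapping) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / (duration_ns / 1e9)


# With a response from ollama.chat(...) or a non-streaming /api/chat call:
# print(f"{tokens_per_second(response):.1f} tokens/s")
```

Run the same prompt against two model sizes and compare the numbers before deciding the bigger one is worth it.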