Tutorial · 2026-05-27

Running Local LLMs Made Easy: A Practical Ollama Setup Guide

Stop paying per-token fees. Here's how to run powerful LLMs on your own hardware in under 10 minutes, with the workflows that actually matter once you're up and running.

Let's be honest — calling an API for every little test is a pain. Latency, costs, rate limits, and the nagging feeling that you're shipping your prompts somewhere you shouldn't. Ollama fixes that.

It drops a full LLM runtime on your machine, no GPU rack required (though it helps), and gets you from zero to running in about as long as it takes to grab a coffee.

What You'll Need

  • A Mac, Linux box, or Windows with WSL2
  • 8GB RAM minimum (16GB recommended for 7B models, 32GB for 13B+)
  • Optionally: an NVIDIA GPU with CUDA support
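The RAM figures above follow from simple arithmetic on weight size. Here is a back-of-envelope sketch, assuming memory is dominated by the weights (parameter count times bits per weight). It ignores the KV cache and runtime overhead, which add more on top, and the function name is illustrative:

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Compare a few common sizes at 4-, 8-, and 16-bit precision.
    for size in (3, 7, 13):
        for bits in (4, 8, 16):
            print(f"{size}B @ {bits}-bit: ~{weight_memory_gb(size, bits):.1f} GB")
```

By this estimate a 7B model needs about 3.5 GB for its weights at 4 bits and about 14 GB at 16 bits, so treat the numbers as a floor rather than a budget.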

Step 1: Install It

```bash
# macOS
brew install ollama

# Linux (one-liner)
curl -fsSL https://ollama.com/install.sh | sh

# Verify
ollama --version
```

That's it. No Docker, no Python environment juggling, no config files to hand-craft.

Step 2: Pull a Model

```bash
# The lightweight workhorse: runs on most hardware
ollama pull llama3.2

# Want something beefier? (llama3.2 only ships in 1b and 3b sizes)
ollama pull llama3.1:8b

# Mistral if you prefer that vibe
ollama pull mistral

# Check what's on disk
ollama list
```

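Model files land in ~/.ollama/models/ by default (the Linux service install uses /usr/share/ollama/.ollama/models instead), and the OLLAMA_MODELS environment variable overrides the location. Each model takes gigabytes, so plan your storage accordingly.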
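To see how much space your models actually take, a small sketch like the following can help. It assumes the per-user default path unless the OLLAMA_MODELS environment variable is set, and the helper names are illustrative:

```python
import os
from pathlib import Path


def models_dir() -> Path:
    """Return OLLAMA_MODELS if set, else the per-user default location."""
    override = os.environ.get("OLLAMA_MODELS")
    return Path(override) if override else Path.home() / ".ollama" / "models"


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if it does not exist)."""
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    root = models_dir()
    print(f"{root}: {dir_size_bytes(root) / 1e9:.2f} GB")
```

Running it before and after a pull gives a quick read on what a given model costs on disk.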

Step 3: Run It

```bash
ollama run llama3.2
```

You're dropped into an interactive prompt. Query it, paste code, ask questions. Hit Ctrl+D to exit.

For one-off calls without the REPL:

```bash
ollama run llama3.2 "Explain async/await in three sentences"
```

Step 4: Wire It Into Your Workflow

The real value isn't the CLI; it's calling the model from code. Here's a Python example using the official ollama package (pip install ollama):

```python
import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[
        {"role": "system", "content": "You are a terse code reviewer."},
        {"role": "user", "content": "Why is this function slow?"},
    ],
)

print(response['message']['content'])
```
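For longer answers it can be nicer to print text as it arrives. Here is a minimal sketch, assuming the ollama Python package is installed and a local server is running. Passing stream=True makes ollama.chat yield partial chunks; the helper names are illustrative:

```python
from typing import Iterable


def collect_stream(chunks: Iterable, echo: bool = False) -> str:
    """Concatenate the content of streamed chat chunks into one string."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        if echo:
            print(piece, end="", flush=True)
        parts.append(piece)
    return "".join(parts)


def stream_reply(prompt: str, model: str = "llama3.2") -> str:
    """Stream a reply from the local model, echoing it as it is generated."""
    import ollama  # imported lazily so collect_stream works without the package

    chunks = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return collect_stream(chunks, echo=True)
```

Calling stream_reply("Explain async/await in three sentences") prints the answer incrementally and returns the full text once generation finishes.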

Or via the REST API that Ollama spins up automatically:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "What is retrieval-augmented generation?"}]
}'
```

This turns your local LLM into a backend service other tools can hit. That's where it gets interesting.
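Any language with an HTTP client can talk to that endpoint. Here is a sketch using only the Python standard library, assuming the server is on the default localhost:11434. It sets "stream" to false because /api/chat otherwise replies with a stream of newline-delimited JSON chunks; the function names are illustrative:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build a POST request for a single, non-streamed chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a chunk stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "llama3.2", timeout: float = 120.0) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With the server running, chat("What is retrieval-augmented generation?") returns the reply as a plain string. Keeping request construction separate from the network call makes the payload easy to inspect or test without a model loaded.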

What Actually Matters Once You're Running

  • Context windows vary by model. llama3.2 supports up to 128K tokens, but Ollama loads models with a much smaller default window unless you raise num_ctx. Longer context means more memory, and in practice performance degrades past 32-48K on consumer hardware.
  • Quantization matters more than model size. A well-quantized 7B model often beats a bloated 13B for daily use. Start with the smaller one.
  • Modelfiles let you lock in prompts. Think of them as a committed system prompt. Running ollama create review-gpt -f Modelfile builds a reusable configuration that you can then start with ollama run review-gpt.
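To make the Modelfile bullet concrete, here is a small sketch that generates one from Python. FROM, PARAMETER, and SYSTEM are Modelfile instructions and num_ctx sets the context window. The system prompt, parameter values, and helper name are illustrative assumptions:

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Render a minimal Modelfile: base model, parameters, then system prompt."""
    lines = [f"FROM {base}"]
    for name, value in params.items():
        lines.append(f"PARAMETER {name} {value}")
    # Triple quotes let the system prompt span lines; this sketch does not
    # escape a prompt that itself contains triple quotes.
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        base="llama3.2",
        system="You are a terse code reviewer.",
        params={"temperature": 0.2, "num_ctx": 8192},
    )
    print(text)  # save as Modelfile, then: ollama create review-gpt -f Modelfile
```

Generating the file from code is optional; the point is that the base model, sampling parameters, and system prompt live in one small text file you can commit alongside your project.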

When to Use This vs. an API

Ollama shines for iteration, local experiments, data you don't want leaving your machine, and anything where latency matters. When you need guaranteed uptime, the latest model, or multimodal capabilities, an API still wins.

But for the everyday build-test-tweak loop? Your own GPU is faster and cheaper than it sounds.


Hardware requirements scale with model size. Start small, benchmark your workload, and size up only when you have evidence the bigger model earns its compute.
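One way to gather that evidence: the final response from /api/chat (and /api/generate) reports eval_count, the number of generated tokens, and eval_duration in nanoseconds. Here is a sketch that turns those into tokens per second. The sample values are made up purely to show the arithmetic:

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response's eval_count and eval_duration."""
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / (duration_ns / 1e9)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not a measured result.
    sample = {"eval_count": 100, "eval_duration": 2_000_000_000}
    print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Run the same prompt against two model sizes and compare the figures before deciding the larger one is worth its memory.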
