
How to Set Up a Local LLM in 20 Minutes with Ollama

Stop paying per-token fees for development work. Here's how to get a production-quality LLM running on your own machine in under 20 minutes, with the exact setup I use every day.

Let me cut through the noise: running a local LLM isn't just for enthusiasts anymore. For development work — prompt iteration, testing, batch processing, anything where you're making API calls more than once — local inference is faster, cheaper, and gives you control that hosted models can't.

The setup I use every day is Ollama. It's the cleanest way to get a real LLM running on your machine with zero friction. Here's exactly how to do it.

What You're Getting Into

Ollama is a single binary that runs models locally. No cloud configuration, no Docker complexity, no API keys. You install it, you tell it which model to pull, and you have a local endpoint that exposes an OpenAI-compatible API, so most code written against GPT models works with your local model.

This matters because the workflow is: write code against the OpenAI API → swap the base URL to localhost → everything just works. I've been using this for six months. With no network round trip the latency is often lower, there's no per-token cost, and I can run iterations overnight without burning budget.

Step 1: Install Ollama

On macOS:

brew install ollama

On Linux:

curl -fsSL https://ollama.com/install.sh | sh

On Windows: download the installer from [ollama.com/download](https://ollama.com/download).

That's it. No dependencies, no configuration files, no environment variables yet.

Step 2: Pull a Model

The model choice matters more than people admit. Here's my practical breakdown:

**Llama 3.1 8B** — My daily driver. Good enough for most development tasks, fast on consumer hardware, minimal memory footprint. If you're on an M-series Mac or a machine with 16GB+ RAM, start here.

**Llama 3.1 70B** — If you need reasoning quality closer to GPT-4 and you have the hardware (48GB+ RAM or a high-end GPU). Slower, more expensive to run, but meaningfully better at complex chain-of-thought tasks.

**Mistral 7B** — Worth trying if Llama feels off for your use case. Different training data, different outputs. Sometimes the same prompt hits differently with a different base model.

**Qwen 2.5 72B** — If you need strong code generation and you're on a serious compute budget. I've seen it outperform Llama significantly on TypeScript and Python tasks.
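The hardware guidance above can be sanity-checked with simple arithmetic: the weights take roughly parameters × bits per weight ÷ 8 bytes. Here's a minimal sketch. The 4-bit figure is my assumption about the quantized build you pull, and the estimate ignores the context cache and runtime overhead, so treat the numbers as lower bounds rather than exact requirements.

```python
def approx_weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Rough size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Parameter counts come from the model names above; 4-bit quantization is an assumption.
for name, size_b in [("Llama 3.1 8B", 8), ("Mistral 7B", 7),
                     ("Llama 3.1 70B", 70), ("Qwen 2.5 72B", 72)]:
    print(f"{name}: ~{approx_weight_gb(size_b):.1f} GB at 4-bit, "
          f"~{approx_weight_gb(size_b, 16):.0f} GB at 16-bit")
```

An 8B model at 4 bits comes out around 4 GB, which is why it's comfortable on a 16GB machine; a 70B model at the same precision is around 35 GB before any overhead.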

To pull a model:

ollama pull llama3.1

The first pull takes a while because Ollama downloads the model weights, which are several gigabytes (the default `llama3.1` tag is the 8B model). After that, the model loads from local storage in seconds.

Step 3: Run It

ollama run llama3.1

This opens an interactive terminal session. To run it as a server instead (skip this if the desktop app or the Linux service already has Ollama running in the background):

ollama serve

The server listens on `http://localhost:11434` by default and exposes OpenAI-compatible endpoints under `/v1`. You can talk to it with any OpenAI-compatible client by setting the base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but ignored
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain async/await in Python"}],
)

print(response.choices[0].message.content)

This is the part that makes Ollama genuinely useful. You don't rewrite your code. You just change the endpoint. Everything else is the same.
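One way to make the endpoint swap a config change instead of a code change is to read the backend from the environment. This is a sketch, not a prescribed setup: the variable names `LLM_BACKEND`, `HOSTED_API_KEY`, `HOSTED_MODEL`, and `HOSTED_BASE_URL` are hypothetical, and only the localhost URL and the `llama3.1` tag come from the steps above.

```python
import os

# Local values match the client setup above.
LOCAL = {"base_url": "http://localhost:11434/v1", "api_key": "ollama", "model": "llama3.1"}


def client_settings(env=None):
    """Pick endpoint settings from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "local") == "local":
        return dict(LOCAL)
    return {
        "base_url": env.get("HOSTED_BASE_URL", "https://api.openai.com/v1"),
        "api_key": env["HOSTED_API_KEY"],   # fails loudly if the key is missing
        "model": env["HOSTED_MODEL"],
    }


# Usage with the OpenAI client:
#   settings = client_settings()
#   client = OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
#   client.chat.completions.create(model=settings["model"], messages=[...])
```

Defaulting to local means a missing variable never sends a prompt to a hosted API by accident.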

Step 4: GPU Acceleration (This Matters)

On macOS, Ollama automatically uses Metal GPU acceleration on M-series chips. You don't configure anything. On Linux with an NVIDIA GPU:

ollama run llama3.1 # runs with GPU acceleration if available

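If you're on Linux and not seeing GPU usage, first check that the NVIDIA driver is installed and that `nvidia-smi` lists your card. Then re-run the Ollama install script, which checks for an NVIDIA GPU during setup:

curl -fsSL https://ollama.com/install.sh | sh

With a model loaded, `ollama ps` shows whether it is running on the GPU or the CPU.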

Ollama auto-detects your GPU. The speed difference is 5-10x on large models. If you're running 70B models without GPU acceleration, you're doing it wrong.
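If you want to measure the speedup on your own hardware rather than take my word for it, Ollama's native `/api/generate` endpoint reports timing fields with each non-streaming response. A minimal sketch, assuming the `eval_count` and `eval_duration` (nanoseconds) fields of that response and the default port; the helper names are mine:

```python
import json
import urllib.request


def tokens_per_second(result: dict) -> float:
    """Generation speed from the timing fields of an /api/generate response."""
    # eval_duration is reported in nanoseconds.
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate_once(prompt, model="llama3.1", base_url="http://localhost:11434"):
    """Send one non-streaming request to the local server and return the parsed JSON."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)


# With the server running:
#   result = generate_once("Explain async/await in Python")
#   print(f"{tokens_per_second(result):.1f} tokens/s")
```

Run it once with the GPU in use and once on CPU and compare the two numbers.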

Step 5: Manage Models Like a Pro

One model running is fine. Three models for different tasks is better. Ollama has a built-in model management system:

ollama list # list what's installed

ollama rm llama3.1 # remove a model you don't need

ollama cp llama3.1 llama3.1-codedev # copy a model with a custom tag
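The same inventory is available over HTTP, which is handy when a script needs to know what's installed before it picks a model. This sketch assumes the native `/api/tags` endpoint and its `name` and `size` (bytes) fields; the function names are mine.

```python
import json
import urllib.request


def summarize_models(payload: dict) -> list[tuple[str, float]]:
    """Return (name, size in GB) pairs from an /api/tags response, largest first."""
    pairs = [(m["name"], m["size"] / 1e9) for m in payload.get("models", [])]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def fetch_tags(base_url="http://localhost:11434"):
    """Ask the local server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)


# With the server running:
#   for name, size_gb in summarize_models(fetch_tags()):
#       print(f"{name}: {size_gb:.1f} GB")
```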

I keep three models installed at all times: a fast 8B for quick tasks, a medium 13B for development work, and a large 70B for complex reasoning. I swap between them depending on what I'm doing.

The Security Implication Nobody Talks About

Local models mean your data stays on your machine. This isn't paranoia; it's a real constraint for anyone working with proprietary code, customer data, or anything you don't want in a third party's logs. When I work with sensitive prompts, I run them locally. No exceptions.
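"No exceptions" is easier to keep when the code enforces it. Here's a sketch of a guard that refuses any endpoint that isn't on the loopback interface. `is_loopback_url` and `require_local` are hypothetical helpers, and the guard only inspects the URL, so it says nothing about proxies or port forwarding on your machine.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url: str) -> bool:
    """True if the URL points at localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a DNS name other than localhost: treat it as remote


def require_local(url: str) -> str:
    """Return the URL unchanged, or raise if it would leave the machine."""
    if not is_loopback_url(url):
        raise ValueError(f"refusing to send sensitive prompts to non-local endpoint: {url}")
    return url


# Usage: base_url = require_local("http://localhost:11434/v1")
```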

The tradeoff is real: local models are less capable than frontier models on hard tasks. But for the majority of development work — drafting emails, writing tests, explaining code, debugging — the gap is small enough that the privacy benefit wins.

What I Actually Use This For

My daily workflow: I have a terminal window permanently running `ollama serve`. When I need to test a prompt, I run it against the local model. When I need to process a batch of text, I write a Python script that calls the local endpoint. When I need the best quality and I'm willing to pay and wait, I switch to the hosted API.
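A batch script of that kind can be very small. This is a sketch under assumptions, not the exact script I run: it uses only the standard library, posts to the OpenAI-compatible `/v1/chat/completions` path on the default port, and the `run_batch` and `post_json` names are made up for illustration.

```python
import json
import urllib.request

URL = "http://localhost:11434/v1/chat/completions"


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.load(response)


def run_batch(texts, instruction, model="llama3.1", send=post_json):
    """Apply one instruction to each text, one request at a time, and collect the replies."""
    results = []
    for text in texts:
        payload = {"model": model, "messages": [
            {"role": "system", "content": instruction},
            {"role": "user", "content": text},
        ]}
        reply = send(URL, payload)
        results.append(reply["choices"][0]["message"]["content"])
    return results


# With the server running:
#   summaries = run_batch(["first document", "second document"], "Summarize in one sentence.")
```

The `send` parameter exists so the loop can be tested with a stub before you point it at a real model.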

The local model handles 80% of my AI-assisted work. The hosted API handles the 20% where I need frontier quality.

This is the setup that actually scales in practice. Not the all-local dream, not the all-hosted reality. The split that makes sense given what you actually need.

*Ollama is free. The models are free. Your data stays on your machine. The only cost is the hardware you already own.*