Stop paying per-token fees for development work. Here's how to get a production-quality LLM running on your own machine in under 20 minutes, with the exact setup I use every day.

How to Set Up a Local LLM in 20 Minutes with Ollama

Let me cut through the noise: running a local LLM isn't just for enthusiasts anymore. For development work (prompt iteration, testing, batch processing, anything where you're making the same kind of API call over and over), local inference has no per-token cost, no rate limits, no network round trip, and gives you control that hosted models can't.

The setup I use every day is Ollama. It's the cleanest way to get a real LLM running on your machine with zero friction. Here's exactly how to do it.

What You're Getting Into

Ollama is a single binary that runs models locally. No cloud configuration, no Docker complexity, no API keys. You install it, you tell it which model to pull, and you have a local endpoint that speaks the OpenAI chat completions format. Most code written against the OpenAI API works with your local model, though not every endpoint or parameter is supported.

This matters because the workflow is: write code against the OpenAI API → swap the base URL to localhost and the model name to a local one → the rest of your code stays the same. I've been using this for six months. In my setup the latency is lower, the per-request cost is zero, and I can run iterations overnight without burning budget.

Step 1: Install Ollama

On macOS:

bash

brew install ollama

On Linux:

bash

curl -fsSL https://ollama.com/install.sh | sh

On Windows: download the installer from ollama.com/download.

That's it. No dependencies, no configuration files, no environment variables yet.

Step 2: Pull a Model

The model choice matters more than people admit. Here's my practical breakdown:

Llama 3.1 8B — My daily driver. Good enough for most development tasks, fast on consumer hardware, minimal memory footprint. If you're on an M-series Mac or a machine with 16GB+ RAM, start here.

Llama 3.1 70B — If you need reasoning quality closer to GPT-4 and you have the hardware (48GB+ RAM or a high-end GPU). Slower, more expensive to run, but meaningfully better at complex chain-of-thought tasks.

Mistral 7B — Worth trying if Llama feels off for your use case. Different training data, different outputs. Sometimes the same prompt hits differently with a different base model.

Qwen 2.5 72B — If you need strong code generation and you're on a serious compute budget. I've seen it outperform Llama significantly on TypeScript and Python tasks.
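
The hardware numbers above come from simple arithmetic: the memory needed for the weights is roughly parameter count times bits per weight. Here's a back-of-the-envelope sketch. It assumes about 4 bits per weight, which I believe is in the range of Ollama's default quantized tags (check the model page to be sure), and it ignores the KV cache and runtime overhead, so real usage will be higher. The function name is my own.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Rough size of the model weights in decimal gigabytes.

    Ignores KV cache, context length, and runtime overhead,
    so actual memory use will be higher than this number.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Sizes of the models discussed above, in billions of parameters.
    for name, size in [("8B", 8), ("70B", 70), ("72B", 72)]:
        print(
            f"{name}: ~{estimate_weights_gb(size):.0f} GB at 4-bit, "
            f"~{estimate_weights_gb(size, 16):.0f} GB at 16-bit"
        )
```

The point of the estimate is the order of magnitude: an 8B model fits comfortably in 16GB, while the 70B-class models need tens of gigabytes before you've generated a single token.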

To pull a model:

bash

ollama pull llama3.1

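With no tag, this pulls the model's default tag, which for llama3.1 is the 8B version. For the other sizes, add the tag explicitly, for example ollama pull llama3.1:70b. The first pull takes a while because Ollama downloads the model weights: several gigabytes for an 8B model, tens of gigabytes for the 70B-class ones. After that, the model loads from local storage in seconds.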
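
If you want a script to confirm the pull finished before it starts work, the Ollama server exposes the list of installed models over HTTP. This is a minimal sketch using only the standard library. It assumes the server is running at the default address and that GET /api/tags returns a JSON object with a "models" list whose entries have a "name" field, which is how the Ollama API docs describe it. The helper names are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask the running Ollama server which models are on disk."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))


def is_installed(wanted: str, names: list[str]) -> bool:
    """True if `wanted` is installed; a name without a tag means ':latest'."""
    if ":" not in wanted:
        wanted += ":latest"
    return wanted in names


if __name__ == "__main__":
    names = list_local_models()
    print("installed:", names)
    print("llama3.1 ready:", is_installed("llama3.1", names))
```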

Step 3: Run It

bash

ollama run llama3.1

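This opens an interactive terminal session. It talks to a background Ollama server, which the desktop app and the Linux installer normally start for you. If the server isn't running (for example, after a Homebrew install), start it yourself: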

bash

ollama serve

The server listens on http://localhost:11434 by default and exposes OpenAI-compatible endpoints under /v1. Point any OpenAI-compatible client at it by setting the base URL:

python

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but ignored
)
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain async/await in Python"}]
)
print(response.choices[0].message.content)

This is the part that makes Ollama genuinely useful. You don't rewrite your code. You change the endpoint and the model name. Everything else is the same.
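
One way to make the swap painless is to read the endpoint from the environment instead of hardcoding it. The sketch below is an assumption about how you might wire that up, not something Ollama provides: the variable names LLM_BACKEND, LOCAL_MODEL, and HOSTED_MODEL are hypothetical, and the hosted URL is OpenAI's standard API base.

```python
import os

LOCAL_BASE_URL = "http://localhost:11434/v1"
HOSTED_BASE_URL = "https://api.openai.com/v1"


def client_settings(env=None) -> dict:
    """Return base_url, api_key, and model for the local or hosted backend.

    LLM_BACKEND, LOCAL_MODEL, and HOSTED_MODEL are hypothetical variable
    names chosen for this example.
    """
    env = os.environ if env is None else env
    backend = env.get("LLM_BACKEND", "local")
    if backend == "local":
        return {
            "base_url": LOCAL_BASE_URL,
            "api_key": "ollama",  # required by the client, ignored by Ollama
            "model": env.get("LOCAL_MODEL", "llama3.1"),
        }
    if backend == "hosted":
        return {
            "base_url": HOSTED_BASE_URL,
            "api_key": env["OPENAI_API_KEY"],
            "model": env["HOSTED_MODEL"],  # no default: choose it explicitly
        }
    raise ValueError(f"unknown LLM_BACKEND: {backend!r}")


if __name__ == "__main__":
    print(client_settings())
```

You would then pass settings["base_url"] and settings["api_key"] to the OpenAI client and settings["model"] to each request, so switching backends is a one-variable change.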

Step 4: GPU Acceleration (This Matters)

On macOS, Ollama automatically uses Metal GPU acceleration on M-series chips. You don't configure anything. On Linux with an NVIDIA GPU:

bash

ollama run llama3.1  # runs with GPU acceleration if available

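If you're on Linux and not seeing GPU usage, first confirm the NVIDIA driver itself works (nvidia-smi should list your GPU). Running ollama ps while a model is loaded shows whether it's on the GPU or the CPU. If the driver is fine but Ollama still falls back to CPU, re-run the install script, which checks for an NVIDIA GPU during setup: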

bash

curl -fsSL https://ollama.com/install.sh | sh

Ollama auto-detects your GPU. In my experience the speed difference is 5-10x on large models, though the exact number depends on your hardware. If you're running 70B models on CPU alone, expect it to be too slow for interactive work.
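
Rather than take my speedup numbers on faith, you can measure generation speed on your own machine. This sketch uses Ollama's native /api/generate endpoint (not the OpenAI-compatible one) because, per the Ollama API docs, its final response includes eval_count (tokens generated) and eval_duration (nanoseconds). It assumes the server is running on the default port with llama3.1 already pulled; the function names are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address


def tokens_per_second(result: dict) -> float:
    """Generation speed from a non-streaming /api/generate response.

    eval_count is the number of generated tokens;
    eval_duration is the generation time in nanoseconds.
    """
    return result["eval_count"] * 1e9 / result["eval_duration"]


def benchmark(model: str, prompt: str, base_url: str = OLLAMA_URL) -> float:
    """Run one prompt through the model and return tokens per second."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return tokens_per_second(json.load(resp))


if __name__ == "__main__":
    speed = benchmark("llama3.1", "Explain async/await in Python")
    print(f"{speed:.1f} tokens/s")
```

Run it once with the GPU in use and once on CPU, and you have your own number instead of mine.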

Step 5: Manage Models Like a Pro

One model running is fine. Three models for different tasks is better. Ollama has a built-in model management system:

bash

# List what's installed
ollama list
# Remove a model you don't need (a different one, so the copy below still works)
ollama rm mistral
# Copy a model with a custom tag
ollama cp llama3.1 llama3.1-codedev

I keep three models installed at all times: a fast 8B for quick tasks, a medium 13B for development work, and a large 70B for complex reasoning. I swap between them depending on what I'm doing.
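
In scripts, I find it easier to name the task than to remember the tag. A minimal sketch of that idea follows. The task labels and the mapping are hypothetical, and the tags are just the models discussed in Step 2, so substitute whatever you actually have installed.

```python
# Hypothetical mapping from task type to an installed Ollama model tag.
MODEL_FOR_TASK = {
    "quick": "llama3.1:8b",
    "code": "qwen2.5:72b",
    "reasoning": "llama3.1:70b",
}
DEFAULT_MODEL = "llama3.1:8b"  # fall back to the small, fast model


def pick_model(task: str) -> str:
    """Return the model tag for a task, or the default for unknown tasks."""
    return MODEL_FOR_TASK.get(task, DEFAULT_MODEL)


if __name__ == "__main__":
    for task in ("quick", "code", "reasoning", "something-else"):
        print(f"{task}: {pick_model(task)}")
```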

The Security Implication Nobody Talks About

Local models mean your data stays on your machine. This isn't paranoia. It's a real constraint for anyone working with proprietary code, customer data, or anything you don't want in a third party's logs. When I work with sensitive prompts, I run them locally. No exceptions.

The tradeoff is real: local models are less capable than frontier models on hard tasks. But for the majority of development work — drafting emails, writing tests, explaining code, debugging — the gap is small enough that the privacy benefit wins.
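
A rule like "sensitive prompts stay local" is easy to break by accident when the base URL comes from configuration. One way to enforce it is to check the endpoint before sending anything. This is a sketch under my own assumptions: the function names are hypothetical, and treating the hostname "localhost" as local trusts your hosts file.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url: str) -> bool:
    """True if the URL's host is 'localhost' or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a hostname other than localhost, e.g. a hosted API


def require_local(url: str) -> str:
    """Return the URL unchanged, or raise if it points off this machine."""
    if not is_loopback_url(url):
        raise RuntimeError(f"refusing to send a sensitive prompt to {url}")
    return url


if __name__ == "__main__":
    print(require_local("http://localhost:11434/v1"))
```

Calling require_local on the base URL right before building the client turns a policy into a check that fails loudly.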

What I Actually Use This For

My daily workflow: I have a terminal window permanently running ollama serve. When I need to test a prompt, I run it against the local model. When I need to process a batch of text, I write a Python script that calls the local endpoint. When I need the best quality and I'm willing to pay and wait, I switch to the hosted API.
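
The batch case looks roughly like this. It's a sketch, not my exact script: it uses only the standard library, assumes ollama serve is running on the default port with llama3.1 pulled, and the function names and sample documents are made up for illustration.

```python
import json
import urllib.request

CHAT_URL = "http://localhost:11434/v1/chat/completions"


def build_request(model: str, instruction: str, text: str) -> dict:
    """Build an OpenAI-style chat completion payload for one input text."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": instruction},
            {"role": "user", "content": text},
        ],
    }


def send(payload: dict, url: str = CHAT_URL) -> str:
    """POST the payload to the local endpoint and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


def process_batch(texts, instruction, model="llama3.1", send_fn=send):
    """Run the same instruction over every text, one request at a time."""
    return [send_fn(build_request(model, instruction, t)) for t in texts]


if __name__ == "__main__":
    docs = ["first document", "second document"]  # placeholder inputs
    for summary in process_batch(docs, "Summarize in one sentence."):
        print(summary)
```

Passing the sender in as send_fn keeps the loop testable without a running server, and it's also the seam where you'd swap in the hosted API for the hard cases.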

The local model handles 80% of my AI-assisted work. The hosted API handles the 20% where I need frontier quality.

This is the setup that actually scales in practice. Not the all-local dream, not the all-hosted reality. The split that makes sense given what you actually need.

Ollama is free. The models are free to download, though each comes with its own license terms. Your data stays on your machine. The only cost is the hardware you already own.