How to Set Up a Local LLM with Ollama for Development

Run capable LLMs on your own hardware in under 10 minutes. Privacy, speed, and zero per-token costs — here's how to set up Ollama for development.

Why Run LLMs Locally?

Running large language models locally gives you privacy, speed, and cost control. No API calls, no data leaving your machine, no per-token fees. Ollama makes this shockingly easy.

This guide gets you from zero to running a local model in under 10 minutes.

Step 1: Install Ollama

macOS / Linux:

bash

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer at ollama.com/download.

Verify it worked:

bash

ollama --version

Step 2: Pull Your First Model

Ollama's model library is available at ollama.com/library. Popular starting points:

llama3.2 — General purpose, good balance of speed and capability
codellama — Fine-tuned for code generation and explanation
mistral — Compact but surprisingly capable

Pull one with:

bash

ollama pull llama3.2

First pull downloads the model weights (several GB depending on size). Subsequent runs are near-instant.

Step 3: Run the Model

bash

ollama run llama3.2

You're in an interactive REPL. Type prompts, hit Enter, get responses. Exit with /bye or Ctrl+C.

Step 4: Use the API from Code

Ollama runs a local REST API on port 11434 by default. This makes it trivial to integrate into any project.

python

import requests
response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.2",
    "prompt": "Explain async/await in Python",
    "stream": False
})
print(response.json()["response"])

For a streaming response:

python

import requests
stream = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.2",
    "prompt": "Write a Python decorator",
}, stream=True)
for line in stream.iter_lines():
    if line:
        print(line.decode(), end="", flush=True)

Step 5: Keep Multiple Models

You can have several models installed simultaneously. Switch between them by name:

bash

ollama run codellama  # Switch to a code-specialized model

List what's installed:

bash

ollama list

Hardware Considerations

LLMs need RAM. As a rough guide:

7B parameter models — ~4-6 GB RAM minimum
13B parameter models — ~8-12 GB RAM minimum
Quantized models (q4_0, q5_1) use less memory at slight quality cost

MacBooks with M-series chips run these efficiently. On x86 Linux, an Nvidia GPU significantly speeds things up but isn't required.

A Real Workflow Example

Here's how I use this daily: I keep Ollama running in the background, and my editor shortcuts send selected code to the local API for explanation or refactoring suggestions. No context windows, no API costs, instant responses.

bash

# Pipe code directly to the model
echo "def fibonacci(n):" | ollama run llama3.2

What's Next

Once you're comfortable with the basics, explore:

Custom system prompts via environment variables
Modelfile configurations for fine-tuned control
Ollama's OpenAI-compatible server mode for drop-in API replacement

The barrier to running capable LLMs locally has never been lower. Give it 20 minutes and you'll wonder why you waited.