Stop paying per-token fees for tasks that don't need a frontier model. This guide walks through containerizing Ollama, choosing the right model for your hardware, and integrating it into your development workflow — with the gotchas they don't tell you in the README.

Running Ollama in Docker: A Practical Local LLM Setup Guide

Ollama made local LLM deployment accessible. Docker made it reproducible. Put them together and you have a dev setup that's fast to spin up, fast to tear down, and doesn't require you to trust a third party with your data.

This isn't a "hello world" guide. It's the setup you'd write for your team after you've already made the mistakes.

Prerequisites

You'll need Docker installed (obviously), and enough RAM to run the model you choose. A 7B model needs ~8GB of RAM minimum. A 70B model needs ~64GB. Don't try to run a 70B model on a laptop with 16GB — you will be sad.
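The RAM figures above follow from simple arithmetic on parameter count and bits per weight. Here is a rough sketch of that calculation; the function name and the 1.2 overhead factor (for context cache and runtime buffers) are assumptions for illustration, not numbers from Ollama, so treat the output as a ballpark only.

```python
def estimate_model_memory_gb(params_billions: float,
                             bits_per_weight: float,
                             overhead_factor: float = 1.2) -> float:
    """Ballpark memory (in GB, 10^9 bytes) needed to load a model.

    overhead_factor is an assumed fudge factor for the context cache and
    runtime buffers; real usage depends on context length and backend.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead_factor / 1e9


for label, params, bits in [
    ("8B at 4-bit", 8, 4),
    ("8B at 16-bit", 8, 16),
    ("70B at 4-bit", 70, 4),
]:
    print(f"{label}: ~{estimate_model_memory_gb(params, bits):.1f} GB")
```

Leave headroom on top of the estimate for the OS, Docker, and whatever else you have open.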

Step 1: Pull the Docker Image

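Ollama publishes official Docker images. The main image, ollama/ollama, runs on CPU out of the box and uses CUDA when you pass an NVIDIA GPU into the container. On Apple Silicon the image runs, but Docker on macOS can't pass the M-series GPU through to containers, so inference is CPU-only; if you want GPU acceleration on a Mac, run Ollama natively instead.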

bash

# The same image covers CPU-only and NVIDIA GPU setups
docker pull ollama/ollama:latest
# Verify it works (--rm removes the throwaway container afterwards)
docker run --rm ollama/ollama --version

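There is no separate CPU-only image to pull: the same image uses the GPU when one is passed through with --gpus all and falls back to the CPU otherwise. With GPU passthrough, inference runs at near-native speed. CPU-only works fine for smaller models on machines without discrete GPUs.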

Step 2: Run Ollama as a Container

The simplest startup:

bash

docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama

This gives you:

A daemon running in the background (-d)
Port 11434 exposed for API access
A named volume so your models survive container restarts

If you have an NVIDIA GPU and want to use it:

bash

docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama

The --gpus all flag requires the NVIDIA Container Toolkit on the host. If you haven't installed it, the Ollama Docker docs and NVIDIA's install guide have straightforward instructions for apt-based distros (Ubuntu, Debian) and yum/dnf-based ones.
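The container takes a moment to start accepting requests, so scripts that launch it and immediately call the API can fail. Below is a minimal readiness check, assuming the server answers a plain GET on its root URL with HTTP 200 once it is up. The function names and retry settings are illustrative, not part of Ollama.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except (urllib.error.URLError, OSError):
        return 0


def wait_for_ollama(base_url: str = "http://localhost:11434",
                    attempts: int = 30,
                    delay: float = 1.0,
                    fetch=fetch_status) -> bool:
    """Poll base_url until it returns 200 or the attempts run out."""
    for _ in range(attempts):
        if fetch(base_url) == 200:
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    if wait_for_ollama():
        print("Ollama is up")
    else:
        print("Ollama did not respond; check `docker logs ollama`")
```

The fetch parameter exists only so the polling loop can be exercised without a running server.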

Step 3: Pull Your First Model

The model you choose matters more than the framework you wrap around it. Ollama's library has most of the models you'd want, with Llama 3.1 (8B and 70B), Mistral, Phi, and Gemma available in variants optimized for different hardware profiles.

For most development tasks, Llama 3.1 8B is the right starting point:

bash

# Pull a model (this downloads several GB)
docker exec ollama ollama pull llama3.1:8b

Verify it's available:

bash

docker exec ollama ollama list

If you're on limited hardware and 8B is too heavy, phi3:3.8b runs surprisingly well for the size and is useful for lighter tasks like text classification or summarization.
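If you'd rather check from a script which models are installed, the HTTP API has an equivalent of ollama list at /api/tags. The sketch below assumes the response is a JSON object with a "models" array whose entries carry a "name" field; the helper names are made up for this example.

```python
import json
import urllib.request


def installed_models(tags_payload: dict) -> list:
    """Extract model names from a parsed /api/tags response."""
    # "models" may be missing or null when nothing has been pulled yet.
    return [m["name"] for m in (tags_payload.get("models") or [])]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch and parse /api/tags from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    names = installed_models(fetch_tags())
    if "llama3.1:8b" in names:
        print("llama3.1:8b is ready")
    else:
        print("Not pulled yet. Installed models:", names)
```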

Step 4: Talk to the API

Ollama exposes a REST API over plain HTTP. There is no WebSocket endpoint; streaming responses arrive over the same HTTP connection as newline-delimited JSON. The basic completion endpoint:

bash

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain why a linked list has O(1) insertion but O(n) search",
  "stream": false
}'

For streaming (better for interactive applications):

bash

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Write a Python decorator that retries on exceptions",
  "stream": true
}'

The streaming response comes back as JSON lines. Each line is a complete JSON object whose "response" field holds the next chunk of text, and the final object has "done": true. Parsing that is straightforward: read the body line by line and decode each line as JSON.
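As a sketch of that parsing, the generator below takes any iterable of lines and yields the text chunks. It assumes each line is a JSON object with "response" and "done" fields, and the function name is made up for this example. With the requests library, the lines would come from requests.post(..., stream=True).iter_lines().

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable) -> Iterator[str]:
    """Yield text chunks from a newline-delimited JSON stream."""
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # the final object marks the end of the stream


# Simulated stream; a real one would arrive as bytes over HTTP.
sample = [
    b'{"response": "Hel", "done": false}',
    b'{"response": "lo", "done": false}',
    b'{"response": "", "done": true}',
]
print("".join(iter_stream_text(sample)))  # prints "Hello"
```

For an interactive application, print each chunk as it arrives rather than joining them at the end.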

Step 5: Integrate into Your App

For Python:

python

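import requests


def complete(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send one non-streaming completion request to the local Ollama API."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,  # local generation can be slow, but don't hang forever
    )
    response.raise_for_status()  # surface errors such as a model that isn't pulled
    return response.json()["response"]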

For Node.js:

javascript

// Uses the built-in fetch available in Node.js 18+
async function complete(prompt, model = "llama3.1:8b") {
  const response = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: {"Content-Type": "application/json"},
    body: JSON.stringify({ model, prompt, stream: false })
  });
  if (!response.ok) {
    throw new Error(`Ollama request failed with status ${response.status}`);
  }
  const data = await response.json();
  return data.response;
}

Both are intentionally simple. Ollama's API is not complicated, and you don't need a client library to use it.

Common Gotchas

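Model not found after recreating the container. Ollama stores models in the volume, not in the container, so a plain docker restart is harmless. The problem appears when you docker rm and recreate without -v ollama:/root/.ollama, or with a different volume name: the new container starts with an empty model directory. Your models are still sitting in the original volume; recreate the container with the same -v flag and they reappear.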

GPU not detected inside container. The --gpus all flag requires the NVIDIA Container Toolkit. If nvidia-smi works on the host but not inside the container, you probably haven't installed it. The Ollama docs have the exact install steps for your distro.

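8GB RAM machine runs out of memory. Use a smaller or more heavily quantized model. Q4_0 (4-bit quantization) needs roughly a quarter of the memory of the FP16 weights, with acceptable quality loss for most tasks. Ollama's default tags are typically already 4-bit quantized, so if llama3.1:8b doesn't fit, dropping to a smaller model such as phi3:3.8b is usually the better move. To pin a specific quantization, use the full tag listed on the model's page in the Ollama library, for example ollama pull llama3.1:8b-instruct-q4_0.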

Port 11434 already in use. Something else is running on that port. Check with lsof -i :11434 or just change the mapping with -p 11435:11434 and point your client at that port instead.
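If lsof isn't available, the same port check can be done from Python. This is a small sketch using only the standard library; the helper names and the idea of scanning upward from 11434 for a free host port are assumptions for illustration.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) == 0


def first_free_port(start: int = 11434, tries: int = 10):
    """Return the first port at or above start with no listener, or None."""
    for port in range(start, start + tries):
        if not port_in_use(port):
            return port
    return None


if __name__ == "__main__":
    port = first_free_port()
    if port is None:
        print("No free port found in the scanned range")
    else:
        # The container still listens on 11434; only the host side changes.
        print(f"Use: -p {port}:11434 and point clients at localhost:{port}")
```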

What to Expect After Setup

You'll have a local LLM that responds in 2-8 seconds per completion depending on your hardware, handles most code generation and text tasks at quality comparable to GPT-4 from two years ago, and doesn't send your data anywhere. For prototyping, internal tooling, and tasks where latency isn't critical, it's the right tradeoff.

The next thing to set up is a simple web UI if you want one. Open WebUI, a community project rather than an official Ollama product, runs as a separate Docker container and connects to the API you've already configured. But that's a different guide.

Ollama v0.5+ running in Docker. GPU acceleration for NVIDIA via --gpus all; CPU-only in Docker on Apple Silicon. Tested with Llama 3.1 8B Q4_0 on 16GB RAM. API at localhost:11434.