
Ollama made local LLM deployment accessible. Docker made it reproducible. Put them together and you have a dev setup that's fast to spin up, fast to tear down, and doesn't require you to trust a third party with your data.
This isn't a "hello world" guide. It's the setup you'd write for your team after you've already made the mistakes.
You'll need Docker installed (obviously), and enough RAM to run the model you choose. A 7B model needs ~8GB of RAM minimum. A 70B model needs ~64GB. Don't try to run a 70B model on a laptop with 16GB — you will be sad.
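Those numbers follow from rough arithmetic: parameter count times bits per weight, plus headroom for the context cache and the runtime. Here's a back-of-the-envelope sketch. The helper name is made up for illustration, and it only counts the weights, so treat the result as a floor rather than a requirement.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes).

    Hypothetical helper for illustration. It ignores the KV cache, runtime
    overhead, and everything else running on the machine.
    """
    # billions of parameters * bits each / 8 bits per byte = billions of bytes
    return params_billions * bits_per_weight / 8


# Q4_0 packs 32 weights into 18 bytes, which works out to 4.5 bits per weight.
for name, params in [("8B", 8), ("70B", 70)]:
    fp16 = estimate_weights_gb(params, 16)
    q4 = estimate_weights_gb(params, 4.5)
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at Q4_0")
```

A 4-bit 8B model lands around 4.5 GB of weights, which is why it fits on an 8GB machine with some room left for context. A 4-bit 70B model is closer to 40 GB before any overhead, which is why the 16GB laptop is a non-starter.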
Ollama publishes official Docker images. The default image, `ollama/ollama`, includes CUDA support for NVIDIA GPUs. On Apple Silicon the image runs natively as arm64, but Docker on macOS doesn't pass the GPU through to containers, so inference there is CPU-only.
```bash
docker pull ollama/ollama:latest
docker run --rm ollama/ollama --version
```
The image is large because it bundles the GPU runtime libraries, but with a GPU passed through it handles model inference at near-native speeds. Without a GPU, the same image falls back to CPU, which works fine for smaller models.
The simplest startup:
```bash
docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama
```
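This gives you:
- A container running in the background (`-d`)
- A fixed container name, `ollama`, that later `docker exec` calls can target (`--name`)
- The API reachable on host port 11434 (`-p`)
- A named volume, `ollama`, so pulled models survive container removal (`-v`)

If you have an NVIDIA GPU and want to use it: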
```bash
docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama
```
The `--gpus all` flag requires the NVIDIA Container Toolkit. If you haven't installed it, the Ollama docs have straightforward instructions for Ubuntu, Debian, and Arch.
The model you choose matters more than the framework you wrap around it. Ollama's library has most of the models you'd want, with Llama 3.1 (8B and 70B), Mistral, Phi, and Gemma available in variants optimized for different hardware profiles.
For most development tasks, Llama 3.1 8B is the right starting point:
```bash
docker exec ollama ollama pull llama3.1:8b
```
Verify it's available:
```bash
docker exec ollama ollama list
```
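If you'd rather check from a script than from the CLI, the API has a `/api/tags` endpoint that returns the locally available models. A minimal sketch using only the standard library. The function names are made up for illustration, and it assumes the container is listening on the default port.

```python
import json
from urllib.request import urlopen


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def list_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models it has pulled."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

Calling `list_models()` after the pull should return a list that includes `llama3.1:8b`. An empty list means the server is up but has no models, which usually points at the volume.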
If you're on limited hardware and 8B is too heavy, `phi3:3.8b` runs surprisingly well for the size and is useful for lighter tasks like text classification or summarization.
Ollama exposes a REST API and streams responses over plain HTTP as newline-delimited JSON. The basic completion endpoint:
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain why a linked list has O(1) insertion but O(n) search",
  "stream": false
}'
```
For streaming (better for interactive applications):
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Write a Python decorator that retries on exceptions",
  "stream": true
}'
```
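The streaming response comes back as newline-delimited JSON: one object per line, each carrying a fragment of the output in its `response` field, with `done` set to true on the last one. Parsing that is straightforward, since most HTTP client libraries can iterate over a response line by line.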
For Python:
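```python
import requests


def complete(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send a non-streaming completion request and return the generated text."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,  # local generation can take a while, especially on CPU
    )
    response.raise_for_status()  # surface errors such as an unknown model name
    return response.json()["response"]
```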
For Node.js:
```javascript
async function complete(prompt, model = "llama3.1:8b") {
  const response = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  const data = await response.json();
  return data.response;
}
```
Both are intentionally simple. Ollama's API is not complicated, and you don't need a client library to use it.
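Streaming takes only a little more code. Here's a minimal sketch using just the standard library, relying on the fact that `/api/generate` with `stream` set to true sends one JSON object per line, each with a `response` fragment and a `done` flag. The function names are made up for illustration.

```python
import json
from typing import Iterable, Iterator
from urllib.request import Request, urlopen


def iter_fragments(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON chunks."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if "error" in chunk:
            raise RuntimeError(chunk["error"])
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # the final chunk carries stats, not more text


def stream_complete(
    prompt: str,
    model: str = "llama3.1:8b",
    base_url: str = "http://localhost:11434",
) -> Iterator[str]:
    """Stream a completion from a running Ollama server, fragment by fragment."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    request = Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as resp:
        # The response object yields the body one line at a time.
        yield from iter_fragments(resp)
```

To print output as it arrives, loop over `stream_complete("your prompt")` and call `print(fragment, end="", flush=True)` on each fragment. Keeping the parsing in its own function means you can test it against canned lines without a server running.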
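Model not found after container restart. Ollama stores models in the volume, but the container needs to be able to see that volume. If you `docker rm` and recreate without `-v ollama:/root/.ollama`, the new container starts with an empty model store. The models themselves are still in the volume, because the volume persists and the container doesn't carry state. Recreate the container with the same `-v` flag and they're back.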
GPU not detected inside container. The `--gpus all` flag requires the NVIDIA Container Toolkit. If `nvidia-smi` works on the host but not inside the container, you probably haven't installed it. The Ollama docs have the exact install steps for your distro.
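8GB RAM machine runs out of memory. Use a smaller or more aggressively quantized model. Q4_0 (4-bit quantization) stores weights in roughly a quarter of the space of 16-bit precision, with acceptable quality loss for most tasks. Ollama's default tags are usually already 4-bit, so if `llama3.1:8b` still doesn't fit, drop to a smaller model such as `phi3:3.8b`. To pin a specific quantization, pull the tag explicitly, for example `ollama pull llama3.1:8b-instruct-q4_0`. Exact tag names are listed on each model's page in the Ollama library.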
Port 11434 already in use. Something else is running on that port. Check with `lsof -i :11434`, or change the mapping to `-p 11435:11434` and point your client at port 11435 instead.
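If you want to catch the port conflict from a script before starting the container, here's a small sketch using only the standard library. The helper name is made up for illustration.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0
```

If `port_in_use(11434)` is true before you've started the container, something else owns the port. A natively installed Ollama is a likely candidate, since it uses the same default port.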
You'll have a local LLM that responds in 2-8 seconds per completion depending on your hardware, handles most code generation and text tasks at quality comparable to GPT-4 from two years ago, and doesn't send your data anywhere. For prototyping, internal tooling, and tasks where latency isn't critical, it's the right tradeoff.
The next thing to set up is a simple web UI if you want one. Open WebUI is a popular community project (not an official Ollama product) that runs as a separate Docker container and connects to the API you've already configured. But that's a different guide.
Ollama v0.5+ running in Docker. GPU acceleration with NVIDIA; CPU-only on Apple Silicon under Docker. Tested with Llama 3.1 8B Q4_0 on 16GB RAM. API at localhost:11434.