Most "local LLM" guides make it sound easy. They're lying. The actual friction is in keeping the server running, managing ports, restarting cleanly, and not having your GPU sit idle while you debug a config file.
This is the setup that actually works. No mysticism.
A single docker-compose.yml that gives you a persistent Ollama container, a web UI (OpenWebUI), and automatic GPU passthrough. After docker compose up, the only manual step left is pulling a model.
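Here is a minimal sketch of what that file can look like. The container name ollama, the ollama_data volume, port 3000, and the deploy.resources.reservations block come from this post. The image tags, Ollama's port 11434, the OLLAMA_BASE_URL variable, and the open_webui_data volume are assumptions about a typical setup, so check them against the Ollama and Open WebUI docs before relying on them. The heredoc simply saves the file in the current directory.

```shell
# Writes a sketch of the compose file to ./docker-compose.yml.
# Image tags, ports, and OLLAMA_BASE_URL are assumptions, not a verified reference config.
cat > docker-compose.yml <<'EOF'
services:
  ollama:
    image: ollama/ollama
    container_name: ollama          # matches "docker exec ollama ..."
    restart: unless-stopped
    ports:
      - "11434:11434"               # Ollama's default API port
    volumes:
      - ollama_data:/root/.ollama   # models are stored here inside the container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"                 # UI on http://localhost:3000
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open_webui_data:/app/backend/data
    depends_on:
      - ollama

volumes:
  ollama_data:
  open_webui_data:
EOF
```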
1. Save the file as docker-compose.yml somewhere
2. Boot it: docker compose up -d
3. Wait about 30 seconds for the services to start. On first run Docker pulls the images first, so that part takes longer
4. Open http://localhost:3000 — that's OpenWebUI, your chat interface
5. Pull a model directly in the UI, or via CLI:
docker exec ollama ollama pull llama3.2
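If you'd rather not guess at the wait in step 3, you can poll Ollama until it answers. This is a rough sketch: wait_for_ollama is a hypothetical helper, and it assumes Ollama's API is published on localhost:11434 and that curl is installed. /api/tags is Ollama's "list local models" endpoint, used here only as a cheap health check.

```shell
# Poll Ollama's HTTP API until it responds, instead of sleeping a fixed time.
# Assumes the API is reachable on localhost:11434 (an assumption about your port mapping).
wait_for_ollama() {
  url="${1:-http://localhost:11434/api/tags}"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f: fail on HTTP errors, -s: no progress output
    if curl -fs "$url" >/dev/null 2>&1; then
      echo "Ollama is up"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "Ollama did not respond after $tries attempts" >&2
  return 1
}
```

Chained together, the steps become: docker compose up -d && wait_for_ollama && docker exec ollama ollama pull llama3.2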
Persistence. The ollama_data named volume means your models survive container restarts, and even docker compose down as long as you don't pass -v (which deletes volumes). No re-pulling. Contrast that with running Ollama as a bare process where a crash means hunting for where you left your models.
Clean restarts. docker compose restart ollama and that's it. The Ollama process restarts, loaded models are dropped from VRAM, and you start from a fresh state. Much nicer than killing processes.
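A quick way to see both points at once is to restart and then list the installed models. This is a sketch, with restart_ollama as a hypothetical helper and the 3-second pause as an arbitrary guess at how long the server needs to come back.

```shell
# Restart Ollama, then list models to confirm the volume kept them.
# Assumes the service and the container are both named "ollama".
restart_ollama() {
  docker compose restart ollama || return 1
  # Give the server a moment to come back before querying it.
  sleep 3
  docker exec ollama ollama list
}
```

Run restart_ollama from the directory that holds docker-compose.yml. If the models you pulled earlier are still listed, the volume is doing its job.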
OpenWebUI beats talking to Ollama through the bare CLI or API. It supports chat sessions and prompt templates, and doesn't look like a 2005 web app.
Swap the deploy.resources.reservations block for CPU-only mode:
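The simplest reading of that swap is to drop the GPU reservation entirely. Here is a sketch that writes a CPU-only variant to a separate file, so the GPU version stays untouched. The file name docker-compose.cpu.yml is hypothetical, and the images, ports, and environment variable are the same kind of assumptions as in any typical Ollama plus Open WebUI setup.

```shell
# CPU-only sketch: the same two services with no GPU reservation on ollama.
# docker-compose.cpu.yml is a hypothetical file name; images and ports are assumptions.
cat > docker-compose.cpu.yml <<'EOF'
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # no GPU reservation here, so Ollama falls back to CPU

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open_webui_data:/app/backend/data
    depends_on:
      - ollama

volumes:
  ollama_data:
  open_webui_data:
EOF
# Start it with: docker compose -f docker-compose.cpu.yml up -d
```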
Then Ollama runs on CPU. It's slow but functional for testing. Don't expect 30 tokens/sec.
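Back on the GPU setup, it's worth confirming the container can actually see the card. One way to do that, as a sketch, is to run nvidia-smi inside the running container. check_gpu is a hypothetical helper, and it assumes the container is named ollama. If the toolkit is missing entirely, the container may refuse to start with a "could not select device driver" error before you get this far.

```shell
# Run nvidia-smi inside the Ollama container to check GPU passthrough.
# The container name defaults to "ollama" (taken from the docker exec command in this post).
check_gpu() {
  docker exec "${1:-ollama}" nvidia-smi
}
# Run it with: check_gpu
```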
If that prints your GPU info, you're good. If it says "command not found," your NVIDIA Container Toolkit isn't set up — fix that before anything else.
From inside the container:
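A sketch of that route, using the ollama CLI that ships in the container. pull_model is a hypothetical helper, and llama3.2 is just the model used earlier in this post.

```shell
# Pull a model with the ollama CLI inside the container, then list what is installed.
# Assumes the container is named "ollama".
pull_model() {
  model="${1:-llama3.2}"
  docker exec ollama ollama pull "$model" && docker exec ollama ollama list
}
# Run it with: pull_model llama3.2
```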
Or type the model name into OpenWebUI's model selector, which offers to pull it from Ollama if it isn't installed yet.
Most people mess this up by running Ollama bare-metal, then trying to add a frontend, then fighting port conflicts. This setup is: one command up, everything works, GPU used automatically.
If you're doing any LLM development and not running locally yet, start here. The latency difference vs. API calls is worth it for any prompt iteration work.