Run Local LLMs Without the Headache: Ollama + Docker-Compose Setup

The Docker Compose setup that makes local LLMs actually practical — persistent, restart-clean, with a real web UI.

Most "local LLM" guides make it sound easy. They're lying. The actual friction is in keeping the server running, managing ports, restarting cleanly, and not having your GPU sit idle while you debug a config file.

This is the setup that actually works. No mysticism.

What We're Building

A single docker-compose.yml that gives you: a persistent Ollama container, a web UI (OpenWebUI), and automatic GPU passthrough. Zero manual intervention after docker compose up.

The Docker-Compose File
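The file itself is not reproduced here. As a stand-in, this is a minimal sketch consistent with the rest of the guide: a container named ollama, a named volume ollama_data, an NVIDIA reservation under deploy.resources.reservations, and the UI on port 3000. The image tags, the published API port 11434, the open_webui_data volume, and the OLLAMA_BASE_URL setting are assumptions based on the two projects' usual defaults, not necessarily the author's exact file.

```shell
# Writes an assumed docker-compose.yml into the current directory.
cat > docker-compose.yml <<'EOF'
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"               # Ollama HTTP API (assumed to be published)
    volumes:
      - ollama_data:/root/.ollama   # downloaded models live here
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    depends_on:
      - ollama
    ports:
      - "3000:8080"                 # UI at http://localhost:3000
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open_webui_data:/app/backend/data

volumes:
  ollama_data:
  open_webui_data:
EOF
```

The heredoc is quoted ('EOF') so the shell writes the YAML verbatim without expanding anything inside it.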

Prerequisites

  • Docker with NVIDIA Container Toolkit (nvidia-smi works)
  • 8GB+ VRAM recommended (4-bit quantized 7B models also run fine on 6GB)
  • Docker Compose v2

Steps

1. Save the file as docker-compose.yml somewhere

2. Boot it: docker compose up -d

3. Give it a moment. On first run Docker has to pull the Ollama and OpenWebUI images, so startup takes longer than usual

4. Open http://localhost:3000 — that's OpenWebUI, your chat interface

5. Pull a model directly in the UI, or via CLI:

docker exec ollama ollama pull llama3.2
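Instead of waiting a fixed amount of time, steps 2 to 5 can be scripted with a readiness check. The sketch below assumes the Ollama API is published on localhost:11434 (Ollama's default port; your compose file may map it differently) and that curl is installed. wait_for_http and bring_up are hypothetical helper names, not part of Docker or Ollama.

```shell
# Poll a URL until it answers successfully, or give up after N attempts.
wait_for_http() {
  url="$1"; attempts="${2:-30}"; delay="${3:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Start the stack, wait for Ollama to respond, then pull a model.
bring_up() {
  docker compose up -d || return 1
  wait_for_http "http://localhost:11434/api/version" || {
    echo "Ollama did not become ready" >&2
    return 1
  }
  docker exec ollama ollama pull "${1:-llama3.2}"
}
```

Run bring_up from the directory that holds docker-compose.yml; pass a different model name as the first argument if you don't want llama3.2.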

Why This Works Better

Persistence. The ollama_data named volume means your models survive container restarts and even container removal, as long as you don't delete the volume itself. No re-pulling. Contrast that with running Ollama as a bare process, where a crash means hunting for where you left your models.
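One way to check the persistence claim is to tear the containers down, bring them back, and list the models again. This sketch assumes the container is named ollama, as in the pull command above. Compose normally prefixes volume names with the project name, so the volume shows up as something like myproject_ollama_data. verify_persistence is a hypothetical helper.

```shell
verify_persistence() {
  docker exec ollama ollama list &&          # models before teardown
  docker compose down &&                     # removes containers, keeps named volumes
  docker compose up -d &&
  sleep 5 &&                                 # crude pause while Ollama starts
  docker exec ollama ollama list &&          # same models should be listed again
  docker volume ls --filter name=ollama_data
}
```

Avoid docker compose down -v here: the -v flag deletes the named volumes, and the models with them.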

Clean restarts. docker compose restart ollama, and that's it. The Ollama process starts fresh, loaded models are unloaded, and the VRAM they held is released. Much nicer than killing processes.

A real chat interface. On its own, the Ollama container gives you a CLI and an HTTP API, not a web UI. OpenWebUI adds saved sessions and prompt templates, and doesn't look like a 2005 web app.

If You Don't Have a GPU

Swap the deploy.resources.reservations block for CPU-only mode:
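The replacement snippet is not shown here. A minimal sketch, assuming the GPU reservation is the only GPU-specific part of the file: a CPU-only ollama service is the same service with no deploy block at all. The image name, port mapping, and the file name docker-compose.cpu.yml are assumptions.

```shell
# Writes a CPU-only variant next to the main file (file name is hypothetical).
cat > docker-compose.cpu.yml <<'EOF'
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # No deploy.resources.reservations block: no GPU is requested.

volumes:
  ollama_data:
EOF
```

Start it with docker compose -f docker-compose.cpu.yml up -d. The web UI service can be copied over from the main file unchanged, since it never needed the GPU.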

Then Ollama runs on CPU. It's slow but functional for testing. Don't expect 30 tokens/sec.

Checking GPU Visibility
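The command for this check is missing here. A likely form, assuming the container is named ollama, is to run nvidia-smi inside it with docker exec. check_gpu is a hypothetical wrapper that adds a readable verdict.

```shell
check_gpu() {
  container="${1:-ollama}"
  if docker exec "$container" nvidia-smi; then
    echo "GPU visible inside $container"
  else
    echo "nvidia-smi failed inside $container; check the NVIDIA Container Toolkit" >&2
    return 1
  fi
}
```

Call check_gpu once the stack is up. The NVIDIA runtime injects nvidia-smi into the container, so a missing binary usually means the toolkit is not wired into Docker.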

If that prints your GPU info, you're good. If it says "command not found," your NVIDIA Container Toolkit isn't set up — fix that before anything else.

Loading a Model

From inside the container:
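The command is missing here. A plausible sketch, assuming the container name ollama and the llama3.2 model from the steps above: ollama run downloads the model if it is not present yet, loads it, and answers a single prompt. load_model is a hypothetical helper.

```shell
load_model() {
  model="${1:-llama3.2}"
  # One-shot prompt: pulls the model if needed, then loads it and replies.
  docker exec ollama ollama run "$model" "Reply with one short sentence." &&
  # Lists loaded models and whether they sit on GPU or CPU.
  docker exec ollama ollama ps
}
```

For an interactive session, run docker exec -it ollama ollama run llama3.2 instead.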

Or type the model name into OpenWebUI's model selector and pull it from there.

The Actual Workflow

Most people mess this up by running Ollama bare-metal, then trying to add a frontend, then fighting port conflicts. This setup is: one command up, everything works, GPU used automatically.

If you're doing any LLM development and not running locally yet, start here. The latency difference vs. API calls is worth it for any prompt iteration work.
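For prompt iteration you will usually script against the local HTTP API rather than the UI. A rough sketch, assuming the API is published on localhost:11434 and llama3.2 has already been pulled; ask_local is a hypothetical helper. It calls Ollama's /api/generate endpoint with streaming turned off, so the reply arrives as a single JSON object.

```shell
ask_local() {
  prompt="$1"; model="${2:-llama3.2}"
  # Naive JSON building: the prompt must not contain double quotes or backslashes.
  curl -fsS http://localhost:11434/api/generate \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\", \"prompt\": \"$prompt\", \"stream\": false}"
}
```

Something like time ask_local "Summarize Docker volumes in one line." gives a quick feel for round-trip latency on your own hardware.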