Stop Ollama Unloading Your Model: The keep_alive Trick

Tired of waiting 15 seconds for Ollama to reload llama3 every time you step away? One tiny env var kills that cold-start penalty for good. Here's the line, the gotchas, and the per-request override.

What You Need to Know: Ollama silently unloads your model from VRAM after 5 minutes of idle. On a 13B quant that's a 10–20 second cold-start penalty every time you come back. Set the OLLAMA_KEEP_ALIVE env var to -1 and the model stays resident until you kill the server. One line, big quality-of-life win.

Hey guys, Mr. Technology here. I want to share a tiny fix for a problem that had been quietly wrecking my local LLM workflow for months until I finally dug into it.

The Symptom

I was running a coding agent loop against a local qwen2.5-coder:14b on my dev box. I'd kick off a task, walk away to grab coffee, come back, and the next agent turn would hang for 15+ seconds before producing a single token. First request of every "session" was a slug.

I assumed it was my SSD. It wasn't.

The Cause

Ollama's default behavior is to unload a model from VRAM after 5 minutes of idle time. So every time I came back from coffee, the model had to be re-mmap'd from disk into GPU memory before it could serve a single token. With a 14B Q4_K_M that's roughly 9 GB of file → 9 GB of VRAM, and my box isn't fast about it.
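For a rough feel of why the reload hurts, here is a back-of-the-envelope sketch. It only models streaming the weights off disk once, so treat the result as a lower bound. The throughput figures are ballpark assumptions rather than measurements, and `estimate_cold_start_seconds` is a name made up for illustration.

```python
def estimate_cold_start_seconds(model_size_gb: float, read_throughput_gb_s: float) -> float:
    """Lower-bound reload time: seconds needed to read the weights from disk once."""
    if read_throughput_gb_s <= 0:
        raise ValueError("read throughput must be positive")
    return model_size_gb / read_throughput_gb_s


# Assumed, illustrative sustained read speeds in GB/s (not measured on any real box).
ASSUMED_THROUGHPUT = {"SATA-class SSD": 0.5, "NVMe-class SSD": 3.0}

for label, gb_per_s in ASSUMED_THROUGHPUT.items():
    seconds = estimate_cold_start_seconds(9.0, gb_per_s)  # ~9 GB of weights, as above
    print(f"{label}: at least {seconds:.1f}s to reload")
```

Real cold starts also pay for GPU allocation and runner startup, so measured numbers will usually land above this estimate.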

The Fix

Two ways to handle this. Pick the one that matches your setup.

1. Server-side (the real fix): Set the env var on the Ollama server process itself.

bash

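# Running `ollama serve` by hand (Linux/macOS): export in the same shell, or in your shell rc
export OLLAMA_KEEP_ALIVE=-1
# macOS menu-bar app: launchctl setenv OLLAMA_KEEP_ALIVE -1, then restart the app
# Linux systemd service: it does not read your shell rc, so use an override
sudo systemctl edit ollama
# Add:
# [Service]
# Environment="OLLAMA_KEEP_ALIVE=-1"
# Then restart so the daemon picks up the change
sudo systemctl restart ollama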

-1 means "never unload." The model stays in VRAM until you systemctl stop ollama or kill the process. You can also use 30m, 2h, or any Go duration string if you want a longer-but-finite window.
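To make those value rules concrete, here is a minimal sketch of how a keep_alive value could be interpreted. This is not Ollama's own parser: `keep_alive_seconds` is a hypothetical helper, it handles only a simplified subset of Go duration syntax, and it assumes bare numbers mean seconds.

```python
import math
import re

# Seconds per unit for a simplified subset of Go duration units.
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
_NUMBER = r"\d+(?:\.\d+)?"
_UNIT = r"(?:ns|us|ms|s|m|h)"


def keep_alive_seconds(value):
    """Return the idle window in seconds; math.inf means 'never unload'."""
    if isinstance(value, (int, float)):
        seconds = float(value)  # assumption: bare numbers are seconds
    else:
        text = value.strip()
        sign = -1.0 if text.startswith("-") else 1.0
        body = text.lstrip("+-")
        if re.fullmatch(_NUMBER, body):
            seconds = sign * float(body)
        elif re.fullmatch(f"(?:{_NUMBER}{_UNIT})+", body):
            parts = re.findall(f"({_NUMBER})({_UNIT})", body)
            seconds = sign * sum(float(n) * _UNIT_SECONDS[u] for n, u in parts)
        else:
            raise ValueError(f"unrecognized keep_alive value: {value!r}")
    # Any negative value is treated as "keep the model loaded indefinitely".
    return math.inf if seconds < 0 else seconds


for example in (-1, "-1", 0, "30m", "2h", "1h30m"):
    print(repr(example), "->", keep_alive_seconds(example))
```

A result of 0 corresponds to "unload right after the request", and anything positive is a finite idle window.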

2. Per-request (the surgical fix): If you can't restart the server (shared box, Docker, whatever), pass keep_alive on the request itself.

bash

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:14b",
  "prompt": "Write a quicksort in Python.",
  "keep_alive": -1,
  "stream": false
}'

The same flag works in the Python client:

python

import ollama
ollama.generate(
    model="qwen2.5-coder:14b",
    prompt="Write a quicksort in Python.",
    keep_alive=-1,
)
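If you'd rather not depend on the client library, the same request can be built with the standard library alone. The sketch below leaves out the prompt, which as far as I know makes Ollama just load the model without generating anything, so it doubles as a warm-up call. `build_keep_alive_request` and `send` are hypothetical helper names, and the default URL assumes a server on the standard local port.

```python
import json
import urllib.request


def build_keep_alive_request(model, keep_alive, base_url="http://localhost:11434"):
    """Build a POST to /api/generate that carries only a model and a keep_alive value."""
    payload = {"model": model, "keep_alive": keep_alive, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and decode the JSON reply (needs a running Ollama server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Warm the model and pin it in memory:
warm = build_keep_alive_request("qwen2.5-coder:14b", -1)
# Evict it again once a batch job is done:
evict = build_keep_alive_request("qwen2.5-coder:14b", 0)
print(warm.full_url, warm.data.decode("utf-8"))
# To actually hit the server: send(warm)
```

Nothing is sent until `send` is called, so the block is safe to run on a machine without Ollama.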

The Gotchas I Hit

Setting it via the API doesn't change the server default. A per-request keep_alive only applies to the model that request touches, and a later request that omits it falls back to the server default. I burned an hour wondering why my "setting" wasn't sticking. The env var is the only thing that changes the default.
keep_alive: 0 unloads the model immediately. Useful for batch jobs: load, infer, evict, free VRAM for the next model.
Watch your VRAM. If you're juggling multiple models, -1 on each can run you out of VRAM. I run -1 for my primary model and 5m (the default) for the rest.
Restart matters. A running Ollama daemon won't pick up a changed OLLAMA_KEEP_ALIVE until you bounce the service.
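A quick way to confirm the setting took effect is to ask the server what it currently has loaded. `ollama ps` shows this on the command line, and the `/api/ps` endpoint returns the same information as JSON. The sketch below summarizes that response; `summarize_loaded_models`, `fetch_loaded_models`, and the sample payload are illustrative assumptions, and the sample values are made up rather than captured from a real server.

```python
import json
import urllib.request


def summarize_loaded_models(ps_response):
    """Return (name, expires_at, VRAM in GB) for each model in an /api/ps response."""
    return [
        (m.get("name", "?"), m.get("expires_at", "?"), m.get("size_vram", 0) / 1e9)
        for m in ps_response.get("models", [])
    ]


def fetch_loaded_models(base_url="http://localhost:11434"):
    """Query a running Ollama server for its loaded models (needs network access)."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as response:
        return json.load(response)


# Hypothetical sample shaped like an /api/ps reply, used so this runs offline.
SAMPLE = {
    "models": [
        {
            "name": "qwen2.5-coder:14b",
            "expires_at": "2025-01-01T12:05:00Z",
            "size_vram": 9_000_000_000,
        }
    ]
}

for name, expires_at, vram_gb in summarize_loaded_models(SAMPLE):
    print(f"{name}: expires {expires_at}, {vram_gb:.1f} GB in VRAM")
# Against a live server: summarize_loaded_models(fetch_loaded_models())
```

If the expiry still sits about five minutes out after you changed the env var, the daemon most likely hasn't been restarted yet.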

Before / After

	Cold-start on first request	Subsequent requests
Before (default 5m)	12–18s wait	normal
After (`OLLAMA_KEEP_ALIVE=-1`)	no reload wait (after the initial load)	normal

My agent loop issues dozens of small calls in a burst and then sits idle whenever I step away, so that 5-minute idle window kept expiring between my turns. Killing it dropped my wall-clock task time on a 20-step refactor from 6 minutes to 2.

If you're running Ollama for anything more interactive than a one-shot script, set this and forget it. One line, no downside on a single-model box.

What do you think? Drop your thoughts in the comments below!