
What You Need to Know: Ollama silently unloads your model from VRAM after 5 minutes of idle time. On a 13B quant that's a 10–20 second cold-start penalty every time you come back. Set the OLLAMA_KEEP_ALIVE env var to -1 and the model stays resident until you kill the server. One line, big quality-of-life win.
Hey guys, Mr. Technology here. I want to share a tiny fix for a problem that quietly wrecked my local LLM workflow for months until I finally dug into it.
I was running a coding agent loop against a local qwen2.5-coder:14b on my dev box. I'd kick off a task, walk away to grab coffee, come back, and the next agent turn would hang for 15+ seconds before producing a single token. First request of every "session" was a slug.
I assumed it was my SSD. It wasn't.
Ollama's default behavior is to unload a model from VRAM after 5 minutes of idle time. So every time I came back from coffee, the model had to be re-mmap'd from disk into GPU memory before it could serve a single token. With a 14B Q4_K_M that's roughly 9 GB of file → 9 GB of VRAM, and my box isn't fast about it.
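To put a rough number on that reload cost, here is a back-of-the-envelope sketch. The throughput figures are assumptions picked for illustration, not measurements, and a real load adds overhead on top of the raw read, so treat the result as a lower bound.

```python
def estimated_load_seconds(model_gb: float, throughput_gb_per_s: float) -> float:
    """Lower-bound time to read a model file at a given sustained throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return model_gb / throughput_gb_per_s


if __name__ == "__main__":
    model_gb = 9.0  # roughly the 14B Q4_K_M file size mentioned above
    # Hypothetical sustained read speeds, chosen only for illustration.
    for label, speed in [("slow disk", 0.5), ("fast NVMe", 3.0)]:
        print(f"{label}: ~{estimated_load_seconds(model_gb, speed):.0f}s")
```

At the slow end of those assumed speeds, a 9 GB file lands in the same ballpark as the 15+ second hang described above.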
Two ways to handle this. Pick the one that matches your setup.
1. Server-side (the real fix): Set the env var on the Ollama server process itself.
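
    # Linux/macOS: make it permanent in your shell rc
    # (applies when you start `ollama serve` from that shell)
    export OLLAMA_KEEP_ALIVE=-1

    # Or, for the systemd service, add an override
    sudo systemctl edit ollama
    # Add:
    # [Service]
    # Environment="OLLAMA_KEEP_ALIVE=-1"
    # Then restart so the server picks it up
    sudo systemctl restart ollama
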
-1 means "never unload." The model stays in VRAM until you systemctl stop ollama or kill the process. You can also use 30m, 2h, or any Go duration string if you want a longer-but-finite window.
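If the value rules feel fuzzy, this small sketch spells them out. It is my own illustration, not Ollama's parser: the function name is hypothetical, and it only handles simple Go-style durations built from h, m, s, and ms.

```python
import re
from typing import Optional, Union

_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")


def keep_alive_seconds(value: Union[int, float, str]) -> Optional[float]:
    """Return the idle window in seconds, or None for "never unload".

    Illustrative only: negative means never unload, 0 means unload
    immediately, bare numbers are seconds, strings are simple durations.
    """
    if isinstance(value, (int, float)):
        return None if value < 0 else float(value)
    text = value.strip()
    if text.startswith("-"):
        return None
    if re.fullmatch(r"\d+(?:\.\d+)?", text):
        return float(text)
    parts = _PART.findall(text)
    # Reject anything the simple pattern did not fully consume.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {value!r}")
    return sum(float(num) * _UNITS[unit] for num, unit in parts)
```

So `keep_alive_seconds("30m")` gives 1800.0, `keep_alive_seconds("2h")` gives 7200.0, and `keep_alive_seconds(-1)` gives None, meaning the model never unloads.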
2. Per-request (the surgical fix): If you can't restart the server (shared box, Docker, whatever), pass keep_alive on the request itself.

    curl http://localhost:11434/api/generate -d '{
      "model": "qwen2.5-coder:14b",
      "prompt": "Write a quicksort in Python.",
      "keep_alive": -1,
      "stream": false
    }'

The same flag works in the Python client:

    import ollama

    # keep_alive=-1 asks the server to keep this model loaded indefinitely
    response = ollama.generate(
        model="qwen2.5-coder:14b",
        prompt="Write a quicksort in Python.",
        keep_alive=-1,
    )
    print(response["response"])

A few things to watch for:

- **keep_alive: 0 unloads the model immediately.** Useful for batch jobs: load, infer, evict, free VRAM for the next model.
- If you run multiple models, -1 on each will OOM you. I run -1 for my primary model and 5m (the default) for the rest.
- A running server won't pick up a new OLLAMA_KEEP_ALIVE until you bounce the service.

| | Cold-start on first request | Subsequent requests |
|---|---|---|
| Before (default 5m) | 12–18s wait | normal |
| **After (KEEP_ALIVE=-1)** | instant | normal |
For an agent loop that fires a burst of small calls and then sits idle while I review the output or step away, that 5-minute idle window keeps expiring between my turns. Killing it dropped my wall-clock task time on a 20-step refactor from 6 minutes to 2.
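The load, infer, evict pattern from the keep_alive 0 note can be sketched with just the standard library. This assumes the default endpoint at http://localhost:11434 and that a generate request without a prompt only loads the model (or unloads it when keep_alive is 0), as I understand the Ollama FAQ. The function names are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default address


def build_request(model: str, keep_alive, prompt: str = "") -> urllib.request.Request:
    """Build a non-streaming /api/generate request with an explicit keep_alive."""
    payload = {"model": model, "keep_alive": keep_alive, "stream": False}
    if prompt:
        payload["prompt"] = prompt  # omit the prompt to only load or unload
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request to a running Ollama server and decode the JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def load_infer_evict(model: str, prompt: str) -> str:
    """Pin the model, run one prompt, then unload it to free VRAM."""
    send(build_request(model, keep_alive=-1))          # preload and pin
    reply = send(build_request(model, -1, prompt))     # infer
    send(build_request(model, keep_alive=0))           # evict
    return reply["response"]
```

Calling `load_infer_evict("qwen2.5-coder:14b", "Write a quicksort in Python.")` against a running server would leave VRAM free for the next model in the batch.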
If you're running Ollama for anything more interactive than a one-shot script, set this and forget it. One line, no downside on a single-model box.
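To confirm the setting actually took, you can ask the server what it currently has loaded. This is a minimal sketch assuming the default address and the /api/ps endpoint; the field names (`models`, `name`, `expires_at`) reflect my reading of the API docs, so double-check them against your Ollama version.

```python
import json
import urllib.request

PS_URL = "http://localhost:11434/api/ps"  # assumed default address


def loaded_models(ps_payload: dict) -> list:
    """Extract (name, expires_at) pairs from an /api/ps style payload."""
    return [
        (entry.get("name"), entry.get("expires_at"))
        for entry in ps_payload.get("models", [])
    ]


def fetch_loaded_models(url: str = PS_URL) -> list:
    """Query a running Ollama server and list what is resident right now."""
    with urllib.request.urlopen(url) as response:
        return loaded_models(json.load(response))
```

If your model still shows up in `fetch_loaded_models()` well after the 5-minute mark, the keep-alive setting is doing its job.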
What do you think? Drop your thoughts in the comments below!