Let's be honest — calling an API for every little test is a pain. Latency, costs, rate limits, and the nagging feeling that you're shipping your prompts somewhere you shouldn't. Ollama fixes that.
It drops a full LLM runtime on your machine, no GPU rack required (though it helps), and gets you from zero to running in about as long as it takes to grab a coffee.
```bash
brew install ollama                            # macOS (Homebrew)
curl -fsSL https://ollama.com/install.sh | sh  # Linux (official install script)
ollama --version                               # confirm the install worked
```
That's it. No Docker, no Python environment juggling, no config files to hand-craft.
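```bash
ollama pull llama3.2       # default tag (3B parameters)
ollama pull llama3.2:1b    # pick a specific size with a tag
ollama pull mistral        # Mistral 7B
ollama list                # show what you've downloaded, with sizes
```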
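Model files land in `~/.ollama/models/` by default on macOS; a Linux install that runs Ollama as a system service typically keeps them under `/usr/share/ollama/.ollama/models` instead. Setting the `OLLAMA_MODELS` environment variable moves them elsewhere. Models range from roughly a gigabyte to tens of gigabytes each, so plan your storage accordingly.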
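To check how much space your pulled models actually take, a minimal sketch like this walks the model directory and totals the file sizes. It assumes the `~/.ollama/models/` location mentioned above; the helper names `dir_size_bytes` and `format_gib` are illustrative, not part of Ollama.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root is missing)."""
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def format_gib(num_bytes: int) -> str:
    """Render a byte count in GiB with two decimals."""
    return f"{num_bytes / 1024**3:.2f} GiB"


# Assumed default location; adjust if your install stores models elsewhere.
models_dir = Path.home() / ".ollama" / "models"
print(f"{models_dir}: {format_gib(dir_size_bytes(models_dir))}")
```

Running it before and after a pull gives a quick read on what each model costs you in disk.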
```bash
ollama run llama3.2
```
You're dropped into an interactive prompt. Query it, paste code, ask questions. Hit `Ctrl+D` to exit.
For one-off calls without the REPL:
```bash
ollama run llama3.2 "Explain async/await in three sentences"
```
The real value isn't the CLI; it's calling the model from code. Here's a Python example using the official `ollama` package (`pip install ollama`):
```python
import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[
        {"role": "system", "content": "You are a terse code reviewer."},
        {"role": "user", "content": "Why is this function slow?"},
    ]
)
print(response['message']['content'])
```
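For longer answers you may prefer to print text as it is generated rather than wait for the whole reply. The sketch below assumes a chat function that behaves like `ollama.chat` called with `stream=True`, yielding chunks whose `['message']['content']` holds the next piece of text. The helper name `stream_reply` is hypothetical, and the chat function is passed in as a parameter so the helper can be exercised without a running server.

```python
from typing import Callable, Iterable


def stream_reply(chat_fn: Callable[..., Iterable], model: str, prompt: str) -> str:
    """Print a reply piece by piece as it arrives and return the full text.

    chat_fn is expected to behave like ollama.chat called with stream=True:
    it yields chunks whose ['message']['content'] holds the next text piece.
    """
    pieces = []
    for chunk in chat_fn(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    ):
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)  # show text as soon as it arrives
        pieces.append(piece)
    print()  # finish the line once the stream ends
    return "".join(pieces)
```

With the `ollama` package installed and the server running, `stream_reply(ollama.chat, "llama3.2", "Explain async/await in three sentences")` should print the answer incrementally and return it as a single string.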
Or via the REST API that Ollama spins up automatically:
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "What is retrieval-augmented generation?"}]
}'
```
This turns your local LLM into a backend service other tools can hit. That's where it gets interesting.
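As a rough sketch of what "other tools" might look like, here is a client that needs only the Python standard library. It targets the same `/api/chat` endpoint as the curl call and sets `"stream": false`, since the endpoint otherwise answers with a stream of JSON lines rather than one object. The function names `build_chat_request` and `ask` and the 120-second timeout are assumptions for illustration.

```python
import json
import urllib.request

# Default local address, as used in the curl example above.
OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a POST request for /api/chat that asks for one non-streamed reply."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object instead of a line-per-chunk stream
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

Keeping request construction separate from the network call makes the payload easy to inspect or log before anything is sent. `ask("llama3.2", "What is retrieval-augmented generation?")` would need the Ollama server to be running.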
Ollama shines for iteration, local experiments, data you don't want leaving your machine, and anything where network latency matters. When you need guaranteed uptime, the newest frontier models, or multimodal capabilities beyond the vision models you can run locally, a hosted API still wins.
But for the everyday build-test-tweak loop? Your own hardware is faster and cheaper than it sounds.
*Hardware requirements scale with model size. Start small, benchmark your workload, and size up only when you have evidence the bigger model earns its compute.*
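One way to gather that evidence is to compare generation speed across model sizes. The sketch below assumes you have the parsed JSON of a completed, non-streamed REST response, which reports `eval_count` (tokens generated) and `eval_duration` (generation time in nanoseconds). The helper names `tokens_per_second` and `compare` are illustrative.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from a completed /api/chat or /api/generate response.

    Assumes the response carries eval_count (tokens generated) and
    eval_duration (time spent generating, in nanoseconds).
    """
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) * 1e9 / duration_ns


def compare(results: dict) -> list:
    """Rank models by measured speed, fastest first.

    results maps a model name to one completed response dict.
    """
    speeds = {name: tokens_per_second(resp) for name, resp in results.items()}
    return sorted(speeds.items(), key=lambda item: item[1], reverse=True)
```

Speed is only half the picture. Pair these numbers with a look at answer quality on your own prompts before deciding a larger model is worth it.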