Stop paying for API calls when you are iterating on prompts. Here is how I run Llama 3 and friends locally in under 10 minutes.

Running Local LLMs for Development: My Ollama Setup That Actually Works

Look, I get it. You do not want to deal with local LLM setup. It is always a pain, the documentation is scattered, and you end up spending more time fighting Docker than actually building anything.

But here is the thing: when you are rapidly iterating on prompts for code generation, content extraction, or any of the dozen daily tasks where LLMs actually help, paying $0.01–$0.20 per API call adds up fast. More importantly, the round-trip latency kills your flow.
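To put a number on "adds up fast", here is a back-of-the-envelope sketch. The per-call price range is the one quoted above; the 200 calls per day is a made-up figure for illustration, so plug in your own.

```python
# Rough cost of prompt iteration at hosted-API prices.
# The price range comes from the post; the call volume is hypothetical.

def session_cost(calls, low=0.01, high=0.20):
    """Return the (cheapest, priciest) total in dollars for `calls` API calls."""
    return (round(calls * low, 2), round(calls * high, 2))

calls_per_day = 200  # hypothetical: a busy day of prompt tweaking
low, high = session_cost(calls_per_day)
print(f"{calls_per_day} calls/day: ${low:.2f} to ${high:.2f} per day")
print(f"Over 20 working days: ${low * 20:.2f} to ${high * 20:.2f}")
```

At that volume the daily bill lands somewhere between a coffee and a nice dinner, depending on which end of the price range your calls fall.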

I have been running local models for six months now. Here is exactly what I do.

The Stack

Ollama for model management. It is not perfect, but it is the easiest way to get a model running locally without fighting Python environments or CUDA configuration. Continue is the editor integration: it gives you tab-completion-style suggestions in VS Code without being obtrusive. And LM Studio is the backup for when Ollama's default context window is not cutting it.

Step 1: Install Ollama

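On a Mac with Homebrew, installing is one command: brew install ollama. That is it. For Linux, use the official install script: curl -fsSL https://ollama.com/install.sh | sh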

Windows users, you will want the installer from ollama.com/download. Fair warning: GPU support works more smoothly on Linux and Mac. Getting it working under WSL2 on Windows is... a journey.

Step 2: Pull a Model

My daily driver is Llama 3 8B for most tasks. It is fast enough to feel local, smart enough to not hallucinate obvious things, and the quantizations are well-tested.

For code-specific work, I keep Code Llama 7B around. It is noticeably better at understanding context around complex functions, especially in languages with weird syntax (looking at you, Rust and Haskell).

Mistral 7B is worth trying as well. It is no heavier than the other two, so if they run on your machine, it will too. Pull it with: ollama pull mistral
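Each pull is a one-liner (ollama pull followed by the model tag). If you would rather script all three, say for setting up a second machine, a minimal sketch could look like this. The tags are the ones I believe the Ollama library uses for these models; treat them as assumptions and check the library page if a pull fails. The helper names are my own.

```python
import subprocess

# Assumed Ollama library tags for the three models mentioned above.
MODELS = ["llama3:8b", "codellama:7b", "mistral:7b"]

def pull_command(tag):
    """Build the argv list for pulling one model with the Ollama CLI."""
    return ["ollama", "pull", tag]

def pull_all(tags=MODELS):
    """Pull each model in turn; stops at the first failure."""
    for tag in tags:
        # check=True raises CalledProcessError if the pull exits non-zero
        subprocess.run(pull_command(tag), check=True)
```

Calling pull_all() downloads several gigabytes per model, so run it on a decent connection.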

Step 3: The Workflow

I run Ollama as a background service. It starts automatically on boot, so I am never waiting for it.
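Since everything downstream assumes the service is already up, it can be handy for scripts to check before they fire off a prompt. Here is a small sketch that assumes Ollama is listening on its default address, http://localhost:11434, where the root path answers with a plain 200 response. The function name is mine, not part of Ollama.

```python
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address

def ollama_is_up(base_url=OLLAMA_URL, timeout=2.0):
    """Return True if something answers HTTP 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error status, etc.
        return False
```

A script can then bail out early with a readable message instead of a stack trace when the service is not running.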

Then in my terminal, I can just query it directly: ollama run llama3, followed by the prompt in quotes.

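But the real power comes from the API. Ollama exposes an HTTP endpoint on localhost, by default at http://localhost:11434/api/generate, which takes a JSON body with the model name and the prompt.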
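Here is a minimal sketch of calling that endpoint from Python using only the standard library. It assumes the default port (11434), a model already pulled under the tag llama3, and the non-streaming mode of /api/generate, which returns a single JSON object with the text in its "response" field. The function names are mine.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"

def build_payload(prompt, model="llama3"):
    """Encode the request body. stream=False asks for one JSON object
    instead of a stream of chunks."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")

def generate(prompt, model="llama3", timeout=120.0):
    """Send the prompt to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        GENERATE_URL,
        data=build_payload(prompt, model),  # passing data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the service running, print(generate("Explain a mutex in one sentence.")) is all it takes.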

I pipe this into scripts, aliases, and anything else that needs LLM access without leaving my terminal. No API keys. No rate limits. No bills.
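As one example of the piping idea, here is a sketch of a tiny command-line helper that reads whatever is piped in, prepends an instruction, and prints the reply as it streams. It assumes the default streaming behavior of /api/generate (one JSON object per line, each carrying a "response" fragment and a "done" flag). The function names and the ask.py file name are hypothetical.

```python
import json
import sys
import urllib.request

def iter_fragments(lines):
    """Yield text fragments from a newline-delimited JSON stream."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break

def ask(prompt, model="llama3", url="http://localhost:11434/api/generate"):
    """Stream the model's reply to stdout as it arrives."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for fragment in iter_fragments(resp):
            sys.stdout.write(fragment)
            sys.stdout.flush()
    sys.stdout.write("\n")

def main():
    # Intended use: cat notes.txt | python ask.py "Summarize this:"
    instruction = " ".join(sys.argv[1:])
    piped = "" if sys.stdin.isatty() else sys.stdin.read()
    ask(f"{instruction}\n\n{piped}".strip())
```

Save it as a script, call main() at the bottom, and it slots into a shell alias like any other command-line tool.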

The Quantization That Actually Matters

If you are on a machine with limited VRAM (8GB or less), use Q4_K_M quantization. It stores weights at roughly 4 to 5 bits each instead of 16, which is a good balance between size and quality.

The difference is noticeable on 8GB systems: the model fits in memory with room left over for context, so you get steady output instead of watching it stall mid-sentence when the context gets heavy.
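The arithmetic behind that is simple enough to sketch: weight size is roughly parameter count times bits per weight, divided by 8. The 4.8 bits-per-weight figure for Q4_K_M below is a ballpark assumption on my part (the real number varies by model), and the estimate covers the weights only, not the context cache or runtime overhead.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return round(params_billion * bits_per_weight / 8, 1)

# 16 bits for fp16; ~4.8 bits is an assumed ballpark for Q4_K_M.
for label, bits in [("fp16", 16), ("Q4_K_M (approx.)", 4.8)]:
    print(f"8B model at {label}: ~{weights_gb(8, bits)} GB")
```

An 8B model at fp16 will not fit in 8GB at all, while the 4-bit version leaves a few gigabytes free for context.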

What I Do Not Do

I do not try to run 70B models locally unless I have a workstation with serious GPU resources. The 8B models are good enough for 90% of what I need, and the latency difference between 8B and 70B is night and day.

I also do not bother with custom Modelfiles for most things. The default prompt templates work fine for development tasks. Save the complex configuration for when you actually need it.

The Bottom Line

Getting the first model running takes about 10 minutes. Configuring everything properly took me about 30, and I have been using the setup daily since. The cost saving is real: I am not counting API calls anymore. But more importantly, the latency is low enough that it does not interrupt my thinking.

If you are iterating on prompts for code generation, classification, or transformation tasks, local models are the move. Especially for stuff you do not want flying across the internet.

Give it a shot. Worst case, you are out 10 minutes and you go back to your API keys. Best case, you have got a setup that just works.

— Mr. TECHNOLOGY

Next week: The specific prompting technique that cut my token usage by 40% while improving output quality.