Running Local LLMs for Development: My Ollama Setup That Actually Works

Stop paying for API calls when you are iterating on prompts. Here is how I run Llama 3 and friends locally in under 10 minutes.

Look, I get it. You do not want to deal with local LLM setup. It is always a pain, the documentation is scattered, and you end up spending more time fighting Docker than actually building anything.

But here is the thing: when you are rapidly iterating on prompts for code generation, content extraction, or any of the dozen daily tasks where LLMs actually help — paying $0.01–$0.20 per API call adds up fast. And more importantly, the round-trip latency kills your flow.

I have been running local models for six months now. Here is exactly what I do.

The Stack

**Ollama** for model management. It is not perfect, but it is the easiest way to get a model running locally without fighting Python environments or CUDA configuration. **Continue** as the editor integration — it gives you tab-completion style suggestions in VS Code without being obtrusive. And **LM Studio** as a backup when Ollama's context window is not cutting it.

Step 1: Install Ollama

On Mac, installation is a single step. On Linux, it is a one-line install script from ollama.com.

Windows users, you will want the installer from ollama.com/download. Fair warning: GPU passthrough works better on Linux/Mac. Windows WSL2 setup is... a journey.
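Whichever installer you used, it is worth confirming that the `ollama` binary actually landed on your PATH before moving on. A minimal Python sketch, assuming a standard install that exposes an `ollama` executable (the helper name `find_ollama` is made up for illustration):

```python
import shutil


def find_ollama(binary: str = "ollama"):
    """Return the full path of the Ollama CLI, or None if it is not on PATH."""
    return shutil.which(binary)


path = find_ollama()
if path is None:
    print("ollama not found on PATH; open a new shell or re-run the installer")
else:
    print(f"ollama found at {path}")
```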

Step 2: Pull a Model

My daily driver is **Llama 3 8B** for most tasks. It is fast enough to feel local, smart enough to not hallucinate obvious things, and the quantizations are well-tested.

For code-specific work, I keep **Code Llama 7B** around. It is noticeably better at understanding context around complex functions, especially in languages with weird syntax (looking at you, Rust and Haskell).

If you have more RAM than sense, **Mistral 7B** is worth trying:
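Pulling any of these comes down to one `ollama pull` per model. Here is a minimal sketch that scripts it from Python. The tags `llama3:8b`, `codellama:7b`, and `mistral:7b` are my assumption of the matching Ollama library names, and the helper names are hypothetical, so adjust if a pull fails:

```python
import subprocess

# Assumed Ollama library tags for the three models mentioned above.
MODELS = ["llama3:8b", "codellama:7b", "mistral:7b"]


def pull_command(model: str) -> list:
    """Build the CLI invocation that downloads one model."""
    return ["ollama", "pull", model]


def pull_all(models: list) -> None:
    """Download each model in turn, stopping at the first failure."""
    for model in models:
        # check=True raises CalledProcessError if the pull exits non-zero.
        subprocess.run(pull_command(model), check=True)
```

Calling `pull_all(MODELS)` fetches all three. Each download is a few gigabytes, so you may prefer to pull only the one you plan to use first.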

Step 3: The Workflow

I run Ollama as a **background service**. It starts automatically on boot, so I am never waiting for it.
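Since everything downstream assumes the service is up, a quick health check is handy at the top of a script. A minimal sketch, assuming Ollama is listening on its default address `http://localhost:11434` (the function name `ollama_is_up` is hypothetical):

```python
import http.client
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed URL.
        return False
```

Note that this only confirms something is answering on that port. It does not check which models are loaded.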

Then, from my terminal, I can query a model directly with a one-off `ollama run` call.

But the real power comes from the **API**. Ollama exposes an HTTP endpoint on localhost, port 11434 by default.

I pipe this into scripts, aliases, and anything else that needs LLM access without leaving my terminal. No API keys. No rate limits. No bills.
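To make that concrete, here is a minimal sketch of a single non-streaming call to the local generate endpoint, using only the standard library. It assumes the default port (11434) and a model pulled under the name `llama3`. The helper names are mine, not part of Ollama:

```python
import json
import urllib.request

# Default local address of Ollama's generate endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming POST request for a single completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3", timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        body = json.load(resp)
    # With "stream": False the server replies with one JSON object whose
    # "response" field holds the full completion.
    return body["response"]
```

Something like `print(generate("Write a regex that matches ISO dates"))` then works from any script, provided the service is running and the model has been pulled.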

The Quantization That Actually Matters

If you are on a machine with limited VRAM (under 8GB), use **Q4_K_M** quantization. It is a good balance between size and quality.

The difference is noticeable on 8GB systems — you will actually get coherent output instead of watching it stall mid-sentence when the context gets heavy.
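The back-of-the-envelope math shows why. Weight memory is roughly parameter count times bits per weight. The sketch below assumes about 4.8 bits per weight for Q4_K_M (an approximate figure, since the format mixes block types) and ignores the KV cache and runtime overhead, which add more on top:

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Llama 3 8B has roughly 8 billion parameters.
fp16_gb = weight_size_gb(8, 16)   # unquantized half precision
q4_gb = weight_size_gb(8, 4.8)    # assumed average for Q4_K_M

print(f"fp16:   ~{fp16_gb:.1f} GB")
print(f"Q4_K_M: ~{q4_gb:.1f} GB")
```

Under these assumptions the weights drop from about 16 GB to under 5 GB, which is the difference between not fitting and fitting on an 8GB card. Plugging in 70 for a 70B model gives about 42 GB at the same quantization.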

What I Do Not Do

I do not try to run 70B models locally unless I have a workstation with serious GPU resources. The 8B models are good enough for 90% of what I need, and the latency difference between 8B and 70B is night and day.

I also do not bother with custom Modelfiles for most things. The default prompts work fine for development tasks. Save the complex configuration for when you actually need it.

The Bottom Line

This setup took me about 30 minutes to configure properly, and I have been using it daily since. The cost saving is real — I am not counting API calls anymore. But more importantly, the latency is low enough that it does not interrupt my thinking.

If you are iterating on prompts for code generation, classification, or transformation tasks, local models are the move. Especially for stuff you do not want flying across the internet.

Give it a shot. Worst case, you are out 10 minutes and you go back to your API keys. Best case, you have got a setup that just works.

— *Mr. TECHNOLOGY*

*Next week: The specific prompting technique that cut my token usage by 40% while improving output quality.*