Let me cut through the noise: running a local LLM isn't just for enthusiasts anymore. For development work (prompt iteration, testing, batch processing, anything where you're making API calls more than once), local inference is often faster, costs nothing per call, and gives you control that hosted models can't.
The setup I use every day is Ollama. It's the cleanest way to get a real LLM running on your machine with zero friction. Here's exactly how to do it.
Ollama is a single binary that runs models locally. No cloud configuration, no Docker complexity, no API keys. You install it, you tell it which model to pull, and you have a local endpoint that speaks the OpenAI API protocol, meaning most code written against the OpenAI chat completions API works with your local model.
This matters because the workflow is: write code against the OpenAI API → swap the base URL to localhost → keep everything else as it is. I've been using this for six months. There's no network round trip, the per-call cost is zero, and I can run iterations overnight without burning budget.
On macOS:
brew install ollama
On Linux:
curl -fsSL https://ollama.com/install.sh | sh
On Windows: download the installer from [ollama.com/download](https://ollama.com/download).
That's it. No dependencies, no configuration files, no environment variables yet.
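If you want a script to confirm the install before it does anything else, a small check like this is enough. It is a minimal sketch using only the standard library; the function names are mine, not part of Ollama.

```python
import shutil
import subprocess
from typing import Optional


def ollama_on_path() -> bool:
    """Return True if an `ollama` executable is found on PATH."""
    return shutil.which("ollama") is not None


def ollama_version() -> Optional[str]:
    """Return whatever `ollama --version` prints, or None if the binary is missing."""
    if not ollama_on_path():
        return None
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print warnings to stderr, so fall back to that if stdout is empty.
    return result.stdout.strip() or result.stderr.strip()
```

Calling `ollama_version()` at the top of a batch script gives you an early, readable failure instead of a connection error halfway through.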
The model choice matters more than people admit. Here's my practical breakdown:
**Llama 3.1 8B** — My daily driver. Good enough for most development tasks, fast on consumer hardware, minimal memory footprint. If you're on an M-series Mac or a machine with 16GB+ RAM, start here.
**Llama 3.1 70B** — If you need reasoning quality closer to GPT-4 and you have the hardware (48GB+ RAM or a high-end GPU). Slower, more expensive to run, but meaningfully better at complex chain-of-thought tasks.
**Mistral 7B** — Worth trying if Llama feels off for your use case. Different training data, different outputs. Sometimes the same prompt hits differently with a different base model.
**Qwen 2.5 72B** — If you need strong code generation and you're on a serious compute budget. I've seen it outperform Llama significantly on TypeScript and Python tasks.
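The RAM figures above follow from simple arithmetic on the weights. Here is a rough rule-of-thumb sketch, assuming 4-bit quantized weights (an assumption on my part; check the tag you actually pull) and ignoring the KV cache and runtime overhead, so real usage will be higher.

```python
def approx_weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Rough size of the model weights in GB (1 GB = 1e9 bytes).

    Ignores the KV cache, activations, and runtime overhead, so actual
    memory use will be higher than this number.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare 4-bit and 16-bit footprints for the model sizes discussed above.
for name, size_b in [("8B", 8), ("70B", 70), ("72B", 72)]:
    print(
        f"{name}: ~{approx_weight_gb(size_b):.0f} GB at 4-bit, "
        f"~{approx_weight_gb(size_b, 16):.0f} GB at 16-bit"
    )
```

By this estimate an 8B model needs about 4 GB for weights and a 70B model about 35 GB, which is why 16GB and 48GB machines are comfortable starting points: they leave headroom for context and everything else you're running.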
To pull a model (the bare `llama3.1` tag resolves to the 8B variant):
ollama pull llama3.1
The first pull takes a while: Ollama downloads the model weights, which are several gigabytes. After that, the model loads from local storage in seconds. To chat with it:
ollama run llama3.1
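This opens an interactive terminal session. The CLI talks to a background Ollama server, which the desktop app and the Linux install script normally start for you. If it isn't already running (for example after a Homebrew install), or you want it in the foreground, start it yourself: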
ollama serve
The server listens on `http://localhost:11434` by default and exposes an OpenAI-compatible API under `/v1`. You can talk to it with any OpenAI-compatible client by pointing the base URL at it:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but ignored
)
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain async/await in Python"}],
)
print(response.choices[0].message.content)
This is the part that makes Ollama genuinely useful. You don't rewrite your code. You just change the endpoint. Everything else is the same.
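One way to make the endpoint swap a one-line change is to build the client arguments from an environment variable. This is a sketch, not a required pattern: `LLM_BACKEND` is a variable name I made up for illustration, while `OPENAI_API_KEY` is the usual name for the hosted key.

```python
import os

LOCAL_BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint


def client_settings(env=None) -> dict:
    """Build keyword arguments for an OpenAI-compatible client.

    LLM_BACKEND is a hypothetical variable name chosen for this sketch:
    "local" (the default) targets Ollama, anything else targets the hosted API.
    """
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "local") == "local":
        # Ollama ignores the key, but the client requires a non-empty value.
        return {"base_url": LOCAL_BASE_URL, "api_key": "ollama"}
    # Hosted: leave base_url unset so the client falls back to its default endpoint.
    return {"api_key": env["OPENAI_API_KEY"]}
```

With that in place, `client = OpenAI(**client_settings())` is the only line that knows which backend you're on, and the rest of the script doesn't change.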
On macOS, Ollama automatically uses Metal GPU acceleration on M-series chips. You don't configure anything. On Linux with an NVIDIA GPU:
ollama run llama3.1 # runs with GPU acceleration if available
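If you're on Linux and not seeing GPU usage, first check that the NVIDIA driver is installed and that `nvidia-smi` lists your card. Then re-run the install script so Ollama can detect the GPU again: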
curl -fsSL https://ollama.com/install.sh | sh
Ollama auto-detects your GPU, and `ollama ps` shows whether a loaded model is running on GPU or CPU. The speed difference can be 5-10x on large models. If you're running 70B models without GPU acceleration, you're doing it wrong.
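To check GPU offload from a script rather than by eye, you can ask the server which models are loaded. This sketch assumes the native `/api/ps` endpoint returns entries with `size` and `size_vram` byte counts; that is my understanding of the response shape, so verify the field names against your Ollama version.

```python
import json
import urllib.request


def gpu_fraction(model_entry: dict) -> float:
    """Fraction of a loaded model that sits in GPU memory (0.0 to 1.0).

    Assumes `size` and `size_vram` are byte counts, as in an `/api/ps`
    entry; treat these field names as an assumption to verify.
    """
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return min(model_entry.get("size_vram", 0) / size, 1.0)


def loaded_models(base_url: str = "http://localhost:11434") -> list:
    """Ask a running Ollama server which models are currently loaded."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])
```

A fraction well below 1.0 for a model you expected to fit on the GPU usually means part of it spilled to CPU memory, which is where the big slowdowns come from.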
One model running is fine. Three models for different tasks is better. Ollama has a built-in model management system:
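ollama list  # show the models installed locally
ollama rm llama3.1  # delete a model you no longer need
ollama cp llama3.1 llama3.1-codedev  # copy an installed model under a new name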
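The same inventory is available over HTTP, which is handy when a script needs to confirm a model exists before using it. A minimal sketch, assuming the native `/api/tags` endpoint returns a `models` list whose entries carry a `name` field:

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list:
    """Extract sorted model names from an `/api/tags`-style payload."""
    return sorted(m["name"] for m in tags_payload.get("models", []))


def installed_models(base_url: str = "http://localhost:11434") -> list:
    """Fetch the names of locally installed models from a running server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

Names come back with their tag attached (for example `llama3.1:latest`), so compare against the full string or split on the colon.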
I keep three models installed at all times: a fast 8B for quick tasks, a medium 13B for development work, and a large 70B for complex reasoning. I swap between them depending on what I'm doing.
Local models mean your data stays on your machine. This isn't paranoia — it's a real constraint for anyone working with proprietary code, customer data, or anything you don't want in a third-party's logs. When I work with sensitive prompts, I run them locally. No exceptions.
The tradeoff is real: local models are less capable than frontier models on hard tasks. But for the majority of development work — drafting emails, writing tests, explaining code, debugging — the gap is small enough that the privacy benefit wins.
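If "sensitive prompts stay local" is a hard rule, it can be enforced in code rather than left to memory. This is one possible guard, not something Ollama provides: the helper names are hypothetical, and it only checks the hostname of the URL you are about to call.

```python
from urllib.parse import urlparse

# Hostnames treated as "this machine" for the purposes of this sketch.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_local_endpoint(base_url: str) -> bool:
    """Return True if the URL's host is a loopback name or address."""
    return urlparse(base_url).hostname in LOCAL_HOSTS


def require_local(base_url: str) -> str:
    """Return the URL unchanged, or raise if it points off this machine."""
    if not is_local_endpoint(base_url):
        raise ValueError(f"refusing to send a sensitive prompt to {base_url}")
    return base_url
```

Wrapping the base URL in `require_local(...)` on the sensitive code path turns an accidental switch to the hosted API into an immediate error.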
My daily workflow: I have a terminal window permanently running `ollama serve`. When I need to test a prompt, I run it against the local model. When I need to process a batch of text, I write a Python script that calls the local endpoint. When I need the best quality and I'm willing to pay and wait, I switch to the hosted API.
The local model handles 80% of my AI-assisted work. The hosted API handles the 20% where I need frontier quality.
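The batch step in that workflow can be very small. Here is a sketch using only the standard library against the local OpenAI-compatible endpoint; the function names and the `complete_fn` hook are my own, added so the loop can be tested without a running server.

```python
import json
import urllib.request

LOCAL_CHAT_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "llama3.1") -> dict:
    """Build an OpenAI-style chat completion payload with a single user message."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def complete(prompt: str, model: str = "llama3.1", url: str = LOCAL_CHAT_URL) -> str:
    """Send one prompt to the local server and return the reply text."""
    body = json.dumps(build_chat_request(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    # Generous timeout: local generation on large inputs can be slow.
    with urllib.request.urlopen(req, timeout=300) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]


def run_batch(texts, instruction: str, complete_fn=complete) -> list:
    """Apply the same instruction to every text, one request at a time."""
    return [complete_fn(f"{instruction}\n\n{text}") for text in texts]
```

Something like `run_batch(documents, "Summarize in one sentence:")` is the whole overnight job. Requests run sequentially here, which keeps the sketch simple and avoids overloading a single local model.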
This is the setup that actually scales in practice. Not the all-local dream, not the all-hosted reality. The split that makes sense given what you actually need.
*Ollama is free. The models are free. Your data stays on your machine. The only cost is the hardware you already own.*