
You don't need to send your company's roadmap to OpenAI to use a capable model at your desk. You need two things: Ollama to run the model locally, and Open WebUI to give it a ChatGPT-style chat interface. Total install time on a MacBook Pro with an M-series chip: under ten minutes. Total cost: zero. Total data leaving your machine: zero.
Hey guys, Mr. Technology here.
```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama --version
```
That's the whole runtime. Ollama bundles a llama.cpp-based inference engine, a model registry client, and a local API server in one binary. The install script is written for Linux, so on a Mac the safer route is to grab the .dmg from ollama.com and skip the curl. Either way, if the server isn't already running, `ollama serve` starts it on http://127.0.0.1:11434.
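If you want to confirm the daemon is listening before going further, a small check like the one below should do it. This is a minimal sketch using only the standard library; it assumes the default address from above and Ollama's `/api/version` endpoint, and the helper name `ollama_version` is just for illustration.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

OLLAMA_URL = "http://127.0.0.1:11434"  # default address; change it if you moved the daemon


def ollama_version(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> Optional[str]:
    """Return the daemon's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat all as "not running".
        return None


version = ollama_version()
if version:
    print(f"Ollama {version} is up")
else:
    print("Ollama is not reachable; try `ollama serve`")
```

Returning `None` instead of raising keeps the check easy to drop into a setup script that should carry on and print a hint.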
Skip the toy 7B. For coding, summarization, and chat you want a 14B–32B parameter model that fits in unified memory. On a 32 GB Mac, this is the sweet spot:
```bash
ollama pull qwen2.5-coder:14b
ollama pull llama3.1:8b
ollama pull nomic-embed-text
```
The first one is your coding workhorse. The second is a faster general chat model. The third is for embeddings if you later wire up RAG. The downloads are roughly 9 GB, 5 GB, and under 300 MB respectively, and they cache to ~/.ollama/models. Re-pulling a tag later only downloads the layers that changed.
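To guess whether a model will fit before you spend the bandwidth, the arithmetic is simple enough to script. The sketch below is a rough estimate, not a measurement: it assumes about 4.85 bits per weight (a typical figure for 4-bit K-quant files), a 20% allowance for context cache and runtime overhead, and 8 GB reserved for macOS and your other apps. All three numbers, the parameter counts, and the function names are my assumptions.

```python
def quantized_size_gb(params_billions: float, bits_per_weight: float = 4.85) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_memory(
    params_billions: float,
    ram_gb: float,
    overhead: float = 1.2,    # assumed 20% extra for context cache and runtime
    reserve_gb: float = 8.0,  # assumed headroom for the OS and other apps
) -> bool:
    """Rough check: do the weights plus overhead fit in the RAM left over?"""
    return quantized_size_gb(params_billions) * overhead <= ram_gb - reserve_gb


# Approximate parameter counts in billions; treat them as ballpark figures.
for name, params in [("llama3.1:8b", 8.03), ("qwen2.5-coder:14b", 14.8)]:
    size = quantized_size_gb(params)
    verdict = "fits" if fits_in_memory(params, ram_gb=32) else "too big"
    print(f"{name}: ~{size:.1f} GB on disk, {verdict} on a 32 GB machine")
```

The estimate ignores long context windows, which can add several more gigabytes, so treat a marginal "fits" as a maybe.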
Quick sanity check:
```bash
ollama run qwen2.5-coder:14b "Write a Python one-liner to flatten a nested list"
```
If you get a sensible answer, your stack is alive.
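The same sanity check can be run over HTTP, which is closer to how other tools will talk to the daemon. Here is a minimal sketch against Ollama's native `/api/generate` endpoint with streaming turned off; the helper names `build_generate_request` and `generate` are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"


def build_generate_request(model: str, prompt: str, base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local daemon and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

With the daemon running, `print(generate("qwen2.5-coder:14b", "Write a Python one-liner to flatten a nested list"))` should print an answer much like the CLI run does. The generous timeout is there because the first call has to load the model into memory.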
The fastest path to Open WebUI is Docker. One container, one volume, no fuss.
```bash
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```
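Open http://localhost:3000 and create an account (the first signup becomes admin, and the account lives only in your local Open WebUI database). The interface auto-discovers Ollama running on the host by reaching it at host.docker.internal: Docker Desktop resolves that name out of the box, while Linux needs `--add-host=host.docker.internal:host-gateway` on the `docker run` command. Every model you `ollama pull` appears in the model dropdown. Streaming, markdown, code highlighting, conversation history, and image attachments are all there.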
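If a model is missing from the dropdown, it helps to see what the daemon itself reports. Ollama lists installed models at `/api/tags`, which is presumably what the UI reads as well. A minimal sketch follows; the helper names and the sample payload are illustrative, and the real response carries more fields per model.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"


def model_names(tags_response: dict) -> list[str]:
    """Extract the sorted model names from an /api/tags response body."""
    return sorted(model["name"] for model in tags_response.get("models", []))


def fetch_model_names(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Ask the local daemon which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


# Offline demo with a trimmed-down, made-up payload in the same shape.
sample = {
    "models": [
        {"name": "qwen2.5-coder:14b"},
        {"name": "llama3.1:8b"},
        {"name": "nomic-embed-text:latest"},
    ]
}
print(model_names(sample))
```

If `fetch_model_names()` shows a model that the UI does not, the problem is likely the connection between the container and the host rather than the pull.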
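If Docker isn't your thing, `pip install open-webui` followed by `open-webui serve` works equally well. Same UI, same features, just running in your own Python environment. It expects Python 3.11 and serves on http://localhost:8080 by default rather than port 3000.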
Ollama exposes an OpenAI-compatible endpoint, so your existing client code barely changes:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:11434/v1",
    api_key="ollama",  # any string works
)

resp = client.chat.completions.create(
    model="qwen2.5-coder:14b",
    messages=[{"role": "user", "content": "Refactor this function for readability"}],
)
print(resp.choices[0].message.content)
```
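That same base URL works in Continue, Cline, and most other tools that let you point the OpenAI endpoint somewhere else. Cursor is a partial exception: as far as I can tell it sends custom-endpoint requests through its own servers, so a localhost address may need a tunnel there. One local daemon, nearly every client.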
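You don't even need the `openai` package to talk to that endpoint. The sketch below streams a reply from `/v1/chat/completions` using only the standard library, assuming the server emits the usual OpenAI-style `data:` lines ending with `[DONE]`. The function names are hypothetical, and the sample lines are hand-written rather than captured output.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and anything unexpected
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if choices:
            fragment = choices[0].get("delta", {}).get("content")
            if fragment:
                yield fragment


def stream_chat(model: str, prompt: str, base_url: str = "http://127.0.0.1:11434/v1") -> Iterator[str]:
    """Stream a chat completion from the local OpenAI-compatible endpoint."""
    body = {"model": model, "stream": True, "messages": [{"role": "user", "content": prompt}]}
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer ollama"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        yield from iter_stream_text(resp)


# Offline demo: hand-written lines in the same shape as a streamed reply.
sample = [
    b'data: {"choices":[{"delta":{"role":"assistant","content":"Hel"}}]}\n',
    b"\n",
    b'data: {"choices":[{"delta":{"content":"lo"}}]}\n',
    b"data: [DONE]\n",
]
print("".join(iter_stream_text(sample)))
```

With the daemon running, `for piece in stream_chat("llama3.1:8b", "Say hi"): print(piece, end="", flush=True)` should print the reply as it arrives. Keeping the parser separate from the HTTP call makes it easy to test without a model loaded.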
You went from a blank laptop to a private ChatGPT clone with persistent chat history, multi-model support, and an OpenAI-compatible API in ten minutes. No cloud accounts, no prompts leaving your machine, no rate limits, no $20/month subscription. The only thing missing is scale, and that's a problem you solve with vLLM and a GPU box when you actually need one.
— Mr. Technology
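*Ollama: MIT-licensed local model runner, macOS/Linux/Windows. Open WebUI: self-hosted; older releases were MIT, while recent ones ship under the project's own BSD-3-based license with a branding clause, so check the repo before redistributing. Models: qwen2.5-coder (Apache 2.0), llama3.1 (Llama 3.1 Community License), nomic-embed-text (Apache 2.0). RAM budget at the default 4-bit quantization: 8B models need ~8 GB, 14B need ~12 GB, 32B need ~24 GB. Swap `--restart always` for `--restart unless-stopped` on systemd hosts.*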