
Every time you ship a prompt to a third-party API, you're paying latency, burning credits, and handing your context to someone else's servers. For local development and prototyping, that's a tax you don't need to pay.
Self-hosting an LLM is easier now than it's ever been. LM Studio is the fastest path from zero to a running model on your own GPU (or even CPU). This guide gets you there in under 15 minutes.
CPU-only is an option if your model fits in system RAM; it's slower but functional for non-realtime use cases.
Download it from lmstudio.ai (free for personal use). The installer handles CUDA/Metal/Vulkan setup automatically.
```bash
# macOS only; requires Homebrew (the cask is named lm-studio)
brew install --cask lm-studio
```
Launch the app. You'll see a clean interface with a model search built in.
Use the search bar in LM Studio to find a model. Good starting points for local dev:
| Model | Size | VRAM | Best For |
|---|---|---|---|
| Qwen2.5-7B-Instruct | ~5GB | 6GB | Fast, capable, great value |
| Mistral-7B-Instruct | ~5GB | 6GB | Classic, well-optimized |
| Llama-3.1-8B-Instruct | ~5GB | 8GB | Strong general purpose |
| Qwen2.5-14B-Instruct | ~9GB | 12GB | Higher quality, still manageable |
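The sizes in the table are consistent with simple arithmetic: a quantized model takes roughly its parameter count times the bits per weight, divided by eight, in bytes. Here is a rough sketch of that estimate. The 5 bits-per-weight default and the 1.2 overhead factor (for context cache and runtime buffers) are assumptions chosen to match the table, not figures published by LM Studio.

```python
def estimate_file_size_gb(params_billion: float, bits_per_weight: float = 5.0) -> float:
    """Approximate size of a quantized model in GB (10^9 bytes).

    bits_per_weight is an assumed average for a 4- to 5-bit quantization.
    """
    return params_billion * bits_per_weight / 8


def fits_in_vram(
    params_billion: float,
    vram_gb: float,
    bits_per_weight: float = 5.0,
    overhead: float = 1.2,  # assumed headroom for context cache and buffers
) -> bool:
    """Return True if the estimated footprint fits in the given VRAM."""
    needed = estimate_file_size_gb(params_billion, bits_per_weight) * overhead
    return needed <= vram_gb


# Ballpark check for a 7B and a 14B model
print(estimate_file_size_gb(7))   # about 4.4 GB
print(fits_in_vram(14, 8))        # a 14B model is too large for 8 GB under these assumptions
```

Treat the result as a first filter only; actual usage depends on the quantization format and the context length you configure.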
Click Download and wait. Files land in LM Studio's models directory (typically ~/.lmstudio/models/ on recent versions, or ~/.cache/lm-studio/models/ on older ones).
LM Studio includes a built-in OpenAI-compatible API server. Click the Server tab on the left sidebar.
With the port set to 1234 and the context length set to 8192, start the server. You're now serving an OpenAI-compatible API at http://localhost:1234/v1.
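Before wiring up a client, it can help to confirm the server is actually answering. The sketch below assumes the server exposes the standard OpenAI-style `GET /v1/models` listing; the helper names `parse_model_ids` and `list_local_models` are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # the endpoint from this guide


def parse_model_ids(payload: dict) -> list:
    """Pull model identifiers out of an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_local_models(base_url: str = BASE_URL, timeout: float = 5.0) -> list:
    """Ask the local server which models it is serving.

    Raises urllib.error.URLError if the server is not running.
    """
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

Calling `list_local_models()` while the server is running should return the identifiers you can pass as `model` in later requests; a connection error usually means the server has not been started.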
Any OpenAI-compatible client works. Just swap the base URL and use any model name:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # dummy value, required by SDK
)

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[
        {"role": "system", "content": "You are a terse code reviewer."},
        {
            "role": "user",
            # Escaped newlines keep the snippet inside one valid string literal
            "content": (
                "Explain why this Python is slow:\n\n"
                "for i in range(len(data)): process(data[i])"
            ),
        },
    ],
    temperature=0.3,
)

print(response.choices[0].message.content)
```
The SDK thinks it's talking to OpenAI. It isn't. That's the point.
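Because the only difference between local and cloud is the base URL and key, the choice can live in configuration rather than code. The following is one possible sketch: the `USE_LOCAL_LLM` flag and the `client_kwargs` helper are hypothetical names invented for illustration, while `OPENAI_API_KEY` is the usual variable for a real key.

```python
import os

LOCAL_BASE_URL = "http://localhost:1234/v1"  # LM Studio server from this guide


def client_kwargs(env=None) -> dict:
    """Build keyword arguments for an OpenAI-compatible client.

    USE_LOCAL_LLM is a hypothetical flag: "1" (the default) targets the
    local server, anything else falls back to the cloud API.
    """
    env = os.environ if env is None else env
    if env.get("USE_LOCAL_LLM", "1") == "1":
        return {"base_url": LOCAL_BASE_URL, "api_key": "lm-studio"}
    return {"api_key": env["OPENAI_API_KEY"]}
```

With this in place, `OpenAI(**client_kwargs())` talks to the local model during development and to the hosted API when the flag is switched off.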
For LangChain, CrewAI, or custom agent loops, the same swap works:
```python
from crewai import Agent, Task, Crew, LLM

# CrewAI's LLM wrapper; the "openai/" prefix selects an OpenAI-compatible endpoint
llm = LLM(
    model="openai/qwen2.5-7b-instruct",
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # dummy value, the local server ignores it
)

reviewer = Agent(
    role="Code Reviewer",
    goal="Find bugs and performance issues",
    backstory="Senior engineer, very direct",
    llm=llm,
)
```
No API keys. No network. No bills.
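For a custom agent loop with no SDK at all, the same endpoint can be driven with the standard library. This is a minimal sketch, assuming the server implements the OpenAI-style `POST /v1/chat/completions` route; `build_chat_request`, `extract_reply`, and `chat_turn` are illustrative names, not part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"
MODEL = "qwen2.5-7b-instruct"


def build_chat_request(base_url, model, messages, temperature=0.3):
    """Build a POST request for an OpenAI-style chat completion."""
    body = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Return the assistant text from a chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat_turn(history, user_text, base_url=BASE_URL, model=MODEL):
    """Send one user turn, append both sides to history, return the reply."""
    history.append({"role": "user", "content": user_text})
    request = build_chat_request(base_url, model, history)
    with urllib.request.urlopen(request, timeout=120) as resp:
        reply = extract_reply(json.load(resp))
    history.append({"role": "assistant", "content": reply})
    return reply
```

Keeping the conversation in a plain list makes it easy to inspect or trim the history between turns, which matters when the context length is capped.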
Local hosting makes sense for:
Stick with cloud APIs (OpenAI, Anthropic, etc.) when you need:
LM Studio turns a downloaded model into a local API endpoint in under two minutes. For anyone building AI-augmented tools, running local-first prototypes, or just tired of watching API credits evaporate during development, this workflow pays off immediately.
No remote servers. No external calls. Just a model running on your own hardware, as private and fast as your GPU allows.