Click **Download** and wait. Files land in `~/.cache/lmstudio/models/`.
## Step 3 - Run a Local Server
LM Studio includes a built-in **OpenAI-compatible API server**. Click the **Server** tab on the left sidebar.
- Set a **port** (default: `1234`)
- Set a **context length** (e.g., `8192`)
- Load your model (GPU slider controls offload - move it right to use your VRAM)
- Hit **Start Server**
You're now serving an OpenAI-compatible API at `http://localhost:1234/v1`.
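To sanity-check that the server is up, you can list the models it exposes through the standard OpenAI-compatible `/v1/models` route. A minimal sketch, assuming the default port `1234`:

```python
from openai import OpenAI

# Quick sanity check: list the models the local LM Studio server exposes.
# Assumes the default port 1234 set in the Server tab.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
for m in client.models.list():
    print(m.id)
```

If this prints the identifier of the model you loaded, the server is ready for the steps below.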
## Step 4 - Point Your Code at It
Any OpenAI-compatible client works. Just swap the base URL and use any model name:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # dummy value, required by SDK
)

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[
        {"role": "system", "content": "You are a terse code reviewer."},
        {"role": "user", "content": """Explain why this Python is slow:
for i in range(len(data)):
    process(data[i])"""},
    ],
    temperature=0.3,
)

print(response.choices[0].message.content)
```
The SDK thinks it's talking to OpenAI. It isn't. That's the point.
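Streaming works through the same SDK surface. A minimal sketch against the same local endpoint (the prompt is just an example):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Stream tokens from the local server as they are generated.
stream = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[{"role": "user", "content": "Explain the GIL in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```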
## Step 5 - Use It with Your Agent Framework
For LangChain, CrewAI, or custom agent loops, the same swap works:
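For LangChain, a minimal sketch, assuming the `langchain-openai` package is installed (the prompt is illustrative):

```python
from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI chat wrapper at the local LM Studio server.
llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # dummy value
    model="qwen2.5-7b-instruct",
    temperature=0.3,
)

print(llm.invoke("Name one common cause of slow Python loops.").content)
```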
And a CrewAI example:
```python
from crewai import Agent, LLM

# CrewAI routes model calls through LiteLLM; the "openai/" prefix tells it
# to treat the local server as an OpenAI-compatible endpoint.
llm = LLM(
    model="openai/qwen2.5-7b-instruct",
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
)

reviewer = Agent(
    role="Code Reviewer",
    goal="Find bugs and performance issues",
    backstory="Senior engineer, very direct",
    llm=llm,
)
```
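To actually run the agent, wrap it in a task and a crew. A minimal sketch continuing from the snippet above (the task description and `expected_output` are placeholders):

```python
from crewai import Crew, Task

review_task = Task(
    description="Review the provided code for bugs and performance issues.",
    expected_output="A short list of concrete findings.",
    agent=reviewer,
)

crew = Crew(agents=[reviewer], tasks=[review_task])
result = crew.kickoff()
print(result)
```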
No API keys. No network. No bills.
## Performance Tips
- **GPU offloading**: Move the slider all the way right in the LM Studio server tab. CPU inference on a 7B model is 5-10x slower.
- **Quantization**: GGUF format models (what LM Studio downloads) are already quantized. Smaller quantizations (Q4_K_M, Q5_K_S) save VRAM at acceptable quality loss.
- **Batch size**: Increase the batch size in server settings if you're running many concurrent requests (see the sketch after this list).
- **Context length**: 8192 is a sweet spot. Going higher costs VRAM fast.
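To exercise concurrency against the local server, here's a rough sketch using the async OpenAI client: it fires a handful of requests at once so you can see how throughput holds up under your batch settings. The prompts are arbitrary.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="qwen2.5-7b-instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Give one Python performance tip, variation {i}." for i in range(8)]
    # Send all requests concurrently; the server's batch settings determine
    # how many are actually processed in parallel.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for a in answers:
        print(a)

asyncio.run(main())
```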
## When to Use This vs. Cloud APIs
Local hosting makes sense for:
- **Development and testing** - iterate fast, no API cost
- **Privacy-sensitive work** - code, customer data, and internal docs never leave your machine
- **High-volume, low-stakes tasks** - bulk processing, batch reviews, data transformation
Stick with cloud APIs (OpenAI, Anthropic, etc.) when you need:
- The absolute best model quality for your use case
- Elastic scaling with no hardware constraints
- Built-in safety/content filtering at scale
## Wrapping Up
LM Studio turns a downloaded model into a local API endpoint in under two minutes. For anyone building AI-augmented tools, running local-first prototypes, or just tired of watching API credits evaporate during development, this workflow pays off immediately.
No servers. No external calls. Just a model running on your own hardware, as private and fast as your GPU allows.