2026-05-11

How to Self-Host an Open-Source LLM for Local Development

Skip the API bills and latency. Here's how to run a capable open-source LLM entirely on your own hardware using LM Studio — and integrate it into your agentic workflows.

Every time you ship a prompt to a third-party API, you're paying latency, burning credits, and handing your context to someone else's servers. For local development and prototyping, that's a tax you don't need to pay.

Self-hosting an LLM is easier now than it's ever been. LM Studio is the fastest path from zero to a running model on your own GPU (or even CPU). This guide gets you there in under 15 minutes.

What You'll Need

  • A machine with a decent GPU (6GB+ VRAM recommended for 7B models, 12GB+ for 13B+)
  • macOS, Windows, or Linux
  • ~10GB of free disk space for the model

CPU-only is an option if your model fits in memory — it's slower but functional for non-realtime use cases.
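The VRAM numbers above come from simple arithmetic: weight memory is roughly parameter count times bits per weight. Here is a rough sketch, assuming about 4.8 bits per weight for a 4-bit GGUF quantization and a flat 1 GB allowance for runtime buffers. Both figures are illustrative assumptions, not LM Studio's own numbers, and the function names are made up for this example.

```python
def estimate_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9


def estimate_vram_gb(params_billions: float, bits_per_weight: float = 4.8,
                     overhead_gb: float = 1.0) -> float:
    """Weights plus a flat allowance for cache and runtime buffers (assumed)."""
    return estimate_weight_gb(params_billions, bits_per_weight) + overhead_gb


for size in (7, 13):
    print(f"{size}B model: ~{estimate_vram_gb(size):.1f} GB")
```

Under these assumptions a 7B model needs a little over 5 GB and a 13B model just under 9 GB, which is why 6GB and 12GB cards are comfortable targets.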

Step 1 - Install LM Studio

Download it from lmstudio.ai (free for personal use). The installer handles CUDA/Metal/Vulkan setup automatically.

```bash
# macOS - if you prefer the CLI
brew install --cask lm-studio # requires Homebrew
```

Launch the app. You'll see a clean interface with a model search built in.

Step 2 - Download a Model

Use the search bar in LM Studio to find a model. Good starting points for local dev:

| Model | Size | VRAM | Best For |
|---|---|---|---|
| Qwen2.5-7B-Instruct | ~5GB | 6GB | Fast, capable, great value |
| Mistral-7B-Instruct | ~5GB | 6GB | Classic, well-optimized |
| Llama-3.1-8B-Instruct | ~5GB | 8GB | Strong general purpose |
| Qwen2.5-14B-Instruct | ~9GB | 12GB | Higher quality, still manageable |

Click Download and wait. Files land in LM Studio's models folder: ~/.lmstudio/models/ in recent versions, ~/.cache/lm-studio/models/ in older ones.
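To see what has been downloaded and how much disk it uses, a small script can walk the models folder. This is a sketch only: the helper names are hypothetical, and the folder is passed in as a parameter because its location depends on your LM Studio version and OS.

```python
from pathlib import Path


def list_gguf_models(models_dir: Path) -> list[tuple[str, float]]:
    """Return (relative path, size in GB) for every .gguf file under models_dir."""
    found = []
    for path in sorted(models_dir.rglob("*.gguf")):
        size_gb = path.stat().st_size / 1e9
        found.append((str(path.relative_to(models_dir)), size_gb))
    return found


def print_models(models_dir: Path) -> None:
    """Print one line per downloaded model file."""
    for name, size_gb in list_gguf_models(models_dir):
        print(f"{size_gb:6.2f} GB  {name}")
```

Call `print_models(Path.home() / "...")` with whichever models path your install actually uses.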

Step 3 - Run a Local Server

LM Studio includes a built-in OpenAI-compatible API server. Click the Server tab in the left sidebar (recent versions label it Developer).

  • Set a port (default: 1234)
  • Set a context length (e.g., 8192)
  • Load your model (GPU slider controls offload - move it right to use your VRAM)
  • Hit Start Server

You're now serving an OpenAI-compatible API at http://localhost:1234/v1.
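Before wiring up any client, it's worth confirming the server is actually answering. The sketch below queries the OpenAI-style `/models` endpoint using only the standard library. The function names are hypothetical, and the default port assumes you kept 1234.

```python
import json
import urllib.request
from typing import Optional


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model identifiers from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:1234/v1",
                    timeout: float = 3.0) -> Optional[list[str]]:
    """Return the served model ids, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return parse_model_ids(json.load(resp))
    except OSError:  # covers URLError, connection refused, and timeouts
        return None
```

`fetch_model_ids()` returning `None` means the server isn't running. A list of ids gives you the exact model names to pass in later requests.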

Step 4 - Point Your Code at It

Any OpenAI-compatible client works. Just swap the base URL and set the model name to the identifier LM Studio shows for the model you loaded:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # dummy value, required by SDK
)

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[
        {"role": "system", "content": "You are a terse code reviewer."},
        {
            "role": "user",
            # Adjacent string literals are joined; \n keeps the snippet on its own line
            "content": (
                "Explain why this Python is slow:\n\n"
                "for i in range(len(data)): process(data[i])"
            ),
        },
    ],
    temperature=0.3,
)

print(response.choices[0].message.content)
```

The SDK thinks it's talking to OpenAI. It isn't. That's the point.
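Because the server speaks plain HTTP, you don't strictly need an SDK at all. As a minimal sketch, here is the same kind of request built with the standard library. The helper names are made up for illustration, and the payload follows the OpenAI chat completions shape.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, messages: list[dict],
                       temperature: float = 0.3) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With the server running, `send_chat(build_chat_request("http://localhost:1234/v1", "qwen2.5-7b-instruct", [{"role": "user", "content": "Hello"}]))` returns the reply as a string. This is handy in scripts where adding a dependency isn't worth it.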

Step 5 - Use It with Your Agent Framework

For LangChain, CrewAI, or custom agent loops, the same swap works:

```python
# CrewAI example
from crewai import Agent, Task, Crew, LLM

# The "openai/" prefix selects the OpenAI-compatible protocol in CrewAI's LLM wrapper
llm = LLM(
    model="openai/qwen2.5-7b-instruct",
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # dummy value
)

reviewer = Agent(
    role="Code Reviewer",
    goal="Find bugs and performance issues",
    backstory="Senior engineer, very direct",
    llm=llm,
)
```
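The same settings carry over to LangChain. A minimal sketch, assuming the `langchain-openai` package is installed. The helper names are hypothetical, and the model name should match whatever you loaded in LM Studio.

```python
def local_llm_kwargs(model: str = "qwen2.5-7b-instruct",
                     base_url: str = "http://localhost:1234/v1") -> dict:
    """Connection settings shared by any OpenAI-compatible client."""
    return {
        "model": model,
        "base_url": base_url,
        "api_key": "lm-studio",  # dummy value, required by the client
        "temperature": 0.3,
    }


def make_langchain_llm():
    """Build a LangChain chat model that talks to the local server."""
    # Imported lazily so the settings helper works without LangChain installed.
    from langchain_openai import ChatOpenAI
    return ChatOpenAI(**local_llm_kwargs())
```

Calling `make_langchain_llm().invoke("Review this function")` returns a message whose `.content` holds the reply. Keeping the settings in one helper means every framework in your project points at the same endpoint.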

No API keys. No network. No bills.

Performance Tips

  • GPU offloading: Move the slider all the way right in the LM Studio server tab. CPU inference on a 7B model is typically several times slower (5-10x is common, depending on hardware).
  • Quantization: GGUF format models (what LM Studio downloads) are already quantized. Smaller quantizations (Q4_K_M, Q5_K_S) save VRAM at acceptable quality loss.
  • Batch size: Raising the evaluation batch size in the model's load settings can speed up prompt processing on long inputs, at the cost of some extra memory.
  • Context length: 8192 is a sweet spot. Going higher costs VRAM fast.
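The context-length cost is easy to put numbers on. Each token stores a key and a value vector per layer, so cache memory grows linearly with context. The sketch below assumes a 16-bit cache and a Llama-3.1-8B-style shape (32 layers, 8 KV heads, head size 128). Other models and cache precisions will differ.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Memory for keys and values across all layers at full context."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Assumed shape for a Llama-3.1-8B-style model; adjust for the model you run.
for ctx in (8192, 32768):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"context {ctx}: {gib:.1f} GiB of KV cache")
```

Under these assumptions, 8192 tokens cost 1 GiB on top of the weights and 32768 tokens cost 4 GiB, which is most of the headroom on an 8GB card.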

When to Use This vs. Cloud APIs

Local hosting makes sense for:

  • Development and testing - iterate fast, no API cost
  • Privacy-sensitive code - code, customer data, internal docs never leave your machine
  • High-volume, low-stakes tasks - bulk processing, batch reviews, data transformation

Stick with cloud APIs (OpenAI, Anthropic, etc.) when you need:

  • The absolute best model quality for your use case
  • Elastic scaling with no hardware constraints
  • Built-in safety/content filtering at scale
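Since both options speak the same API, switching between them can be a one-line configuration change. A sketch, assuming a hypothetical `LLM_BACKEND` environment variable of your own choosing (it is not something LM Studio or the OpenAI SDK defines):

```python
import os

LOCAL_BASE_URL = "http://localhost:1234/v1"


def client_settings(env=None) -> dict:
    """Pick connection settings from the environment (LLM_BACKEND is hypothetical)."""
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "local") == "cloud":
        # Omitting base_url lets the OpenAI SDK use its default endpoint.
        return {"api_key": env["OPENAI_API_KEY"]}
    return {"base_url": LOCAL_BASE_URL, "api_key": "lm-studio"}
```

Pass the result straight to the client with `OpenAI(**client_settings())`. Model names differ between backends, so the model identifier needs to switch along with the endpoint.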

Wrapping Up

LM Studio turns a downloaded model into a local API endpoint in under two minutes. For anyone building AI-augmented tools, running local-first prototypes, or just tired of watching API credits evaporate during development, this workflow pays off immediately.

No servers. No external calls. Just a model running on your own hardware, as private and fast as your GPU allows.
