
How to Self-Host an Open-Source LLM for Local Development

Skip the API bills and latency. Here's how to run a capable open-source LLM entirely on your own hardware using LM Studio — and integrate it into your agentic workflows.

Every time you ship a prompt to a third-party API, you're paying latency, burning credits, and handing your context to someone else's servers. For local development and prototyping, that's a tax you don't need to pay.

Self-hosting an LLM is easier now than it's ever been. LM Studio is the fastest path from zero to a running model on your own GPU (or even CPU). This guide gets you there in under 15 minutes.

What You'll Need

  • A machine with a decent GPU (6GB+ VRAM recommended for 7B models, 12GB+ for 13B+)
  • macOS, Windows, or Linux
  • ~10GB of free disk space for the model

CPU-only is an option if your model fits in memory — it's slower but functional for non-realtime use cases.

Step 1 - Install LM Studio

Download it from [lmstudio.ai](https://lmstudio.ai) (free for personal use). The installer handles CUDA/Metal/Vulkan setup automatically.

macOS / Linux - if you prefer the CLI

brew install --cask lm-studio   # requires Homebrew

Launch the app. You'll see a clean interface with a model search built in.

Step 2 - Download a Model

Use the search bar in LM Studio to find a model. Good starting points for local dev:

| Model | Size | VRAM | Best For |
| --- | --- | --- | --- |
| **Qwen2.5-7B-Instruct** | ~5GB | 6GB | Fast, capable, great value |
| **Mistral-7B-Instruct** | ~5GB | 6GB | Classic, well-optimized |
| **Llama-3.1-8B-Instruct** | ~5GB | 8GB | Strong general purpose |
| **Qwen2.5-14B-Instruct** | ~9GB | 12GB | Higher quality, still manageable |

Click **Download** and wait. Files land in `~/.cache/lmstudio/models/`.
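
If you want to confirm what's already on disk, a quick directory scan works (a minimal sketch in Python, assuming the default cache path above; adjust `models_dir` if you've moved your models folder):

from pathlib import Path

# Default LM Studio cache location mentioned above; adjust if you relocated it.
models_dir = Path.home() / ".cache" / "lmstudio" / "models"

for gguf in sorted(models_dir.rglob("*.gguf")):
    size_gb = gguf.stat().st_size / 1e9
    print(f"{gguf.relative_to(models_dir)}  ({size_gb:.1f} GB)")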

Step 3 - Run a Local Server

LM Studio includes a built-in **OpenAI-compatible API server**. Click the **Server** tab on the left sidebar.

  • Set a **port** (default: `1234`)
  • Set a **context length** (e.g., `8192`)
  • Load your model (GPU slider controls offload - move it right to use your VRAM)
  • Hit **Start Server**

You're now serving an OpenAI-compatible API at `http://localhost:1234/v1`.
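
Before touching application code, it's worth a quick sanity check that the server is up. This sketch queries the OpenAI-compatible `/v1/models` route and prints whatever the server reports (port and URL assume the defaults above):

import json
import urllib.request

# Ask the local server which models it exposes via the OpenAI-compatible API.
with urllib.request.urlopen("http://localhost:1234/v1/models") as resp:
    payload = json.load(resp)

for model in payload["data"]:
    print(model["id"])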

Step 4 - Point Your Code at It

Any OpenAI-compatible client works. Swap the base URL and pass the model identifier LM Studio shows for your loaded model:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",  # dummy value, required by the SDK
)

response = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[
        {"role": "system", "content": "You are a terse code reviewer."},
        {"role": "user", "content": """Explain why this Python is slow:

for i in range(len(data)):
    process(data[i])"""},
    ],
    temperature=0.3,
)

print(response.choices[0].message.content)

The SDK thinks it's talking to OpenAI. It isn't. That's the point.
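
Because the only differences are the base URL, key, and model name, you can flip between local and hosted inference with configuration alone. A minimal sketch, using hypothetical environment variables (`LLM_BASE_URL`, `LLM_MODEL`) that you define yourself:

import os
from openai import OpenAI

# Hypothetical env vars: set LLM_BASE_URL to the LM Studio endpoint for local
# runs, or leave it unset to fall back to the hosted OpenAI API.
client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL"),  # None means the default OpenAI endpoint
    api_key=os.environ.get("OPENAI_API_KEY", "lm-studio"),  # dummy is fine locally
)

response = client.chat.completions.create(
    model=os.environ.get("LLM_MODEL", "qwen2.5-7b-instruct"),
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(response.choices[0].message.content)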

Step 5 - Use It with Your Agent Framework

For LangChain, CrewAI, or custom agent loops, the same swap works:

CrewAI example

from crewai import Agent, Task, Crew, LLM

# CrewAI's LLM wrapper routes through LiteLLM; the "openai/" prefix selects
# the OpenAI-compatible protocol, pointed at the local server.
llm = LLM(
    model="openai/qwen2.5-7b-instruct",
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
)

reviewer = Agent(
    role="Code Reviewer",
    goal="Find bugs and performance issues",
    backstory="Senior engineer, very direct",
    llm=llm,
)
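
The LangChain version is the same swap (a minimal sketch, assuming the langchain-openai package and the model identifier used above):

LangChain example

from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI chat wrapper at the local server.
llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
    model="qwen2.5-7b-instruct",
    temperature=0.3,
)

print(llm.invoke("Summarize what a context window is in one sentence.").content)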

No API keys. No network. No bills.

Performance Tips

  • **GPU offloading**: Move the slider all the way right in the LM Studio server tab. CPU inference on a 7B model is 5-10x slower (see the throughput sketch after this list).
  • **Quantization**: GGUF format models (what LM Studio downloads) are already quantized. Smaller quantizations (Q4_K_M, Q5_K_S) save VRAM at acceptable quality loss.
  • **Batch size**: Increase the batch size in server settings if you're running many concurrent requests.
  • **Context length**: 8192 is a sweet spot. Going higher costs VRAM fast.
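
To see what offloading buys you on your hardware, a rough throughput check helps (a minimal sketch that streams one generation and counts streamed content chunks as a token proxy; model name and endpoint assume the earlier examples):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

start = time.perf_counter()
chunks = 0
# Stream a single generation and count content chunks as a rough token count.
stream = client.chat.completions.create(
    model="qwen2.5-7b-instruct",
    messages=[{"role": "user", "content": "Write 200 words about caching."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
elapsed = time.perf_counter() - start
print(f"~{chunks / elapsed:.1f} chunks/sec over {elapsed:.1f}s")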

When to Use This vs. Cloud APIs

Local hosting makes sense for:

  • **Development and testing** - iterate fast, no API cost
  • **Privacy-sensitive work** - code, customer data, internal docs never leave your machine
  • **High-volume, low-stakes tasks** - bulk processing, batch reviews, data transformation

Stick with cloud APIs (OpenAI, Anthropic, etc.) when you need:

  • The absolute best model quality for your use case
  • Elastic scaling with no hardware constraints
  • Built-in safety/content filtering at scale

Wrapping Up

LM Studio turns a downloaded model into a local API endpoint in under two minutes. For anyone building AI-augmented tools, running local-first prototypes, or just tired of watching API credits evaporate during development, this workflow pays off immediately.

No servers. No external calls. Just a model running on your own hardware, as private and fast as your GPU allows.
