← Back to Payloads
const tags: string[] = [2026-05-21

How to Set Up vLLM for Local Inference

Set up vLLM for local inference in under 30 minutes. Practical guide with real commands and benchmarks.
Quick Access
Install command
$ mrt install how-to-set-up-vllm-for-local-inference
Browse related skills
How to Set Up vLLM for Local Inference

vLLM is the fastest way to run open-source LLMs on your own hardware. If you're still suffering through slow inference times or paying cloud bills, this setup will change that. Here's how to get it running in under 30 minutes.

What You Need

Linux with NVIDIA GPU (VRAM 16GB+ recommended), CUDA 12.1+, Python 3.10+. I use Ubuntu 22.04 and a 4090. If you're on macOS, skip to the Ollama article — vLLM doesn't do Metal yet.

# Check your CUDA version

nvidia-smi | head -3

nvcc --version

Step 1: Create a Clean Environment

Don't install vLLM in your base Python. Use a virtual environment.

python3 -m venv vllm-env

source vllm-env/bin/activate

Step 2: Install vLLM

The official way is via pip. It downloads pre-built wheels for most CUDA versions.

pip install vllm

If you need the latest features (or the wheel isn't available for your CUDA version), build from source:

git clone https://github.com/vllm-project/vllm.git

cd vllm

pip install -e .

Build time: 20-40 minutes on a decent machine. Go make coffee.

Step 3: Pull a Model

Use HuggingFace. For a balance of speed and capability, Llama 3.1 8B is my default recommendation.

# Install HF transfer tool for large models

pip install huggingface_hub

Download the model

huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \

--local-dir ./models/llama-3.1-8b-instruct \

--token $HF_TOKEN

Set your token first: export HF_TOKEN=hf_your_token_here. Get one free at huggingface.co.

Step 4: Run Inference

Here's the payoff. A simple Python script:

from vllm import LLM, SamplingParams

llm = LLM(model="./models/llama-3.1-8b-instruct")

sampling_params = SamplingParams(

t temperature=0.7,

top_p=0.95,

max_tokens=512

)

outputs = llm.generate(["Explain why vLLM is faster than llama.cpp"], sampling_params)

print(outputs[0].outputs[0].text)

First load is slow (model loading to GPU). Subsequent generations are fast — we're talking 30-60 tokens/second on a 4090 depending on the model and batch size.

Step 5: OpenAI-Compatible Server

Want to use any OpenAI-compatible client? vLLM ships a server out of the box:

vllm serve ./models/llama-3.1-8b-instruct \

--host 0.0.0.0 \

--port 8000 \

--tensor-parallel-size 1

Now you can hit http://localhost:8000/v1/chat/completions with the same payloads you'd send to OpenAI. Switch your app's base URL and you're done. No code changes needed for most frameworks.

Benchmark Before You Trust

vLLM's throughput claim isn't marketing — PagedAttention genuinely reduces memory waste. But run your own benchmark to see the numbers for your specific model and batch size:

python -m vllm.benchmark.latency \

--model ./models/llama-3.1-8b-instruct \

--num-runs 100

Compare that to whatever you're currently using. The difference is usually 2-4x for long context batches.

Common Issues

CUDA out of memory: Reduce max_model_len or use a smaller model. 8B fits in 16GB VRAM with proper config.

Slow first generation: Normal. That's the profiling pass vLLM does on first run to optimize the attention kernels.

Model not found: You're likely hitting a gated model without setting your HF token. import os; os.environ["HF_TOKEN"] = "hf_..."

vLLM handles the hard parts. You bring the hardware. Get it running this week and you'll never go back to cloud-only inference.

Related Dispatches
Put this into production