vLLM is the fastest way to run open-source LLMs on your own hardware. If you're still suffering through slow inference times or paying cloud bills, this setup will change that. Here's how to get it running in under 30 minutes.
Linux with NVIDIA GPU (VRAM 16GB+ recommended), CUDA 12.1+, Python 3.10+. I use Ubuntu 22.04 and a 4090. If you're on macOS, skip to the Ollama article — vLLM doesn't do Metal yet.
# Check your CUDA versionnvidia-smi | head -3
nvcc --version
Don't install vLLM in your base Python. Use a virtual environment.
python3 -m venv vllm-envsource vllm-env/bin/activate
The official way is via pip. It downloads pre-built wheels for most CUDA versions.
pip install vllmIf you need the latest features (or the wheel isn't available for your CUDA version), build from source:
git clone https://github.com/vllm-project/vllm.gitcd vllm
pip install -e .
Build time: 20-40 minutes on a decent machine. Go make coffee.
Use HuggingFace. For a balance of speed and capability, Llama 3.1 8B is my default recommendation.
# Install HF transfer tool for large modelspip install huggingface_hub
Download the model
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
--local-dir ./models/llama-3.1-8b-instruct \
--token $HF_TOKEN
Set your token first: export HF_TOKEN=hf_your_token_here. Get one free at huggingface.co.
Here's the payoff. A simple Python script:
from vllm import LLM, SamplingParamsllm = LLM(model="./models/llama-3.1-8b-instruct")
sampling_params = SamplingParams(
t temperature=0.7,
top_p=0.95,
max_tokens=512
)
outputs = llm.generate(["Explain why vLLM is faster than llama.cpp"], sampling_params)
print(outputs[0].outputs[0].text)
First load is slow (model loading to GPU). Subsequent generations are fast — we're talking 30-60 tokens/second on a 4090 depending on the model and batch size.
Want to use any OpenAI-compatible client? vLLM ships a server out of the box:
vllm serve ./models/llama-3.1-8b-instruct \--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1
Now you can hit http://localhost:8000/v1/chat/completions with the same payloads you'd send to OpenAI. Switch your app's base URL and you're done. No code changes needed for most frameworks.
vLLM's throughput claim isn't marketing — PagedAttention genuinely reduces memory waste. But run your own benchmark to see the numbers for your specific model and batch size:
python -m vllm.benchmark.latency \--model ./models/llama-3.1-8b-instruct \
--num-runs 100
Compare that to whatever you're currently using. The difference is usually 2-4x for long context batches.
CUDA out of memory: Reduce max_model_len or use a smaller model. 8B fits in 16GB VRAM with proper config.
Slow first generation: Normal. That's the profiling pass vLLM does on first run to optimize the attention kernels.
Model not found: You're likely hitting a gated model without setting your HF token. import os; os.environ["HF_TOKEN"] = "hf_..."
vLLM handles the hard parts. You bring the hardware. Get it running this week and you'll never go back to cloud-only inference.