Set up vLLM for local inference in under 30 minutes. Practical guide with real commands and benchmarks.

<p>vLLM is the fastest way to run open-source LLMs on your own hardware. If you're still suffering through slow inference times or paying cloud bills, this setup will change that. Here's how to get it running in under 30 minutes.</p>

<h2>What You Need</h2> <p>Linux with NVIDIA GPU (VRAM 16GB+ recommended), CUDA 12.1+, Python 3.10+. I use Ubuntu 22.04 and a 4090. If you're on macOS, skip to the Ollama article — vLLM doesn't do Metal yet.</p>

<pre><code># Check your CUDA version nvidia-smi | head -3 nvcc --version</code></pre>

<h2>Step 1: Create a Clean Environment</h2> <p>Don't install vLLM in your base Python. Use a virtual environment.</p>

<pre><code>python3 -m venv vllm-env source vllm-env/bin/activate</code></pre>

<h2>Step 2: Install vLLM</h2> <p>The official way is via pip. It downloads pre-built wheels for most CUDA versions.</p>

<pre><code>pip install vllm</code></pre>

<p>If you need the latest features (or the wheel isn't available for your CUDA version), build from source:</p>

<pre><code>git clone https://github.com/vllm-project/vllm.git cd vllm pip install -e .</code></pre>

<p>Build time: 20-40 minutes on a decent machine. Go make coffee.</p>

<h2>Step 3: Pull a Model</h2> <p>Use HuggingFace. For a balance of speed and capability, Llama 3.1 8B is my default recommendation.</p>

<pre><code># Install HF transfer tool for large models pip install huggingface_hub

Download the model

huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \ --local-dir ./models/llama-3.1-8b-instruct \ --token $HF_TOKEN</code></pre>

<p>Set your token first: <code>export HF_TOKEN=hf_your_token_here</code>. Get one free at huggingface.co.</p>

<h2>Step 4: Run Inference</h2> <p>Here's the payoff. A simple Python script:</p>

<pre><code>from vllm import LLM, SamplingParams

llm = LLM(model="./models/llama-3.1-8b-instruct") sampling_params = SamplingParams( t temperature=0.7, top_p=0.95, max_tokens=512 )

outputs = llm.generate(["Explain why vLLM is faster than llama.cpp"], sampling_params) print(outputs[0].outputs[0].text)</code></pre>

<p>First load is slow (model loading to GPU). Subsequent generations are <em>fast</em> — we're talking 30-60 tokens/second on a 4090 depending on the model and batch size.</p>

<h2>Step 5: OpenAI-Compatible Server</h2> <p>Want to use any OpenAI-compatible client? vLLM ships a server out of the box:</p>

<pre><code>vllm serve ./models/llama-3.1-8b-instruct \ --host 0.0.0.0 \ --port 8000 \ --tensor-parallel-size 1</code></pre>

<p>Now you can hit <code>http://localhost:8000/v1/chat/completions</code> with the same payloads you'd send to OpenAI. Switch your app's base URL and you're done. No code changes needed for most frameworks.</p>

<h2>Benchmark Before You Trust</h2> <p>vLLM's throughput claim isn't marketing — PagedAttention genuinely reduces memory waste. But run your own benchmark to see the numbers for your specific model and batch size:</p>

<pre><code>python -m vllm.benchmark.latency \ --model ./models/llama-3.1-8b-instruct \ --num-runs 100</code></pre>

<p>Compare that to whatever you're currently using. The difference is usually 2-4x for long context batches.</p>

<h2>Common Issues</h2> <p><strong>CUDA out of memory:</strong> Reduce <code>max_model_len</code> or use a smaller model. 8B fits in 16GB VRAM with proper config.</p> <p><strong>Slow first generation:</strong> Normal. That's the profiling pass vLLM does on first run to optimize the attention kernels.</p> <p><strong>Model not found:</strong> You're likely hitting a gated model without setting your HF token. <code>import os; os.environ["HF_TOKEN"] = "hf_..."</code></p>

<p>vLLM handles the hard parts. You bring the hardware. Get it running this week and you'll never go back to cloud-only inference.</p>

How to Set Up vLLM for Local Inference

Download the model