
Self-hosted inference is no longer a research project. If you are paying per token for a 70B model and you own a single A100-80GB or two RTX 4090s, you can serve it yourself for the cost of electricity once the hardware is paid off. The trick is AWQ 4-bit quantization paired with vLLM: a 70B-class model on one 80GB card (or, with tensor parallelism and a short context, split across two 24GB cards) at ~80% of FP16 throughput, with quality loss under 1% on MMLU, GSM8K, and HumanEval. Fifteen minutes from zero to a working OpenAI-compatible endpoint.
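GPTQ is the older format and is often slower on modern GPUs. GGUF is the llama.cpp format: great for CPU and Apple Silicon, mediocre on NVIDIA. AWQ (activation-aware weight quantization) stores weights in INT4 and uses activation statistics gathered during calibration to protect the weight channels that matter most; activations themselves stay in FP16 at inference time. vLLM can run AWQ checkpoints through the Marlin kernel, which is what gets near-FP16 tokens per second. For NVIDIA on Ampere or newer, AWQ is the default choice.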
A 70B model in FP16 is ~140GB of weights and needs at least two A100-80GBs. The same model in AWQ is ~40GB and fits on a single A100-80GB with room left for the KV cache.
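The arithmetic behind those numbers is short enough to check yourself. This is a rough sketch, not a measurement: it counts weights only, and the 4.16 bits-per-weight figure is an assumption that adds per-group scale and zero-point overhead for a group size of 128. Real checkpoints land a little higher because some layers (embeddings, output head) usually stay in FP16.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    # params_billion * 1e9 weights * (bits / 8) bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


fp16_gb = weight_gb(70, 16)    # plain FP16 weights
awq_gb = weight_gb(70, 4.16)   # assumed: 4-bit weights plus group-wise scale overhead

print(f"FP16: {fp16_gb:.0f} GB, AWQ 4-bit: {awq_gb:.1f} GB")
```

KV cache and activation memory come on top of this, which is why the card needs headroom beyond the weight size.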
Step 1: Install vLLM. It ships with bundled CUDA kernels; no separate toolkit required.
```bash
pip install vllm
nvidia-smi
```
Step 2: Pick an AWQ-quantized model. For 70B-class on one 80GB card: Qwen/Qwen2.5-72B-Instruct-AWQ. Smaller starter for a 24GB card: any -AWQ Llama-3.1-8B repo.
Step 3: Start the server. vLLM auto-detects AWQ from the model config and routes to the Marlin kernel when the GPU supports it. Leave --quantization unset: passing --quantization awq explicitly forces the slower non-Marlin AWQ kernel.
```bash
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --port 8000
```
That is the whole deploy. The server is OpenAI-compatible at http://localhost:8000/v1.
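Loading tens of gigabytes of weights takes a while, so anything that launches the server from a script has to wait before sending traffic. Here is a minimal sketch of a readiness poll, assuming the server's `/health` route on the port used above. The function names, retry count, and delay are illustrative choices, not vLLM defaults.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 60,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until probe succeeds or attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay_s)
    return False


# Example call (needs a running server):
#   wait_until_ready("http://localhost:8000/health")
```

The probe and sleep functions are injectable so the loop can be exercised without a live server.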
Any OpenAI client works unchanged. Only the base URL differs.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",
    messages=[{"role": "user", "content": "Explain PagedAttention in 3 sentences."}],
)
print(resp.choices[0].message.content)
```
Same call signature as api.openai.com. Drop-in for any existing app.
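If you would rather not pull in the client library, the same call is a plain JSON POST. This is a sketch using only the standard library, assuming the `/v1/chat/completions` route and the response shape of the OpenAI chat API. `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request an OpenAI-style chat client would send."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the first choice's text (needs a live server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example call (needs a running server):
#   req = build_chat_request("http://localhost:8000/v1",
#                            "Qwen/Qwen2.5-72B-Instruct-AWQ", "Hello")
#   print(send(req))
```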
**--gpu-memory-utilization 0.92.** The fraction of GPU memory vLLM may claim for weights, activations, and KV cache (default 0.9). Raise it on a dedicated box; drop to 0.85 if you share the GPU.
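**--max-model-len.** Set this explicitly. vLLM otherwise defaults to the model's full context window (often 32K-128K) and refuses to start if the KV cache cannot hold even one sequence of that length, which is easy to hit once a 70B-class model has taken most of the card. Match it to your actual longest request plus a buffer.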
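To see why context length matters so much, estimate the KV cache a single sequence needs. This is a back-of-envelope sketch. The layer count, KV-head count, and head dimension below are assumptions for a Qwen2.5-72B-style model with grouped-query attention, so check them against the checkpoint's config.json before relying on the numbers.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one key vector and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(tokens: int, **model_cfg: int) -> float:
    """KV cache size in GiB for a given number of cached tokens."""
    return tokens * kv_cache_bytes_per_token(**model_cfg) / 2**30


# Assumed architecture values; verify against the model's config.json.
ASSUMED_72B_CFG = dict(num_layers=80, num_kv_heads=8, head_dim=128)

for context_len in (8_192, 32_768, 131_072):
    print(f"{context_len:>7} tokens -> {kv_cache_gib(context_len, **ASSUMED_72B_CFG):.1f} GiB")
```

Under these assumptions, one full 128K-token sequence needs about as much memory as the quantized weights themselves, while 8K tokens needs a small fraction of that.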
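**--enable-prefix-caching.** A cheap latency win on multi-turn chat and RAG, often in the 30-50% range when prompts share a long prefix. It caches KV blocks for repeated prompt prefixes, so turn two does not re-prefill the system prompt. Recent vLLM releases running the V1 engine enable it by default; on older releases, pass the flag explicitly.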
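Prefix caching works at block granularity, so only whole leading blocks that match exactly are reused. The toy function below illustrates that idea on lists of token IDs. It is not vLLM's implementation, and the block size of 16 is an assumption.

```python
from typing import Sequence


def shared_prefix_blocks(
    prev_tokens: Sequence[int], new_tokens: Sequence[int], block_size: int = 16
) -> int:
    """Count the leading full blocks that are identical in both sequences."""
    blocks = 0
    limit = min(len(prev_tokens), len(new_tokens))
    # Walk block by block; stop at the first block that differs.
    for start in range(0, limit - block_size + 1, block_size):
        end = start + block_size
        if list(prev_tokens[start:end]) != list(new_tokens[start:end]):
            break
        blocks += 1
    return blocks


system_prompt = list(range(40))            # 40 tokens shared by every request
turn_one = system_prompt + [100, 101]
turn_two = system_prompt + [200, 201, 202]

reused = shared_prefix_blocks(turn_one, turn_two)
print(f"{reused} blocks ({reused * 16} tokens) can skip prefill")
```

The practical takeaway: keep the system prompt and any shared context byte-identical and at the front of the prompt, because a single differing token early on invalidates every block after it.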
```bash
vllm bench serve --model Qwen/Qwen2.5-72B-Instruct-AWQ \
  --num-prompts 200 --request-rate 8 --base-url http://localhost:8000
```
A single A100-80GB serving 72B-AWQ sustains ~25-30 tokens/second/user at 8 concurrent requests before queueing starts.
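Per-user speed is only half the picture; what you compare against an API bill is aggregate output. A small sketch converts the per-user figure quoted above into total tokens. The utilization parameter is a hypothetical knob for how busy the box really is.

```python
def aggregate_tokens_per_second(per_user_tps: float, concurrent_users: int) -> float:
    """Total output tokens per second across all concurrent requests."""
    return per_user_tps * concurrent_users


def tokens_per_day(aggregate_tps: float, utilization: float = 1.0) -> float:
    """Output tokens per day at a given average utilization (0.0 to 1.0)."""
    seconds_per_day = 86_400
    return aggregate_tps * utilization * seconds_per_day


low = aggregate_tokens_per_second(25, 8)
high = aggregate_tokens_per_second(30, 8)
print(f"aggregate: {low:.0f}-{high:.0f} tokens/s")
print(f"per day at full load: {tokens_per_day(low) / 1e6:.1f}M-{tokens_per_day(high) / 1e6:.1f}M tokens")
```

Real traffic is rarely flat, so plug in your own utilization before comparing this to what you pay per token today.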
Do not pass --quantization gptq for an AWQ model. The flag has to match the quantization method recorded in the checkpoint's config; on a mismatch vLLM refuses to load the model rather than converting it.
Do not skip nvidia-smi after vllm serve. If the process is up but GPU memory is empty, the model did not load, for example because of a tokenizer mismatch or an out-of-memory error during startup. The last 30 lines of the startup log are the truth.
Do not run vLLM in Docker without --gpus all and --ipc=host. Tensor parallel workers need the shared memory IPC.
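The nvidia-smi check mentioned above is easy to script. This sketch parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` and flags a GPU that looks empty. The 30,000 MiB threshold is an assumed value for a ~40GB model and should be adjusted for yours.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list:
    """Parse 'index, used, total' rows (MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "used_mib": int(used), "total_mib": int(total)})
    return rows


def looks_loaded(rows: list, min_used_mib: int = 30_000) -> bool:
    """True if total memory in use across GPUs is at least min_used_mib (assumed threshold)."""
    return sum(row["used_mib"] for row in rows) >= min_used_mib


def read_gpu_memory() -> list:
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Example call on the serving host:
#   print(looks_loaded(read_gpu_memory()))
```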
AWQ 4-bit plus vLLM's Marlin kernel is the cheapest credible path to self-hosted 70B-class inference on commodity NVIDIA hardware. One command to serve, OpenAI-compatible endpoint, near-FP16 throughput, under 1% quality loss. If you are burning a few thousand a month on inference, the hardware pays for itself within months.
— Mr. Technology
*Prerequisites: NVIDIA GPU (Ampere+) with 24GB+ VRAM for 8B-class models and 80GB for the 72B example, CUDA 12.1+, Python 3.10+. Tested with vLLM 0.7+ and Qwen/Qwen2.5-72B-Instruct-AWQ on a single A100-80GB. AWQ repos are tagged -AWQ on Hugging Face. vLLM also supports GPTQ, BitsAndBytes, and FP8; AWQ is the default for Ampere/Ada/Blackwell.*