Self-Hosted LLM Inference with vLLM + AWQ in 15 Minutes: Cut GPU Cost Without Hurting Throughput

AWQ 4-bit quantization shrinks a 70B model from ~140GB to ~40GB, enough to fit on a single 80GB GPU with negligible quality loss. vLLM's Marlin kernel serves AWQ weights at near-FP16 throughput. Here is the production-ready setup with a working OpenAI-compatible endpoint, a load test, and the three flags that decide whether it actually runs.

Self-hosted inference is no longer a research project. If you are paying per-token for a 70B model and you own a single A100-80GB or two 24GB RTX 4090s, you can serve it for the cost of power once the hardware is paid off. The trick is AWQ 4-bit quantization paired with vLLM: a 70B model on one 80GB card (or split tightly across two 24GB cards with a short context window) at ~80% of FP16 throughput, with quality loss under 1% on MMLU, GSM8K, and HumanEval. Fifteen minutes from zero to a working OpenAI-compatible endpoint.

Why AWQ, Not GGUF, Not GPTQ

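GPTQ is the older format; it also has a Marlin path in vLLM, so the speed gap is smaller than it used to be. GGUF is the llama.cpp format: great for CPU and Apple Silicon, mediocre on NVIDIA. AWQ (activation-aware weight quantization) stores weights in INT4 with per-group FP16 scales, and uses activation statistics from a calibration pass to protect the small fraction of weight channels that matter most. Activations stay in FP16 at inference time, which is what lets vLLM run the weights through the Marlin kernel and hit near-FP16 tokens-per-second. For NVIDIA on Ampere or newer, AWQ is a sensible default.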
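To make "INT4 weights" concrete, here is a toy sketch of group-wise 4-bit round-to-nearest quantization in plain Python. It is not AWQ itself: real AWQ also rescales salient channels using calibration activations before this step and packs the integers into GPU-friendly layouts. The function names, the float offset, and the group of 8 values are illustrative assumptions (published AWQ checkpoints typically use groups of 128).

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one weight group to `bits` bits."""
    levels = 2 ** bits - 1                      # 15 steps for 4-bit codes 0..15
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0           # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo                     # codes are 4-bit; scale and lo stay float


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate weights from integer codes."""
    return [lo + c * scale for c in codes]


group = [0.12, -0.50, 0.33, 0.90, -0.07, 0.00, 0.61, -0.88]  # made-up weights
codes, scale, lo = quantize_group(group)
restored = dequantize_group(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(codes)
print(f"scale={scale:.4f}  max reconstruction error={max_err:.4f}")
```

Each weight lands within half a quantization step of its original value, which is why per-group scales matter: a smaller group means a tighter range and a smaller step.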

A 70B in FP16 is ~140GB, which takes at least two A100-80GBs before any KV cache is allocated. The same model in AWQ is ~40GB and fits on a single A100-80GB.
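The arithmetic behind those figures is short enough to check by hand. The sketch below assumes a group size of 128 with one FP16 scale and one 4-bit zero point per group, which is a common AWQ layout rather than a guarantee for every checkpoint.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Assumed AWQ layout: 4-bit codes plus a 16-bit scale and 4-bit zero per 128 weights.
AWQ_BITS = 4 + (16 + 4) / 128

fp16_gb = weight_gb(70, 16)
awq_gb = weight_gb(70, AWQ_BITS)
print(f"FP16: {fp16_gb:.0f} GB, AWQ 4-bit: {awq_gb:.1f} GB")
```

The quantized estimate comes out a few GB under the ~40GB quoted above; the difference is plausibly the layers that AWQ tooling usually leaves in FP16, such as embeddings and the output head.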

The Setup

Step 1: Install vLLM. The pip wheel bundles its CUDA kernels, so no separate CUDA toolkit is required. You still need a working NVIDIA driver, which nvidia-smi confirms.

```bash
pip install vllm
nvidia-smi
```

Step 2: Pick an AWQ-quantized model. For 70B-class on one 80GB card: Qwen/Qwen2.5-72B-Instruct-AWQ. Smaller starter for a 24GB card: any -AWQ Llama-3.1-8B repo.
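If a repo name leaves any doubt, the checkpoint's config.json usually settles it: Hugging Face quantized checkpoints record their method under a `quantization_config` key. A minimal sketch, where the sample config is a made-up, abbreviated stand-in for a downloaded file:

```python
import json
from typing import Optional


def quant_method(config: dict) -> Optional[str]:
    """Return the quantization method named in a model config, if any."""
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method")


# Abbreviated, hypothetical config.json contents for illustration only.
sample = json.loads(
    '{"model_type": "qwen2",'
    ' "quantization_config": {"quant_method": "awq", "bits": 4, "group_size": 128}}'
)
print(quant_method(sample))            # awq
print(quant_method({"model_type": "llama"}))  # None
```

In practice you would load the real file with `json.load(open("config.json"))` after downloading the repo.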

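Step 3: Start the server. vLLM auto-detects AWQ from the model config and routes to the Marlin kernel on supported GPUs, so leave --quantization unset. Passing --quantization awq explicitly pins the older, slower AWQ kernel; use awq_marlin if you want to be explicit.

```bash
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --port 8000
```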

That is the whole deploy. The server is OpenAI-compatible at http://localhost:8000/v1.
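A 72B checkpoint takes a while to download and load, so scripts that start the server usually need to wait before sending traffic. A minimal readiness poll using only the standard library might look like this; the function name, timeouts, and polling interval are illustrative choices, and it relies on the `/v1/models` route that the OpenAI-compatible server exposes.

```python
import time
import urllib.request


def wait_for_server(base_url="http://localhost:8000", timeout_s=600.0, interval_s=5.0):
    """Poll GET /v1/models until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # connection refused, HTTP error, or timeout: not ready yet
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval_s, remaining))


# Example: block for up to ten minutes, then proceed or bail out.
# ready = wait_for_server("http://localhost:8000")
```

The call returns `False` rather than raising, so a deploy script can decide whether to retry or dump the startup log.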

Pointing Your App at It

Any OpenAI client works unchanged. Only the base URL differs.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",
    messages=[{"role": "user", "content": "Explain PagedAttention in 3 sentences."}],
)
print(resp.choices[0].message.content)
```

Same call signature as api.openai.com. Drop-in for any existing app.

The Three Flags That Matter

**--gpu-memory-utilization 0.92.** The fraction of GPU memory vLLM may claim for weights plus KV cache (default 0.9). Bump it on a dedicated box; drop to 0.85 if you share the GPU.

**--max-model-len.** Set this explicitly. vLLM otherwise uses the model's full context window from its config (often 32K-128K) and refuses to start if the KV cache left over after loading weights cannot hold a sequence that long. Match it to your actual longest request plus a buffer.
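To see how context length translates into memory, here is a rough KV cache estimate. The layer count, KV head count, and head dimension are assumed values for Qwen2.5-72B (80 layers, 8 KV heads, head dimension 128); check them against the model's config.json before relying on the numbers.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies (factor 2 = keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Qwen2.5-72B shape; FP16 cache entries (2 bytes each).
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

for context_len in (8192, 32768, 131072):
    gib = per_token * context_len / 2**30
    print(f"{context_len:>6} tokens -> {gib:.1f} GiB for one full-length sequence")
```

Under these assumptions a single 8K sequence needs 2.5 GiB while a 128K sequence needs 40 GiB, which is the gap between fitting comfortably next to the weights and not fitting at all.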

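**--enable-prefix-caching.** Can cut latency 30-50% on multi-turn chat and RAG at no quality cost, depending on how much of each prompt repeats. It caches KV for repeated prompt prefixes, so turn two does not re-prefill the system prompt. On by default with the V1 engine in recent vLLM releases; pass it explicitly on older versions.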
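The idea behind prefix caching can be shown with a toy model: split the prompt into fixed-size blocks, hash each block together with everything before it, and reuse any leading blocks already seen. This is a simplified illustration with hypothetical names and fake token IDs, not vLLM's actual implementation.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (illustrative value)


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block chained with the hash of the blocks before it."""
    hashes, prev = [], b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


def cached_prefix_tokens(cache, token_ids, block_size=BLOCK_SIZE):
    """Count leading tokens whose blocks are already present in the cache."""
    hits = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break
        hits += 1
    return hits * block_size


system = list(range(100))              # stand-in for a 100-token system prompt
turn1 = system + [1001, 1002, 1003]    # first user message
turn2 = system + [2001, 2002]          # a different second message
cache = set(block_hashes(turn1))
print(cached_prefix_tokens(cache, turn2))  # 96
```

Of the 100 shared system-prompt tokens, 96 fall in complete blocks and skip prefill on the second request; only the trailing partial block and the new message are recomputed.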

Load Test Before You Trust It

```bash
vllm bench serve --model Qwen/Qwen2.5-72B-Instruct-AWQ \
  --num-prompts 200 --request-rate 8 --base-url http://localhost:8000
```

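A single A100-80GB serving 72B-AWQ sustains ~25-30 tokens/second/user at 8 concurrent requests before queueing starts. Note that --request-rate sets arrivals per second, not a concurrency cap; add --max-concurrency 8 to pin the number of in-flight requests. On older releases without the vllm bench subcommand, the same benchmark lives in benchmarks/benchmark_serving.py in the vLLM repository.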
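Per-user speed only becomes a cost figure once it is turned into aggregate throughput. A back-of-the-envelope conversion, using the 25-30 tokens/second/user range above and an assumed 30-day month at full utilization:

```python
def aggregate_tokens_per_s(per_user_tps, concurrent_users):
    """Total decode throughput across all in-flight requests."""
    return per_user_tps * concurrent_users


def tokens_per_month(tokens_per_s, utilization=1.0, days=30):
    """Tokens generated in a month at a given average utilization."""
    return tokens_per_s * utilization * days * 24 * 3600


low = aggregate_tokens_per_s(25, 8)
high = aggregate_tokens_per_s(30, 8)
print(f"aggregate: {low}-{high} tokens/s")
print(f"monthly at 100% load: {tokens_per_month(low):,.0f}-{tokens_per_month(high):,.0f} tokens")
```

Multiply the monthly figure by your current per-token price, scaled down by a realistic utilization, to compare against what the same traffic costs on a hosted API.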

The Gotchas

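Do not load an AWQ model with --quantization gptq. vLLM checks the flag against the quantization method recorded in the model config and refuses to start on a mismatch. If startup fails with a quantization error, drop the flag and let vLLM auto-detect.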

Do not skip nvidia-smi after vllm serve. Process up but GPU memory empty means the model has not loaded: weights may still be downloading, or startup failed partway. The last 30 lines of the startup log are the truth.
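That check is easy to automate. The sketch below parses the CSV output of nvidia-smi's query mode and flags GPUs that look empty; the 50% threshold and the sample reading are arbitrary illustrations, and the helper names are hypothetical.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def read_gpu_memory_csv():
    """Run nvidia-smi and return its raw CSV output (requires an NVIDIA host)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def parse_gpu_memory(csv_text):
    """Return [(used_mib, total_mib), ...], one tuple per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def model_looks_loaded(csv_text, min_used_fraction=0.5):
    """True if every GPU reports at least min_used_fraction of memory in use."""
    rows = parse_gpu_memory(csv_text)
    return bool(rows) and all(used / total >= min_used_fraction for used, total in rows)


sample = "74512, 81920\n"  # made-up reading from one 80GB card
print(parse_gpu_memory(sample), model_looks_loaded(sample))
```

On a real host, pass `read_gpu_memory_csv()` instead of the sample string. Because vLLM preallocates memory up to --gpu-memory-utilization, a loaded server should sit well above the threshold.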

Do not run vLLM in Docker without --gpus all and --ipc=host (or a generous --shm-size). Tensor parallel workers communicate through shared memory.

The Take

AWQ 4-bit plus vLLM's Marlin kernel is the cheapest credible path to self-hosted 70B-class inference on NVIDIA hardware you can buy or rent yourself. One command to serve, OpenAI-compatible endpoint, near-FP16 throughput, under 1% quality loss. If you are burning a few thousand a month on inference, the hardware pays for itself within months.

— Mr. Technology

*Prerequisites: NVIDIA GPU (Ampere+) with 24GB+ VRAM for 8B-class models or roughly 48GB+ for 70B-class, CUDA 12.1+, Python 3.10+. Tested with vLLM 0.7+ and Qwen/Qwen2.5-72B-Instruct-AWQ on a single A100-80GB. AWQ repos are tagged -AWQ on Hugging Face. vLLM also supports GPTQ, BitsAndBytes, and FP8; AWQ is a sensible default for Ampere/Ada/Blackwell.*