Production LLM Deployment Guide: Quantization, vLLM Serving & GPU Memory Optimization
Production LLM deployment requires systematic optimization across model quantization, serving infrastructure, and GPU memory management. This guide provides engineering-focused strategies for maximizing throughput while minimizing latency and costs.
Quantization Fundamentals
Quantization reduces model memory footprint by converting FP16/FP32 weights to lower precision formats. The choice between quantization methods directly impacts model quality and serving efficiency.
AWQ (Activation-Aware Weight Quantization)
- 4-bit quantization with minimal quality degradation
- Best for: Production deployments requiring quality preservation
- Memory reduction: About 4x smaller weights than FP16 (slightly less once scales and zero-points are stored)
- Quality impact: <2% perplexity increase on average
GPTQ (Generative Pre-trained Transformer Quantization)
- Post-training quantization for transformer models
- Best for: Models with stable architectures
- Supports: INT4 and INT8 precision modes
- Calibration: Requires representative dataset for optimal results
GGUF (GPT-Generated Unified Format)
- Optimized for local inference engines like llama.cpp
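- Quantization levels: Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S, Q6_K, Q8_0
- Memory calculation: fp16_size_gb * (quantization_bits / 16), or equivalently params_billion * quantization_bits / 8
- Example: a 70B model is about 140 GB at FP16, so a nominal 4-bit quantization needs about 140 * (4/16) = 35 GB for weights alone. K-quant formats such as Q4_K_M average somewhat more than 4 bits per weight, so real files are larger, and the KV cache comes on top of the weights.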
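The weight-memory arithmetic can be sketched in a few lines. This is a rough estimate under stated assumptions: the function name is hypothetical, 1 GB is taken as 1e9 bytes, and the 4.8 bits-per-weight figure is an illustrative allowance for K-quant overhead, not a measured value.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # 70B parameters at FP16, at a nominal 4 bits, and at an assumed 4.8 bits
    for label, bits in [("FP16", 16), ("nominal 4-bit", 4), ("assumed 4.8-bit", 4.8)]:
        print(f"{label}: {weight_memory_gb(70, bits):.1f} GB")
```

The result covers weights only. KV cache and runtime overhead still have to be added before choosing a GPU.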
vLLM Serving Architecture
vLLM delivers 2-24x throughput improvements over conventional serving frameworks through PagedAttention, continuous batching, and Flash Attention 2.
PagedAttention Mechanism
- Eliminates 60-80% memory waste from KV cache fragmentation
- Virtual memory approach: Treats GPU memory like OS virtual memory
- Non-contiguous storage: Enables efficient memory utilization
- Dynamic allocation: Pages KV cache on-demand
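The paging idea can be illustrated with a toy allocator. This is a conceptual sketch, not vLLM's implementation: the class and method names are hypothetical, and it tracks only block bookkeeping, not actual tensors.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Toy block-table allocator: each sequence maps to non-contiguous fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.num_tokens: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> int:
        """Reserve room for one more token and return the physical block it lands in."""
        table = self.block_tables.setdefault(seq_id, [])
        count = self.num_tokens.get(seq_id, 0)
        if count % self.block_size == 0:
            # The last block is full (or none exists yet): take a new block on demand.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real engine would preempt or swap")
            table.append(self.free_blocks.pop())
        self.num_tokens[seq_id] = count + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


if __name__ == "__main__":
    alloc = PagedKVAllocator(num_blocks=4, block_size=2)
    for _ in range(3):
        alloc.append_token("a")
    alloc.append_token("b")
    print(alloc.block_tables, "free:", alloc.free_blocks)
```

Because a block is reserved only when the previous one fills up, the waste per sequence is bounded by one partially filled block rather than a whole pre-reserved maximum-length buffer.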
Flash Attention 2
- Critical for v0.6.x+ performance: Requires compute capability 8.0+ (Ampere or newer GPUs: A100, A10, A30, A40, RTX 3090/4090)
- Memory bandwidth optimization: Reduces memory reads by 2-4x
- Automatic detection: Enabled by default on supported hardware
- Fallback: Gracefully degrades to standard attention on older GPUs
Core vLLM Configuration
# Production vLLM server launch (v0.6.x+)
# `vllm serve` takes the model as a positional argument, not via --model
# Pre-quantized AWQ checkpoints declare their method in the model config, so vLLM
# normally detects it; --quantization awq is optional and must match the checkpoint if given
vllm serve TheBloke/Llama-2-70B-AWQ \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --swap-space 16 \
    --disable-log-requests
OpenAI-Compatible API Endpoints
vLLM provides drop-in OpenAI-compatible endpoints for seamless integration:
| Endpoint | Purpose |
|---|---|
| /v1/chat/completions | Chat-style interactions with conversation history |
| /v1/completions | Legacy text completion |
| /v1/models | List available models |
| /v1/embeddings | Text embeddings (when enabled) |
# OpenAI SDK compatibility - no code changes required
from openai import OpenAI
client = OpenAI(base_url="http://vllm-server:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,  # Enable streaming for responsive UX
)
Streaming Response Support
Critical for production chat applications - streaming reduces perceived latency by returning tokens as generated:
# Streaming with OpenAI SDK
for chunk in client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
):
    # Some chunks carry no choices or an empty delta; skip those
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
# Raw SSE streaming via curl
curl -N http://vllm-server:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "TheBloke/Llama-2-70B-AWQ", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
Production benefits: Time-to-first-token remains identical, but users see progressive output instead of waiting for the complete response.
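For clients that consume the raw stream without the SDK, the parsing step can be sketched as below. It assumes the OpenAI-style framing of `data: <json>` lines ending with `data: [DONE]`. The function name and the sample lines are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style chat completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


if __name__ == "__main__":
    sample = [
        'data: {"choices": [{"delta": {"role": "assistant"}}]}',
        "",
        'data: {"choices": [{"delta": {"content": "Hel"}}]}',
        'data: {"choices": [{"delta": {"content": "lo"}}]}',
        "data: [DONE]",
    ]
    print("".join(iter_sse_content(sample)))
```

In practice the `lines` iterable would come from an HTTP client that yields the response body line by line.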
Continuous Batching
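- Dynamic batch formation: Admits new requests into the running batch at each decoding step instead of waiting for the whole batch to finish
- Latency optimization: Helps keep TTFT within an interactive target (for example <100ms) because requests do not queue behind a full batch
- Throughput scaling: Up to 23x improvement reported over non-batched serving, depending on workload
- Configuration: --max-num-batched-tokens controls the per-step token budget, and --max-num-seqs caps concurrent sequences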
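A toy simulation shows why refilling slots at every step helps. It is a simplified model, not vLLM's scheduler: each request is reduced to a positive number of decode steps, prefill cost is ignored, and both function names are hypothetical.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes before the next batch starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Free slots are refilled from the queue at every decode step."""
    waiting = deque(lengths)
    running: List[int] = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request emits a token; finished ones leave.
        running = [remaining - 1 for remaining in running if remaining - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    output_lengths = [2, 8, 3, 8]  # decode steps per request (illustrative)
    print("static:", static_batching_steps(output_lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(output_lengths, batch_size=2))
```

With these illustrative lengths, static batching takes 16 steps and continuous batching takes 13. The gap widens as output lengths become more uneven.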
Tensor Parallelism
- Model distribution across GPUs: --tensor-parallel-size N splits model weights across N GPUs
- Memory scaling: Enables larger models on multiple GPUs
- Communication overhead: Requires high-bandwidth interconnect (NVLink/NVSwitch)
- Configuration: Set based on GPU count and model size
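The weight-splitting idea can be shown with a column-parallel matrix multiply in plain Python. This is a conceptual sketch with hypothetical function names: each "rank" stands in for one GPU holding a column slice of the weight matrix, and the final concatenation stands in for the all-gather that makes a fast interconnect matter.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain matrix multiply: x is [m][k], w is [k][n]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def column_parallel_matmul(x: Matrix, w: Matrix, tp_size: int) -> Matrix:
    """Split w by columns across tp_size ranks, multiply locally, then concatenate."""
    n_cols = len(w[0])
    if n_cols % tp_size != 0:
        raise ValueError("output columns must divide evenly across ranks")
    shard = n_cols // tp_size
    partial_outputs = []
    for rank in range(tp_size):
        # Each rank stores only 1/tp_size of the weight columns.
        w_shard = [row[rank * shard:(rank + 1) * shard] for row in w]
        partial_outputs.append(matmul(x, w_shard))
    # All-gather step: stitch the per-rank column slices back together.
    return [
        [value for part in partial_outputs for value in part[i]]
        for i in range(len(x))
    ]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    print(column_parallel_matmul(x, w, tp_size=2))
```

Each rank holds a fraction of the weights, which is what lets a model that exceeds one GPU's memory be served across several.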
LoRA Adapter Support
vLLM supports multi-tenant LoRA serving for cost-efficient model customization:
# Enable LoRA adapter serving (model is a positional argument to vllm serve)
vllm serve TheBloke/Llama-2-70B-AWQ \
    --enable-lora \
    --lora-modules sentiment-analysis=/adapters/sentiment \
    code-assist=/adapters/code \
    --max-loras 4 \
    --max-lora-rank 64
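Multi-tenant scenario: A single base model serves multiple fine-tuned adapters, which can reduce GPU memory by 10-50x compared to deploying separate models. Clients select an adapter by passing its registered name (for example sentiment-analysis) as the model in the request. Adapters are loaded on demand, and --max-loras caps how many can be active in a single batch.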
GPU Memory Optimization
Memory management determines serving capacity and stability. Calculate requirements using accurate formulas.
Accurate Memory Calculation Formula
Total VRAM = Model_Weights + KV_Cache + overhead
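KV_Cache = 2 * num_layers * num_kv_heads * head_dim * dtype_size_bytes * max_seq_len * num_seqs
The factor of 2 accounts for separate K (Key) and V (Value) tensors stored for each layer. num_kv_heads equals the number of attention heads for standard multi-head attention, but is smaller for models that use grouped-query attention (GQA).
For an FP16 KV cache: 2 * layers * kv_heads * head_dim * 2 bytes per token
Example for Llama-2-70B (GQA):
- Layers: 80, Attention heads: 64, KV heads: 8, Head_dim: 128
- KV cache per token: 2 * 80 * 8 * 128 * 2 = 327,680 bytes = 320 KiB
- 4K context: 4,096 tokens × 320 KiB = 1.25 GiB per sequence, or about 40 GiB for 32 concurrent full-length sequences
- Using the 64 attention heads instead of the 8 KV heads would overestimate this by 8x (about 10 GiB per sequence)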
Critical: Underestimating KV cache by half leads to production OOM failures. Always validate calculations against actual memory profiling.
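The formula can be checked with a short calculator. This sketch assumes an FP16 KV cache and takes the head counts as inputs. The Llama-2-70B figures used here (80 layers, 8 KV heads under grouped-query attention, head dimension 128) come from the model's published configuration, and the function names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: K and V for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 seq_len: int, num_seqs: int = 1, dtype_bytes: int = 2) -> float:
    """Total KV cache in GiB for num_seqs sequences of seq_len tokens each."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes)
    return per_token * seq_len * num_seqs / 2**30


if __name__ == "__main__":
    # Llama-2-70B with its 8 KV heads
    print(kv_bytes_per_token(80, 8, 128))        # 327680 bytes per token
    print(kv_cache_gib(80, 8, 128, 4096))        # 1.25 GiB for one 4K sequence
    print(kv_cache_gib(80, 8, 128, 4096, 32))    # 40.0 GiB for 32 such sequences
    # Same model if all 64 attention heads were (incorrectly) counted as KV heads
    print(kv_cache_gib(80, 64, 128, 4096))       # 10.0 GiB
```

The per-sequence figure has to be multiplied by the number of concurrent sequences, which is usually what dominates the budget.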
Key Memory Parameters
# Optimal memory configuration
--gpu-memory-utilization 0.85 # Reserve 15% for system overhead
--max-model-len 4096 # Maximum sequence length
--max-num-seqs 32 # Concurrent sequences
--swap-space 16 # CPU swap in GB
--enable-chunked-prefill # Reduces memory spikes
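These parameters interact: the share of GPU memory granted by --gpu-memory-utilization, minus the weights, bounds how many full-length sequences fit, which in turn bounds a sensible --max-num-seqs. The sketch below is a back-of-the-envelope estimate, not vLLM's own profiling. The function name is hypothetical, and the 80 GiB GPU, 35 GiB of weights and 1.25 GiB of KV cache per 4K sequence are assumed inputs.

```python
import math


def max_full_length_seqs(gpu_mem_gib: float, gpu_memory_utilization: float,
                         weights_gib: float, kv_gib_per_seq: float) -> int:
    """Number of maximum-length sequences whose KV cache fits beside the weights."""
    budget = gpu_mem_gib * gpu_memory_utilization - weights_gib
    if budget <= 0:
        return 0  # the weights alone exceed the allowed memory
    return math.floor(budget / kv_gib_per_seq)


if __name__ == "__main__":
    # Assumed: 80 GiB GPU, ~35 GiB of 4-bit weights, 1.25 GiB KV cache per 4K sequence
    print(max_full_length_seqs(80, 0.85, 35, 1.25))
    # Same model on a 40 GiB GPU: no room is left for KV cache
    print(max_full_length_seqs(40, 0.85, 35, 1.25))
```

Real deployments fit more sequences than this when most requests are shorter than --max-model-len, and fewer once activation memory and CUDA graph overhead are counted, so treat the result as a starting point for profiling.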
KV Cache Management
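- Primary memory consumer: Often 60-80% of total VRAM usage at high concurrency or long context, depending on model size
- Dynamic sizing: Blocks are allocated as sequences grow, within the pool reserved by --gpu-memory-utilization
- Memory fragmentation: PagedAttention eliminates traditional waste
- Monitoring: Track kv_cache_usage_percentage in production (in vLLM 0.6.x the corresponding Prometheus gauge is vllm:gpu_cache_usage_perc)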
Memory Optimization Strategies
- Sequence Length Tuning
  - Shorter sequences: Enable more concurrent requests
  - Context window: Balance quality vs. capacity
  - Typical range: 2K-8K tokens for most applications
- Batch Size Optimization
  - Larger batches: Improve throughput but increase TTFT
  - Target TTFT: <100ms for interactive applications
  - Monitor: time_to_first_token_percentiles
- Model Parallelism
  - Tensor parallelism: Split model across multiple GPUs using --tensor-parallel-size
  - Pipeline parallelism: Stage-based model distribution
  - Selection criteria: Based on model size and GPU availability
Production Deployment
Docker Deployment
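FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
# The CUDA base image ships without Python, so install it before using pip
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
# Create non-root user for security
RUN useradd -m -s /bin/bash vllm && \
    mkdir -p /app /models && \
    chown -R vllm:vllm /app /models
WORKDIR /app
# Install vLLM 0.6.x+
RUN pip3 install --no-cache-dir vllm==0.6.3
# Copy and set permissions
COPY start-vllm.sh /app/start-vllm.sh
RUN chmod +x /app/start-vllm.sh && chown vllm:vllm /app/start-vllm.sh
USER vllm
CMD ["/app/start-vllm.sh"]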
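#!/bin/bash
# start-vllm.sh - Production entrypoint with graceful shutdown
set -euo pipefail
# Launch vLLM in the background so this shell stays alive to handle signals.
# (With `exec`, the shell is replaced by vLLM and a trap set here would never run.)
vllm serve "${MODEL_NAME:?MODEL_NAME must be set}" \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size "${TP_SIZE:-1}" \
    --gpu-memory-utilization 0.85 \
    --max-model-len 4096 \
    --enable-prefix-caching &
VLLM_PID=$!
# Graceful shutdown handler - prevents orphaned GPU processes
cleanup() {
    echo "Received shutdown signal, cleaning up..."
    # Forward SIGTERM to vLLM and wait for it to exit cleanly
    kill -TERM "${VLLM_PID}" 2>/dev/null || true
    wait "${VLLM_PID}" 2>/dev/null || true
    exit 0
}
trap cleanup SIGTERM SIGINT
# Block until vLLM exits; a trapped signal interrupts this wait and runs cleanup
wait "${VLLM_PID}"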
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      nodeSelector:
        # 70B AWQ weights (~35 GB) do not fit a 40 GB GPU at TP_SIZE=1.
        # Use the exact label value your cluster reports for its 80 GB cards.
        nvidia.com/gpu.product: A100-SXM4-80GB
      containers:
      - name: vllm
        image: vllm:latest
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
        env:
        - name: MODEL_NAME
          value: "TheBloke/Llama-2-70B-AWQ"
        - name: TP_SIZE
          value: "1"
        ports:
        - containerPort: 8000
        # Large models can take minutes to load; hold off liveness checks until startup succeeds
        startupProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 10
          failureThreshold: 90
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # CPU utilization is ineffective for GPU-bound inference
  # Use custom metrics via Prometheus Adapter for GPU-aware scaling
  # The metric names below are illustrative and must be defined by your adapter rules
  - type: Pods
    pods:
      metric:
        name: vllm_avg_prompt_tokens_per_request
      target:
        type: AverageValue
        averageValue: "500"
  - type: Pods
    pods:
      metric:
        name: vllm_gpu_memory_utilization
      target:
        type: AverageValue
        averageValue: "80"
---
# Alternative: Vertical Pod Autoscaler (adjusts CPU and memory requests only, not nvidia.com/gpu)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vllm-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  updatePolicy:
    updateMode: "Auto"
Note: HPA with CPU metrics is ineffective for GPU-bound inference. Deploy Prometheus Adapter and map vLLM's exported metrics to custom metric names (e.g., vllm_gpu_memory_utilization, vllm_avg_prompt_tokens_per_request), or use Vertical Pod Autoscaler to right-size CPU and host-memory requests. VPA does not manage GPU resources.
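To reason about how the HPA reacts to a custom metric, its documented scaling rule (desired = ceil(current replicas × current value / target value)) can be sketched as follows. The function name and sample values are hypothetical, and the sketch leaves out the controller's tolerance band and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Core HPA rule, clamped to the configured replica bounds."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 3 pods averaging 120 against a target of 80 -> scale out
    print(desired_replicas(3, 120, 80))
    # 3 pods averaging 40 against a target of 80 -> scale in
    print(desired_replicas(3, 40, 80))
```

Because each new vLLM replica must load the full model before serving, scale-out takes minutes rather than seconds, which favors conservative targets.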
Performance Monitoring
Key metrics for production inference:
- Throughput Metrics
  - Tokens per second (TPS): Target >500 TPS for 70B models
  - Requests per second (RPS): Capacity planning metric
  - Batch efficiency: actual_batch_size / max_batch_size
- Latency Metrics
  - Time to First Token (TTFT): p50 <100ms, p95 <500ms
  - Time between tokens: Target <50ms for generation
  - Total response time: End-to-end request completion
- Resource Metrics
  - GPU memory utilization: Maintain <90% to prevent OOM
  - KV cache hit rate: >80% indicates effective caching
  - CPU utilization: Monitor for offloading overhead
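The latency targets above can be checked against collected samples with a small percentile helper. This sketch uses the nearest-rank method. The function names and sample TTFT values are hypothetical, and the thresholds are the p50 <100ms and p95 <500ms targets from the list.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def ttft_slo_report(ttft_ms: List[float]) -> Dict[str, bool]:
    """Compare measured TTFT against the p50 < 100 ms and p95 < 500 ms targets."""
    return {
        "p50_ok": percentile(ttft_ms, 50) < 100,
        "p95_ok": percentile(ttft_ms, 95) < 500,
    }


if __name__ == "__main__":
    samples_ms = [40, 55, 60, 70, 80, 90, 95, 120, 300, 650]  # illustrative values
    print(percentile(samples_ms, 50), percentile(samples_ms, 95))
    print(ttft_slo_report(samples_ms))
```

A healthy median can hide a slow tail, as in this sample, so the two percentiles are worth alerting on separately.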
Health Checks and Reliability
vLLM health endpoint varies by version:
- v0.6.x and earlier: /health
- v0.6.5+ (OpenAI-compatible): /health or /v1/health
# Version-aware health check
import requests
def check_health(base_url: str) -> bool:
    # Try each known health path and succeed on the first HTTP 200
    for path in ["/health", "/v1/health"]:
        try:
            response = requests.get(f"{base_url}{path}", timeout=5)
            if response.status_code == 200:
                return True
        except requests.RequestException:
            continue
    return False
if check_health("http://vllm-server:8000"):
    print("Service healthy")
else:
    print("Service unhealthy")
Capacity Planning
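Calculate required GPU count from average load:
GPUs_needed = ceil(daily_tokens / (tokens_per_second_per_gpu * 86400))
where daily_tokens = daily_requests * avg_tokens_per_request. This sizes for evenly spread traffic, so add headroom for peak load.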
Example for 70B model:
- Target: 10M tokens/day
- Per GPU: 500 TPS = 43.2M tokens/day
- Required GPUs: 10M / 43.2M = 0.23 → 1 GPU sufficient
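The capacity estimate can be wrapped in a small helper. This is a sketch under stated assumptions: the function name is hypothetical, and `peak_factor` is an added illustrative knob for peak-to-average traffic that does not appear in the formula above.

```python
import math


def gpus_needed(daily_tokens: float, tokens_per_second_per_gpu: float,
                peak_factor: float = 1.0) -> int:
    """GPUs required to serve daily_tokens, optionally scaled for peak traffic."""
    per_gpu_daily = tokens_per_second_per_gpu * 86_400  # seconds per day
    return max(1, math.ceil(daily_tokens * peak_factor / per_gpu_daily))


if __name__ == "__main__":
    # 10M tokens/day at 500 TPS per GPU, evenly spread
    print(gpus_needed(10_000_000, 500))
    # Same volume if peak traffic is assumed to be 5x the daily average
    print(gpus_needed(10_000_000, 500, peak_factor=5))
```

Average-load sizing gives 1 GPU here, while the assumed 5x peak raises it to 2, which is why the peak ratio should be measured rather than guessed.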
Advanced Optimizations
Speculative Decoding
- Up to 2-3x speedup for predictable outputs, depending on how often draft tokens are accepted
- Configuration example (v0.6.x flags; the model is a positional argument):
vllm serve TheBloke/Llama-2-70B-AWQ \
    --speculative-model TheBloke/Llama-2-7B-AWQ \
    --num-speculative-tokens 5
- Draft model: Smaller model generates candidates
- Verification: Larger model validates in single pass
- Best for: Structured outputs, low temperature
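The draft-then-verify loop can be sketched for the greedy case. This is a conceptual illustration, not vLLM's implementation: the "models" are plain functions that map a token list to the next token, all names are hypothetical, and a real engine scores every drafted position in a single forward pass of the large model instead of calling it once per token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One greedy speculative step; returns the tokens accepted in this step."""
    # 1. The small draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(context)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)
    # 2. The target model checks each drafted position in order.
    accepted: List[int] = []
    ctx = list(context)
    for token in proposed:
        expected = target(ctx)
        if token != expected:
            accepted.append(expected)  # keep the target's correction and stop
            return accepted
        accepted.append(token)
        ctx.append(token)
    # 3. Every draft token matched: the target's pass also yields one extra token.
    accepted.append(target(ctx))
    return accepted


if __name__ == "__main__":
    def target_model(ctx: List[int]) -> int:
        return (ctx[-1] + 1) % 10            # toy rule: count upward

    def draft_model(ctx: List[int]) -> int:
        return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10  # wrong after a 3

    print(speculative_step(target_model, draft_model, [1], k=4))
```

The output always matches what the target model would have produced on its own. The speedup comes from accepting several tokens per target pass when the draft agrees, which is why low-temperature, structured outputs benefit most.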
Prefix Caching
- Large utilization improvement (400%+ reported) for standardized prompts
- Cache hits: Reuse computed KV cache for repeated prefixes
- Configuration: --enable-prefix-caching (v0.6.x+)
- Memory overhead: Minimal additional VRAM usage
- Hash algorithms: --prefix-caching-hash-algo sha256_cbor for reproducible caching (only in newer releases that provide this flag; check vllm serve --help)
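Prefix reuse can be illustrated with a toy block-hash cache. This is a conceptual sketch with hypothetical names, not vLLM's data structures: prompts are split into fixed-size token blocks, each block's hash is chained to everything before it, and a new prompt reuses cached blocks up to the first mismatch.

```python
import hashlib
from typing import List, Set


def block_hashes(tokens: List[int], block_size: int) -> List[str]:
    """Hash each full block together with the hash of all preceding blocks."""
    hashes: List[str] = []
    prev = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        data = prev + "|" + ",".join(map(str, block))
        prev = hashlib.sha256(data.encode()).hexdigest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    """Tracks which prefix blocks already have computed KV entries."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self.cached: Set[str] = set()

    def lookup_and_insert(self, tokens: List[int]) -> int:
        """Return how many leading tokens could reuse cached KV, then cache this prompt."""
        hashes = block_hashes(tokens, self.block_size)
        hits = 0
        for block_hash in hashes:
            if block_hash not in self.cached:
                break
            hits += 1
        self.cached.update(hashes)
        return hits * self.block_size


if __name__ == "__main__":
    system_prompt = list(range(8))  # shared 8-token prefix (illustrative)
    cache = PrefixCache(block_size=4)
    print(cache.lookup_and_insert(system_prompt + [100, 101, 102, 103]))
    print(cache.lookup_and_insert(system_prompt + [200, 201, 202, 203]))
```

The first request finds nothing cached. The second reuses the 8 shared prefix tokens and only needs prefill for its own suffix, which is where the benefit for standardized prompts comes from.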
Cross-Instance KV Cache Sharing
- 3-10x latency reduction for repetitive workloads
- Distributed cache: Share KV cache across serving instances
- Network overhead: Requires high-bandwidth interconnect
- Best for: Multi-instance deployments with repetitive context
Implementation Checklist
- Quantization Selection
  - Profile model quality vs. quantization level
  - Select AWQ for production quality requirements
  - Calculate memory requirements using accurate quantization formula (include factor of 2 for KV cache)
- vLLM Configuration
  - Use vLLM 0.6.x+ with the vllm serve <model> command
  - Omit --quantization flag for pre-quantized models (auto-detected)
  - Set gpu-memory-utilization to 0.85-0.9
  - Configure max-num-seqs based on memory capacity
  - Enable enable-chunked-prefill and enable-prefix-caching
- Production Deployment
  - Implement health checks using /health or /v1/health endpoint
  - Configure monitoring for TPS, TTFT, and memory metrics
  - Set up alerting for latency degradation and OOM conditions
  - Use custom metrics or VPA for Kubernetes autoscaling (not CPU-based HPA)
  - Run containers as non-root user with proper WORKDIR
  - Implement signal handling in entrypoint scripts for graceful shutdown
- Optimization
  - Enable speculative decoding for appropriate workloads
  - Configure prefix caching for standardized prompts
  - Implement cross-instance KV cache sharing if applicable
  - Verify Flash Attention 2 support on target GPUs (Ampere+)
  - Enable streaming for chat/completion endpoints
  - Configure LoRA adapters for multi-tenant scenarios
Cost optimization results vary by workload and infrastructure. Benchmark your specific deployment using throughput-per-dollar analysis: compare vLLM's continuous batching against traditional request-per-instance architectures. Key factors include request distribution, sequence length variance, and GPU memory bandwidth. Validate memory calculations against actual production metrics before capacity planning.
