
Production LLM Deployment Guide: Quantization, vLLM Serving & GPU Memory Optimization

MatterAI

Local LLM Production Deployment: Quantization, vLLM Serving, and GPU Memory Optimization

Production LLM deployment requires systematic optimization across model quantization, serving infrastructure, and GPU memory management. This guide provides engineering-focused strategies for maximizing throughput while minimizing latency and costs.

Quantization Fundamentals

Quantization reduces model memory footprint by converting FP16/FP32 weights to lower precision formats. The choice between quantization methods directly impacts model quality and serving efficiency.

AWQ (Activation-Aware Weight Quantization)

  • 4-bit quantization with minimal quality degradation
  • Best for: Production deployments requiring quality preservation
  • Memory reduction: roughly 4x for weights compared to FP16
  • Quality impact: typically <2% perplexity increase; verify on your own evaluation set

GPTQ (Generative Pre-trained Transformer Quantization)

  • Post-training quantization for transformer models
  • Best for: Models with stable architectures
  • Supports: INT4 and INT8 precision modes
  • Calibration: Requires representative dataset for optimal results

GGUF (GPT-Generated Unified Format)

  • Optimized for local inference engines like llama.cpp
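  • Quantization levels: Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S, Q6_K, Q8_0
  • Memory calculation: params_in_billions * (bits_per_weight / 8) GB, equivalently fp16_size_gb * (bits_per_weight / 16)
  • Example: 70B model at 4 bits = 70 * (4/8) = 35GB for weights; Q4_K_M averages closer to 4.8 bits per weight, so expect roughly 41-42GB before KV cache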
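The weight-size arithmetic can be sketched in a few lines of Python. This is a rough estimate under stated assumptions: estimate_weight_gb is a hypothetical helper, bits_per_weight is the average across the whole file (k-quant formats mix precisions, so the effective value sits above the nominal bit width), and KV cache and runtime overhead are excluded.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    params_billion * 1e9 parameters * bits_per_weight / 8 bytes each,
    divided by 1e9 bytes per GB, simplifies to the expression below.
    """
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Nominal bit widths only; real quantized files carry extra scale metadata.
    for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
        print(f"70B at {label}: {estimate_weight_gb(70, bits):.1f} GB")
```

Treat the result as a lower bound for planning and compare it with the actual file size of the checkpoint you intend to serve.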

vLLM Serving Architecture

vLLM reports 2-24x throughput improvements over conventional serving frameworks, depending on the baseline and workload, through PagedAttention, continuous batching, and Flash Attention 2.

PagedAttention Mechanism

  • Recovers most of the 60-80% of KV cache memory that contiguous allocators waste through fragmentation and over-reservation
  • Virtual memory approach: Treats GPU memory like OS virtual memory
  • Non-contiguous storage: Enables efficient memory utilization
  • Dynamic allocation: Pages KV cache on-demand
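The paging idea can be illustrated with a toy allocator. This is a simplified sketch, not vLLM's implementation: BlockAllocator and its methods are hypothetical names, and the code only tracks which fixed-size blocks belong to which sequence, without storing any tensors.

```python
import math


class BlockAllocator:
    """Toy KV-cache block table: sequences own lists of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.tables = {}                      # seq_id -> [physical block ids]
        self.lengths = {}                     # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:          # no block yet, or last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=8, block_size=4)
    for _ in range(5):                        # two sequences grow in lockstep
        alloc.append_token("a")
        alloc.append_token("b")
    print(alloc.tables)                       # block ids are interleaved, not contiguous
    print("blocks per sequence:", math.ceil(5 / 4))
```

Because a sequence only ever holds blocks it has started to fill, the waste per sequence is bounded by one partially filled block.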

Flash Attention 2

  • Critical for v0.6.x+ performance: Requires compute capability 8.0+ (Ampere or newer GPUs: A100, A10, A30, A40, RTX 3090/4090)
  • Memory bandwidth optimization: Reduces memory reads by 2-4x
  • Automatic detection: Enabled by default on supported hardware
  • Fallback: Gracefully degrades to standard attention on older GPUs

Core vLLM Configuration

# Production vLLM server launch (v0.6.x+)
# The quantization method is read from the checkpoint config, so --quantization is optional
# for pre-quantized AWQ models; a value that conflicts with the checkpoint causes a load error
# Note: vllm serve takes the model as a positional argument
vllm serve TheBloke/Llama-2-70B-AWQ \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --swap-space 16 \
    --disable-log-requests

OpenAI-Compatible API Endpoints

vLLM provides drop-in OpenAI-compatible endpoints for seamless integration:

Endpoint               Purpose
/v1/chat/completions   Chat-style interactions with conversation history
/v1/completions        Legacy text completion
/v1/models             List available models
/v1/embeddings         Text embeddings (when enabled)

# OpenAI SDK compatibility - only base_url and api_key change
from openai import OpenAI

client = OpenAI(base_url="http://vllm-server:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

Streaming Response Support

Critical for production chat applications - streaming reduces perceived latency by returning tokens as generated:

# Streaming with OpenAI SDK
for chunk in client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
):
    # Some chunks carry no choices or an empty delta; skip them
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# Raw SSE streaming via curl
curl -N http://vllm-server:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "TheBloke/Llama-2-70B-AWQ", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'

Production benefits: Total generation time is unchanged, but users see output as soon as the first token arrives instead of waiting for the complete response.

Continuous Batching

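  • Dynamic batch formation: Adds and removes requests from the running batch at every decoding step, keeping the GPU busy
  • Latency optimization: New requests start without waiting for the current batch to finish, which helps hold a <100ms TTFT target
  • Throughput scaling: Up to 23x improvement over non-batched serving, depending on workload
  • Configuration: --max-num-batched-tokens caps tokens processed per scheduler step; --max-num-seqs caps concurrent sequences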
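A toy simulation helps show why refilling slots at every step raises utilization. The sketch below is an illustration under simplifying assumptions, not vLLM's scheduler: each request is reduced to the number of decode steps it needs, every step costs the same, and both function names are hypothetical.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """A freed slot is refilled from the queue before the next step."""
    queue = deque(lengths)
    running: List[int] = []               # remaining steps per active request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1                        # one decode step for every active request
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


if __name__ == "__main__":
    work = [2, 8, 3, 8]                   # decode steps required by four requests
    print("static:    ", static_batching_steps(work, batch_size=2))
    print("continuous:", continuous_batching_steps(work, batch_size=2))
```

The gap between the two numbers grows with the variance in output lengths, which is why mixed chat traffic tends to benefit the most.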

Tensor Parallelism

  • Model distribution across GPUs: --tensor-parallel-size N splits model weights
  • Memory scaling: Enables larger models on multiple GPUs
  • Communication overhead: Requires high-bandwidth interconnect (NVLink/NVSwitch)
  • Configuration: Set based on GPU count and model size

LoRA Adapter Support

vLLM supports multi-tenant LoRA serving for cost-efficient model customization:

# Enable LoRA adapter serving
vllm serve TheBloke/Llama-2-70B-AWQ \
    --enable-lora \
    --lora-modules sentiment-analysis=/adapters/sentiment \
                    code-assist=/adapters/code \
    --max-loras 4 \
    --max-lora-rank 64

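Multi-tenant scenario: A single base model serves multiple fine-tuned adapters. Because each adapter is a small fraction of the base model's size, this can reduce GPU memory by 10-50x compared to deploying a separate full model per tenant. Adapters are loaded on demand, and --max-loras caps how many can be active in a single batch.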
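On the client side, an adapter registered through --lora-modules is selected by passing its name in the model field of an ordinary request. The sketch below uses only the standard library; build_chat_request and send_chat_request are hypothetical helper names, and the server address is whatever your deployment exposes.

```python
import json
import urllib.request


def build_chat_request(adapter_name: str, prompt: str) -> dict:
    """Build an OpenAI-style chat payload that targets a LoRA adapter by name."""
    return {
        "model": adapter_name,  # name registered via --lora-modules
        "messages": [{"role": "user", "content": prompt}],
    }


def send_chat_request(base_url: str, payload: dict) -> dict:
    """POST the payload to /v1/chat/completions and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    payload = build_chat_request("sentiment-analysis", "Great product, would buy again")
    print(json.dumps(payload, indent=2))
    # send_chat_request("http://vllm-server:8000", payload)  # requires a running server
```

Routing tenants then becomes a matter of mapping each tenant to an adapter name, while requests that name the base model bypass the adapters entirely.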

GPU Memory Optimization

Memory management determines serving capacity and stability. Calculate requirements using accurate formulas.

Accurate Memory Calculation Formula

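Total VRAM = Model_Weights + KV_Cache + overhead

KV_Cache = 2 * num_layers * num_kv_heads * head_dim * dtype_size_bytes * total_cached_tokens

The factor of 2 accounts for the separate K (Key) and V (Value) tensors stored for each layer. num_kv_heads is the number of key/value heads, which is smaller than the attention head count for models that use grouped-query attention (GQA). total_cached_tokens is the sum of sequence lengths across all concurrent sequences.

For an FP16 KV cache: 2 * layers * kv_heads * head_dim * 2 bytes per token

Example for Llama-2-70B:

  • Layers: 80, Attention heads: 64, KV heads: 8 (GQA), Head_dim: 128
  • KV cache per token: 2 * 80 * 8 * 128 * 2 = 327,680 bytes = 320 KiB
  • 4K context: 4,096 tokens × 320 KiB = 1.25 GiB per sequence
  • 32 concurrent sequences at full 4K context: 32 × 1.25 GiB = 40 GiB

Critical: Omitting the factor of 2, or sizing for a single sequence instead of the full batch, underestimates the KV cache and can lead to production OOM failures or request preemption. Always validate calculations against actual memory profiling.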
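The KV cache arithmetic is easy to get wrong by hand, so a small calculator is a useful cross-check. This is a sketch with hypothetical helper names; the head count that matters is the number of key/value heads, which under grouped-query attention is smaller than the attention head count (Llama-2-70B uses 8 KV heads against 64 attention heads).

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one token: K and V for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(bytes_per_token: int, seq_len: int, num_seqs: int = 1) -> float:
    """KV cache in GiB for num_seqs sequences, each holding seq_len tokens."""
    return bytes_per_token * seq_len * num_seqs / 2**30


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
    print(f"per token:        {per_token:,} bytes")
    print(f"one 4K sequence:  {kv_cache_gib(per_token, 4096):.2f} GiB")
    print(f"32 x 4K sequences: {kv_cache_gib(per_token, 4096, 32):.2f} GiB")
    # Without GQA (64 KV heads) the same model would need 8x as much per token.
    print(f"without GQA:      {kv_bytes_per_token(80, 64, 128):,} bytes per token")
```

The worst case assumes every sequence reaches the maximum length; real traffic usually sits well below that, which is what profiling should confirm.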

Key Memory Parameters

# Optimal memory configuration
--gpu-memory-utilization 0.85          # Reserve 15% for system overhead
--max-model-len 4096                   # Maximum sequence length
--max-num-seqs 32                      # Concurrent sequences
--swap-space 16                        # CPU swap in GB
--enable-chunked-prefill              # Reduces memory spikes

KV Cache Management

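  • Primary variable consumer: After model weights, the KV cache takes most of the remaining VRAM
  • Sizing: vLLM reserves the KV cache pool at startup based on --gpu-memory-utilization, then assigns blocks to sequences as they grow
  • Memory fragmentation: PagedAttention eliminates most of the traditional waste
  • Monitoring: Track the vllm:gpu_cache_usage_perc gauge in production (metric name as of v0.6.x)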

Memory Optimization Strategies

  1. Sequence Length Tuning

    • Shorter sequences: Enable more concurrent requests
    • Context window: Balance quality vs. capacity
    • Typical range: 2K-8K tokens for most applications
  2. Batch Size Optimization

    • Larger batches: Improve throughput but increase TTFT
    • Target TTFT: <100ms for interactive applications
    • Monitor: percentiles of the vllm:time_to_first_token_seconds histogram
  3. Model Parallelism

    • Tensor parallelism: Split model across multiple GPUs using --tensor-parallel-size
    • Pipeline parallelism: Stage-based model distribution
    • Selection criteria: Based on model size and GPU availability

Production Deployment

Docker Deployment

FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

# Install Python and pip (the CUDA base image ships without them)
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Create non-root user for security
RUN useradd -m -s /bin/bash vllm && \
    mkdir -p /app /models && \
    chown -R vllm:vllm /app /models

WORKDIR /app

# Install vLLM 0.6.x+
RUN pip3 install --no-cache-dir vllm==0.6.3

# Copy and set permissions
COPY start-vllm.sh /app/start-vllm.sh
RUN chmod +x /app/start-vllm.sh && chown vllm:vllm /app/start-vllm.sh

USER vllm

CMD ["/app/start-vllm.sh"]

#!/bin/bash
# start-vllm.sh - Production entrypoint with graceful shutdown
set -euo pipefail

VLLM_PID=""

# Graceful shutdown handler - forwards the signal so GPU processes are not orphaned
cleanup() {
    echo "Received shutdown signal, stopping vLLM..."
    kill -TERM "${VLLM_PID}" 2>/dev/null || true
    wait "${VLLM_PID}" 2>/dev/null || true
    exit 0
}

trap cleanup SIGTERM SIGINT

# Launch vLLM in the background so the trap stays active
# (exec would replace this shell and discard the trap)
vllm serve "${MODEL_NAME}" \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size "${TP_SIZE:-1}" \
    --gpu-memory-utilization 0.85 \
    --max-model-len 4096 \
    --enable-prefix-caching &
VLLM_PID=$!

# Block until vLLM exits or a signal triggers cleanup
wait "${VLLM_PID}"

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
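      nodeSelector:
        # A 4-bit 70B model needs roughly 35GB for weights alone, so use an 80GB GPU
        # (or raise TP_SIZE and the GPU count); the label value depends on your GPU feature discovery setup
        nvidia.com/gpu.product: A100-SXM4-80GB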
      containers:
      - name: vllm
        image: vllm:latest
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
        env:
        - name: MODEL_NAME
          value: "TheBloke/Llama-2-70B-AWQ"
        - name: TP_SIZE
          value: "1"
        ports:
        - containerPort: 8000
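        # Model loading can take minutes; the startup probe holds off the
        # liveness probe until /health first succeeds (up to 60 x 10s here)
        startupProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 10
          failureThreshold: 60
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 5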
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # CPU utilization is ineffective for GPU-bound inference
  # Use custom metrics via Prometheus Adapter for GPU-aware scaling
  # The names below assume adapter rules that expose vLLM's
  # vllm:num_requests_waiting and vllm:gpu_cache_usage_perc under these names
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting
      target:
        type: AverageValue
        averageValue: "5"  # placeholder threshold; tune per workload
  - type: Pods
    pods:
      metric:
        name: vllm_gpu_cache_usage_perc
      target:
        type: AverageValue
        averageValue: "800m"  # 0.8, i.e. 80% of KV cache blocks in use
---
# Alternative: Vertical Pod Autoscaler
# It adjusts CPU and memory requests only; it does not change GPU allocation
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vllm-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  updatePolicy:
    updateMode: "Auto"

Note: HPA with CPU metrics is ineffective for GPU-bound inference. Deploy Prometheus Adapter with custom metrics derived from vLLM's /metrics endpoint (e.g., vllm:num_requests_waiting, vllm:gpu_cache_usage_perc; names as of v0.6.x), or use Vertical Pod Autoscaler when host CPU or memory requests are the constraint.

Performance Monitoring

Key metrics for production inference:

  1. Throughput Metrics

    • Tokens per second (TPS): Target >500 aggregate TPS per GPU for 70B models; benchmark on your own hardware
    • Requests per second (RPS): Capacity planning metric
    • Batch efficiency: actual_batch_size / max_batch_size
  2. Latency Metrics

    • Time to First Token (TTFT): p50 <100ms, p95 <500ms
    • Time between tokens: Target <50ms for generation
    • Total response time: End-to-end request completion
  3. Resource Metrics

    • GPU memory utilization: Maintain <90% to prevent OOM
    • Prefix cache hit rate: >80% indicates effective prefix caching
    • CPU utilization: Monitor for offloading overhead
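vLLM publishes its metrics in Prometheus text format on the /metrics endpoint, so a quick spot check does not need a full monitoring stack. The sketch below parses a gauge out of that text; read_gauge is a hypothetical helper, SAMPLE is made-up illustrative output, and the metric name follows v0.6.x naming, which may differ in other releases.

```python
from typing import Optional

# Made-up sample in Prometheus text exposition format (not real server output).
SAMPLE = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.42
"""


def read_gauge(metrics_text: str, name: str) -> Optional[float]:
    """Return the first sample value for `name`, or None if it is absent.

    Assumes samples carry no trailing timestamp, so the value is the last field.
    """
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                          # skip blanks, HELP and TYPE lines
        metric, _, value = line.rpartition(" ")
        if metric.split("{", 1)[0] == name:   # strip the label set before comparing
            return float(value)
    return None


if __name__ == "__main__":
    usage = read_gauge(SAMPLE, "vllm:gpu_cache_usage_perc")
    if usage is not None and usage > 0.9:
        print("KV cache nearly full")
    else:
        print(f"KV cache usage: {usage}")
```

In production the same text would come from an HTTP GET against the server's /metrics path; a Prometheus scrape job is the usual long-term home for this check.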

Health Checks and Reliability

vLLM's OpenAI-compatible server exposes /health, which returns HTTP 200 while the engine is running. The check below also tries /v1/health as a defensive fallback; confirm which paths your vLLM version actually serves.

# Version-aware health check
import requests

def check_health(base_url: str) -> bool:
    for path in ["/health", "/v1/health"]:
        try:
            response = requests.get(f"{base_url}{path}", timeout=5)
            if response.status_code == 200:
                return True
        except requests.RequestException:
            continue
    return False

if check_health("http://vllm-server:8000"):
    print("Service healthy")
else:
    print("Service unhealthy")

Capacity Planning

Calculate required GPU count:

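GPUs_needed = ceil(daily_tokens / (tokens_per_second_per_gpu * 86400))

where daily_tokens = daily_requests * avg_tokens_per_request. This assumes evenly distributed load; size for peak traffic rather than the daily average.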

Example for 70B model:

  • Target: 10M tokens/day
  • Per GPU: 500 TPS = 43.2M tokens/day
  • Required GPUs: 10M / 43.2M = 0.23 → 1 GPU sufficient
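The same estimate as a small function, with an optional multiplier for peak load. This is a planning sketch, not a sizing tool: gpus_needed and peak_factor are hypothetical names, and the per-GPU throughput must come from your own benchmarks.

```python
import math


def gpus_needed(daily_tokens: float, tokens_per_second_per_gpu: float,
                peak_factor: float = 1.0) -> int:
    """GPUs required to serve daily_tokens, scaled by a peak-to-average ratio."""
    daily_capacity_per_gpu = tokens_per_second_per_gpu * 86_400  # seconds per day
    return max(1, math.ceil(daily_tokens * peak_factor / daily_capacity_per_gpu))


if __name__ == "__main__":
    print(gpus_needed(10_000_000, 500))                  # average load only
    print(gpus_needed(10_000_000, 500, peak_factor=5))   # peak hour at 5x the average
```

A peak_factor of 1.0 reproduces the even-load assumption; raising it models traffic that concentrates in a few busy hours.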

Advanced Optimizations

Speculative Decoding

  • Up to 2-3x speedup for predictable outputs; the gain depends on how often draft tokens are accepted
  • Configuration example (v0.6.x flags; newer releases move these settings into --speculative-config):
vllm serve TheBloke/Llama-2-70B-AWQ \
    --speculative-model TheBloke/Llama-2-7B-AWQ \
    --num-speculative-tokens 5
  • Draft model: Smaller model generates candidates
  • Verification: Larger model validates in single pass
  • Best for: Structured outputs, low temperature
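The draft-then-verify loop can be sketched with toy stand-ins for the two models. This is an illustration of greedy speculative decoding under simplifying assumptions, not vLLM's implementation: the "models" are plain functions that map a token sequence to the next token, all names are hypothetical, and verification calls the target per position where a real engine scores all draft positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_target(seq: List[int]) -> int:
    """Stand-in for the large model: deterministic next token."""
    return (seq[-1] * 3 + 1) % 11


def toy_draft(seq: List[int]) -> int:
    """Stand-in for the small model: agrees with the target most of the time."""
    guess = toy_target(seq)
    return guess if seq[-1] % 4 else (guess + 1) % 11


def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n):
        seq.append(model(seq))
    return seq[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n: int, k: int = 4) -> Tuple[List[int], int]:
    """Return n generated tokens and the number of target verification passes."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < n:
        proposal: List[int] = []
        for _ in range(k):                      # draft proposes k tokens
            proposal.append(draft(seq + proposal))
        passes += 1                             # one verification pass per round
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)       # replace the first mismatch, then stop
                break
            accepted.append(tok)
        else:
            accepted.append(target(seq + accepted))  # all accepted: one bonus token
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n], passes


if __name__ == "__main__":
    tokens, passes = speculative_decode(toy_target, toy_draft, [1], n=10)
    print(tokens == greedy_decode(toy_target, [1], 10), passes)
```

The output always matches what the target model would produce on its own; the saving comes from needing fewer target passes than generated tokens whenever the draft guesses well.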

Prefix Caching

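  • Up to 400%+ utilization improvement for standardized prompts; the gain scales with how much of each prompt is shared
  • Cache hits: Reuse computed KV cache for repeated prefixes
  • Configuration: --enable-prefix-caching (v0.6.x+)
  • Memory overhead: Minimal additional VRAM usage
  • Hash algorithms: Newer vLLM releases add --prefix-caching-hash-algo (for example sha256_cbor) for reproducible caching; check vllm serve --help for the values your version accepts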
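Prefix reuse works at block granularity: a block is reusable only if it and every block before it match. The toy cache below illustrates that chained-hash idea; it is a sketch with hypothetical names, stores only hashes rather than KV tensors, and does not reflect vLLM's internal data structures or eviction policy.

```python
import hashlib
from typing import List, Set


class PrefixCache:
    """Toy prefix cache keyed by chained hashes of fixed-size token blocks."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.blocks: Set[str] = set()

    def _block_hashes(self, tokens: List[int]) -> List[str]:
        hashes, prev = [], ""
        full = len(tokens) - len(tokens) % self.block_size  # ignore the partial tail
        for i in range(0, full, self.block_size):
            block = tokens[i:i + self.block_size]
            # Chaining the previous hash makes each key depend on the whole prefix.
            prev = hashlib.sha256(f"{prev}|{block}".encode()).hexdigest()
            hashes.append(prev)
        return hashes

    def lookup_and_insert(self, tokens: List[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        hashes = self._block_hashes(tokens)
        hits = 0
        for h in hashes:
            if h not in self.blocks:
                break
            hits += 1
        self.blocks.update(hashes)
        return hits * self.block_size


if __name__ == "__main__":
    cache = PrefixCache(block_size=4)
    system_prompt = list(range(8))                             # shared 8-token prefix
    print(cache.lookup_and_insert(system_prompt + [100, 101, 102, 103, 104]))
    print(cache.lookup_and_insert(system_prompt + [200, 201]))  # reuses the prefix
```

Only whole blocks are shared, so keeping the common system prompt at the very start of every request, byte for byte, is what makes the cache effective.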

Cross-Instance KV Cache Sharing

  • 3-10x latency reduction for repetitive workloads
  • Distributed cache: Share KV cache across serving instances
  • Network overhead: Requires high-bandwidth interconnect
  • Best for: Multi-instance deployments with repetitive context

Implementation Checklist

  1. Quantization Selection

    • Profile model quality vs. quantization level
    • Select AWQ for production quality requirements
    • Calculate memory requirements for both weights and KV cache (include the factor of 2 for K and V, and use the KV head count)
  2. vLLM Configuration

    • Use vLLM 0.6.x+ with the vllm serve <model> command (model passed as a positional argument)
    • Omit --quantization flag for pre-quantized models (auto-detected)
    • Set gpu-memory-utilization to 0.85-0.9
    • Configure max-num-seqs based on memory capacity
    • Enable enable-chunked-prefill and enable-prefix-caching
  3. Production Deployment

    • Implement health checks against the /health endpoint (confirm the path for your vLLM version)
    • Configure monitoring for TPS, TTFT, and memory metrics
    • Set up alerting for latency degradation and OOM conditions
    • Use custom metrics or VPA for Kubernetes autoscaling (not CPU-based HPA)
    • Run containers as non-root user with proper WORKDIR
    • Implement signal handling in entrypoint scripts for graceful shutdown
  4. Optimization

    • Enable speculative decoding for appropriate workloads
    • Configure prefix caching for standardized prompts
    • Implement cross-instance KV cache sharing if applicable
    • Verify Flash Attention 2 support on target GPUs (Ampere+)
    • Enable streaming for chat/completion endpoints
    • Configure LoRA adapters for multi-tenant scenarios

Cost optimization results vary by workload and infrastructure. Benchmark your specific deployment using throughput-per-dollar analysis: compare vLLM's continuous batching against traditional request-per-instance architectures. Key factors include request distribution, sequence length variance, and GPU memory bandwidth. Validate memory calculations against actual production metrics before capacity planning.

