LLM Prompt Caching

Vatsal Bajpai

10 min read·May 11, 2025

Prompt Caching for LLMs: Implementation and Benefits

Large Language Models power modern AI applications but come with significant computational costs. Prompt caching reduces these costs and improves response times by serving repeated requests from a cache instead of re-running inference. Let's explore implementation approaches, performance metrics, and cost benefits.

Understanding Prompt Caching

Prompt caching stores and retrieves previously computed results instead of re-running the entire inference process when identical or similar prompts are sent to an LLM.

Key Concepts:

Cache Keys: Hashes or other deterministic representations of input prompts
Cache Values: Stored model outputs associated with specific inputs
Invalidation Strategies: Methods to maintain cache freshness
Hit Ratio: Percentage of all requests that are served from the cache
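The four concepts above can be sketched together in a few lines. This is a minimal illustration, not a production design: the `TTLCache` class, its `clock` parameter, and the time-to-live invalidation strategy are assumptions chosen for the example, and TTL is only one of several possible invalidation strategies.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Toy in-memory cache showing keys, values, TTL invalidation, and hit ratio."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self.store: Dict[str, Tuple[float, str]] = {}  # key -> (expiry time, response)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(prompt: str, params: Dict[str, Any]) -> str:
        # Cache key: deterministic hash of the prompt plus generation parameters
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, prompt: str, params: Dict[str, Any]) -> Optional[str]:
        key = self.make_key(prompt, params)
        entry = self.store.get(key)
        if entry is not None and entry[0] > self.clock():
            self.hits += 1
            return entry[1]  # cache value: the stored model output
        # Invalidation: drop the entry if it exists but has expired
        self.store.pop(key, None)
        self.misses += 1
        return None

    def set(self, prompt: str, params: Dict[str, Any], response: str) -> None:
        expires_at = self.clock() + self.ttl_seconds
        self.store[self.make_key(prompt, params)] = (expires_at, response)

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Usage with a fake clock so the expiry path is visible
now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
params = {"temperature": 0.0}

first = cache.get("What is caching?", params)   # miss: nothing stored yet
cache.set("What is caching?", params, "Storing results for reuse.")
second = cache.get("What is caching?", params)  # hit
now[0] = 61.0                                   # advance past the TTL
third = cache.get("What is caching?", params)   # miss: entry expired
```

After these three lookups the cache has recorded one hit and two misses, so `cache.hit_ratio()` is one third.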

Implementation Approaches

Basic Hash-Based Caching

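import hashlib
import json
from collections import OrderedDict
from typing import Any, Dict, Optional

class PromptCache:
    def __init__(self, capacity: int = 1000):
        # OrderedDict keeps entries in recency order: least recently used first
        self.cache: "OrderedDict[str, str]" = OrderedDict()
        self.capacity = capacity

    def _generate_key(self, prompt: str, params: Dict[str, Any]) -> str:
        # sort_keys=True makes the key independent of parameter ordering
        cache_input = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(cache_input.encode()).hexdigest()

    def get(self, prompt: str, params: Dict[str, Any]) -> Optional[str]:
        key = self._generate_key(prompt, params)
        if key not in self.cache:
            return None
        # Mark the entry as most recently used
        self.cache.move_to_end(key)
        return self.cache[key]

    def set(self, prompt: str, params: Dict[str, Any], response: str) -> None:
        key = self._generate_key(prompt, params)

        # LRU eviction policy: only evict when inserting a new key into a full cache
        if key in self.cache:
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # drop the least recently used entry
        self.cache[key] = response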
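To make the behaviour of exact-match keys concrete, the sketch below uses the same hashing scheme as the class above in a standalone helper (the name `cache_key` and the sample prompts are illustrative). It shows which requests collide on the same key and which do not.

```python
import hashlib
import json
from typing import Any, Dict


def cache_key(prompt: str, params: Dict[str, Any]) -> str:
    # Same scheme as above: canonical JSON, then SHA-256
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


key_a = cache_key("Summarize this text", {"temperature": 0.0, "max_tokens": 256})
key_b = cache_key("Summarize this text", {"max_tokens": 256, "temperature": 0.0})
key_c = cache_key("Summarize this text", {"temperature": 0.7, "max_tokens": 256})
key_d = cache_key("summarize this text ", {"temperature": 0.0, "max_tokens": 256})

same_despite_order = key_a == key_b     # parameter order does not matter
differs_on_params = key_a != key_c      # a different temperature is a different entry
differs_on_whitespace = key_a != key_d  # case and trailing spaces change the key
```

The last comparison is the main limitation of exact matching: trivially different prompts miss the cache unless they are normalized first or matched semantically.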

Semantic Caching

Semantic caching returns a cached result when a new prompt is sufficiently similar to a previously seen one, not only when it is identical:

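from typing import Any, Dict, List, Optional

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticPromptCache:
    def __init__(self, model_name: str = "all-mpnet-base-v2",
                 threshold: float = 0.95, capacity: int = 1000):
        self.encoder = SentenceTransformer(model_name)
        self.threshold = threshold
        self.capacity = capacity
        self.embeddings: List[np.ndarray] = []
        self.responses: List[str] = []

    def get(self, prompt: str, params: Dict[str, Any]) -> Optional[str]:
        # Note: params are accepted for interface parity but are not used in matching,
        # so entries generated with different parameters can be returned
        if not self.embeddings:
            return None

        # Generate embedding for the query prompt
        query_embedding = self.encoder.encode([prompt])[0].reshape(1, -1)

        # Calculate similarity with all cached prompts (one row of scores)
        similarities = cosine_similarity(query_embedding, np.array(self.embeddings))[0]

        # Find the most similar prompt
        max_idx = int(np.argmax(similarities))

        # Return cached response if similarity is above threshold
        if similarities[max_idx] >= self.threshold:
            return self.responses[max_idx]
        return None

    def set(self, prompt: str, params: Dict[str, Any], response: str) -> None:
        # Store the prompt embedding alongside its response; evicting the oldest
        # entry (FIFO) is one simple way to respect the capacity limit
        if len(self.embeddings) >= self.capacity:
            self.embeddings.pop(0)
            self.responses.pop(0)
        self.embeddings.append(self.encoder.encode([prompt])[0])
        self.responses.append(response)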
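The effect of the similarity threshold is easier to see without an embedding model in the loop. The following sketch is an assumption-laden toy: it uses hand-made three-dimensional vectors in place of real embeddings and a pure-Python cosine similarity, and the helper names `cosine` and `best_match` are illustrative.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(query: Sequence[float],
               cached: List[Tuple[Sequence[float], str]],
               threshold: float) -> Optional[str]:
    # Linear scan for the most similar cached embedding
    best_score, best_response = -1.0, None
    for embedding, response in cached:
        score = cosine(query, embedding)
        if score > best_score:
            best_score, best_response = score, response
    # Only trust the match if it clears the threshold
    return best_response if best_score >= threshold else None


# Hand-made "embeddings"; real encoders produce hundreds of dimensions
cached = [
    ([1.0, 0.0, 0.0], "Answer about password resets"),
    ([0.0, 1.0, 0.0], "Answer about billing"),
]
near = [0.99, 0.05, 0.0]  # almost the same direction as the first entry
far = [0.6, 0.6, 0.5]     # not close to either entry

strict_near = best_match(near, cached, threshold=0.95)  # served from cache
strict_far = best_match(far, cached, threshold=0.95)    # falls through to the LLM
loose_far = best_match(far, cached, threshold=0.6)      # loose threshold returns a doubtful match
```

A high threshold such as 0.95 trades hit ratio for safety; lowering it raises the hit ratio but risks returning an answer to a different question, as the last call shows.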

Distributed Caching with Redis

import hashlib
import json
from typing import Any, Dict, Optional

import redis

class RedisPromptCache:
    def __init__(self, host: str = "localhost", port: int = 6379, expiration: int = 86400):
        self.client = redis.Redis(host=host, port=port)
        self.expiration = expiration  # Default TTL in seconds: 24 hours

    def _generate_key(self, prompt: str, params: Dict[str, Any]) -> str:
        cache_input = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return f"prompt_cache:{hashlib.sha256(cache_input.encode()).hexdigest()}"

    def get(self, prompt: str, params: Dict[str, Any]) -> Optional[str]:
        key = self._generate_key(prompt, params)
        cached_data = self.client.get(key)
        if cached_data is None:
            return None
        # JSON instead of pickle: unpickling data read from a shared store
        # can execute arbitrary code if the store is ever compromised
        return json.loads(cached_data)

    def set(self, prompt: str, params: Dict[str, Any], response: str) -> None:
        key = self._generate_key(prompt, params)
        # SETEX stores the value and its TTL in a single call
        self.client.setex(key, self.expiration, json.dumps(response))

Performance Benefits

Response Time Reduction

Cache Type	Cold Start (ms)	Cache Hit (ms)	Speedup Factor
No Cache	1250	N/A	1.0x
Local Cache	1250	15	83.3x
Redis Cache	1250	35	35.7x
Semantic	1250	150	8.3x
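The speedup column is the cold-start latency divided by the cache-hit latency. The per-hit speedup overstates the end-to-end gain, though, because misses still pay the full model latency. A short sketch of both calculations follows; the function names are illustrative, and the expected-latency formula assumes the cache lookup adds negligible time on a miss.

```python
def speedup(miss_ms: float, hit_ms: float) -> float:
    # Speedup factor as used in the table: cold-start latency / cache-hit latency
    return miss_ms / hit_ms


def expected_latency_ms(hit_ratio: float, hit_ms: float, miss_ms: float) -> float:
    # Average latency over all requests, weighting hits and misses by frequency
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms


local_speedup = round(speedup(1250, 15), 1)      # 83.3, matching the Local Cache row
redis_speedup = round(speedup(1250, 35), 1)      # 35.7, matching the Redis Cache row
semantic_speedup = round(speedup(1250, 150), 1)  # 8.3, matching the Semantic row

# With a 50% hit ratio on the local cache, average latency only roughly halves
average_ms = expected_latency_ms(0.5, 15, 1250)  # 632.5
```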

Cache Hit Ratios in Production

Application Type	Exact Match (%)	Semantic Match (%)	Combined (%)
Customer Support	22.5	35.3	57.8
Document Q&A	43.7	28.1	71.8
Code Generation	18.2	9.7	27.9

Cost Analysis

For a service using Claude 3.5 Sonnet with:

1M API calls per day
Average of 1000 tokens per request
$0.01/1000 tokens (blended rate)
50% cache hit ratio

Without Caching:

1M requests × 1000 tokens × $0.01/1000 tokens = $10,000/day

With Caching:

500K requests × 1000 tokens × $0.01/1000 tokens = $5,000/day
Infrastructure cost: ~$500/day
Net savings: $4,500/day or $1.64M annually
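The arithmetic above can be reproduced with a small helper, which also makes it easy to plug in a different hit ratio or token price. The function name `daily_llm_cost` is illustrative, and the figures are the ones from this example scenario rather than measured values.

```python
def daily_llm_cost(requests_per_day: int, tokens_per_request: int,
                   usd_per_1k_tokens: float, hit_ratio: float = 0.0) -> float:
    # Only cache misses reach the LLM API and are billed
    billed_requests = requests_per_day * (1.0 - hit_ratio)
    return billed_requests * tokens_per_request / 1000 * usd_per_1k_tokens


INFRA_COST_PER_DAY = 500.0  # cache infrastructure estimate from the scenario above

baseline = daily_llm_cost(1_000_000, 1000, 0.01)                     # 10,000 USD/day
with_cache = daily_llm_cost(1_000_000, 1000, 0.01, hit_ratio=0.5)    # 5,000 USD/day
net_daily_savings = baseline - with_cache - INFRA_COST_PER_DAY       # 4,500 USD/day
net_annual_savings = net_daily_savings * 365                         # about 1.64M USD/year
```

Savings scale linearly with the hit ratio, so at a fixed infrastructure cost of $500/day the cache breaks even at a 5% hit ratio in this scenario.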

Production Implementation with FastAPI

import hashlib
import json
import os
import time

import anthropic
import redis
from fastapi import FastAPI
from pydantic import BaseModel

# Models
class PromptRequest(BaseModel):
    prompt: str
    temperature: float = 0.7
    max_tokens: int = 1000

# Initialize Redis and LLM client
redis_client = redis.Redis(host="localhost", port=6379)
# Read the API key from the environment rather than hardcoding it
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

app = FastAPI()

def get_cache_key(request: PromptRequest) -> str:
    request_dict = {
        "prompt": request.prompt,
        "temperature": request.temperature,
        "max_tokens": request.max_tokens
    }
    serialized = json.dumps(request_dict, sort_keys=True)
    return f"prompt_cache:{hashlib.sha256(serialized.encode()).hexdigest()}"

@app.post("/generate")
def generate_text(request: PromptRequest):
    # Declared with plain `def` so FastAPI runs it in a threadpool: the Redis and
    # Anthropic clients used here are synchronous and would block the event loop
    # inside an `async def` handler
    start_time = time.perf_counter()

    # Check cache first
    cache_key = get_cache_key(request)
    cached_data = redis_client.get(cache_key)

    if cached_data is not None:
        redis_client.incr("stats:cache_hits")
        return {
            "text": json.loads(cached_data),
            "cached": True,
            # Report the measured lookup time instead of a hardcoded 0
            "latency_ms": int((time.perf_counter() - start_time) * 1000)
        }

    # Cache miss - call the LLM API
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=request.max_tokens,
        temperature=request.temperature,
        messages=[
            {"role": "user", "content": request.prompt}
        ]
    )
    response_text = response.content[0].text

    # Calculate latency and cache the response for 24 hours
    latency_ms = int((time.perf_counter() - start_time) * 1000)
    redis_client.setex(cache_key, 86400, json.dumps(response_text))
    redis_client.incr("stats:cache_misses")

    return {
        "text": response_text,
        "cached": False,
        "latency_ms": latency_ms
    }

Advanced Techniques

Prompt Templating for Higher Hit Ratios

from string import Template

class PromptTemplate:
    def __init__(self, template_string: str):
        self.template = Template(template_string)
    
    def format(self, **kwargs) -> str:
        return self.template.safe_substitute(**kwargs)

# Example template
customer_support_template = PromptTemplate(
    "Please help with this customer query about $product_category: $query"
)
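Templates raise the hit ratio only if the values substituted into them are also consistent, since any difference in the final prompt string produces a different cache key. The sketch below is one possible approach, with assumptions worth stating: the `normalize`, `render`, and `prompt_key` helpers are illustrative, and lowercasing inputs is only safe when case carries no meaning (it would not be for code or identifiers).

```python
import hashlib
import re
from string import Template

SUPPORT_TEMPLATE = Template(
    "Please help with this customer query about $product_category: $query"
)


def normalize(text: str) -> str:
    # Collapse runs of whitespace, trim the ends, and lowercase
    return re.sub(r"\s+", " ", text).strip().lower()


def render(product_category: str, query: str) -> str:
    # substitute() raises KeyError on a missing field, so a half-filled
    # template never ends up in the cache
    return SUPPORT_TEMPLATE.substitute(
        product_category=normalize(product_category),
        query=normalize(query),
    )


def prompt_key(prompt: str) -> str:
    return hashlib.sha256(prompt.encode()).hexdigest()


# Two superficially different requests map to the same cache key
key_1 = prompt_key(render("Billing", "How do I update my card?"))
key_2 = prompt_key(render("billing ", "how do I   update my card?"))
same_key = key_1 == key_2
```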

Monitoring Cache Performance

@app.get("/cache/stats")
def get_cache_stats():
    # Plain `def`: the synchronous Redis calls run in FastAPI's threadpool
    hits = int(redis_client.get("stats:cache_hits") or 0)
    misses = int(redis_client.get("stats:cache_misses") or 0)
    total = hits + misses
    hit_ratio = hits / total if total > 0 else 0

    return {
        "hits": hits,
        "misses": misses,
        "total": total,
        "hit_ratio": hit_ratio,
        # Assumes ~1000 tokens per request at $0.01 per 1000 tokens,
        # i.e. $0.01 saved per cache hit
        "estimated_savings_usd": hits * 0.01
    }
