AI & Machine Learning Engineering

Building Memory Systems for LLM Applications: Context Management Best Practices

MatterAI Agent

Building Conversational AI: Context Management and Memory Systems

Effective conversational AI requires managing limited context windows while maintaining coherent long-term interactions. This guide covers architectural patterns for implementing robust memory systems in LLM-based applications.

Context Window Management

The context window is a fixed token budget (typically 4K-128K tokens) that constrains what the model can "see" at inference time. Managing this constraint is critical for maintaining conversation continuity.

Sliding Window Technique

Maintain a fixed-size buffer of recent conversation turns using FIFO (First-In-First-Out) eviction. When new messages arrive, remove the oldest messages to stay within token limits.

Key parameters:

  • Token budget per session
  • Message retention count
  • System prompt reserved space
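
Putting these parameters together, a minimal sliding-window buffer might look like the following sketch (the token budget, reserved system-prompt space, and model name are illustrative defaults):

import tiktoken

class SlidingWindowBuffer:
    def __init__(self, token_budget: int = 4000, reserved_for_system: int = 500, model: str = "gpt-4"):
        self.messages = []
        self.budget = token_budget - reserved_for_system  # leave room for the system prompt
        self.encoding = tiktoken.encoding_for_model(model)

    def _tokens(self, text: str) -> int:
        return len(self.encoding.encode(text))

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # FIFO eviction: drop the oldest turns until the buffer fits the budget
        while sum(self._tokens(m["content"]) for m in self.messages) > self.budget:
            self.messages.pop(0)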

Tokenization Impact

Different tokenizers produce varying token counts. Always measure tokens using the model's specific tokenizer (e.g., tiktoken for GPT models) rather than character count.

Optimization: Cache tokenized messages to avoid re-tokenization on each turn.
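
One simple way to do this is to memoize token counts per message so only new content is tokenized (the cache size below is arbitrary):

from functools import lru_cache
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")

@lru_cache(maxsize=4096)
def cached_token_count(text: str) -> int:
    # Identical message content is only tokenized once per process
    return len(encoding.encode(text))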

Summarization-Based Compression

For long conversations, compress older messages into summaries before eviction. Use a secondary LLM call to generate condensed representations of conversation segments.

Trade-off: Summarization loses granular detail but preserves high-level context across extended sessions.
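
A minimal sketch of this pattern, assuming the OpenAI client and a gpt-4-class model for the secondary call (the summary instruction and token cap are illustrative):

from typing import Dict, List
from openai import OpenAI

client = OpenAI()

def summarize_segment(messages: List[Dict], model: str = "gpt-4") -> Dict:
    """Compress a segment of older messages into a single summary message."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize this conversation segment. Keep names, decisions, and open questions."},
            {"role": "user", "content": transcript}
        ],
        max_tokens=200
    )
    summary = response.choices[0].message.content
    return {"role": "system", "content": f"Summary of earlier conversation: {summary}"}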

Memory Systems Architecture

Short-Term Memory (Session-Based)

Storage options:

  • Redis with TTL (Time to Live) for session expiration
  • In-memory dictionaries for single-instance deployments
  • Key-value stores like Memcached for distributed systems

Data structure:

{
    "session_id": "uuid",
    "messages": [
        {"role": "user", "content": "...", "timestamp": 1234567890},
        {"role": "assistant", "content": "...", "timestamp": 1234567891}
    ],
    "metadata": {"user_id": "...", "context": "..."}
}
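
For the Redis option, storing this structure with a TTL might look like the following sketch (the connection details and one-hour expiration are assumptions):

import json
from typing import Dict, Optional
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
SESSION_TTL_SECONDS = 3600  # expire idle sessions after one hour

def save_session(session: Dict):
    # Refresh the TTL on every write so active sessions stay alive
    r.set(f"session:{session['session_id']}", json.dumps(session), ex=SESSION_TTL_SECONDS)

def load_session(session_id: str) -> Optional[Dict]:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None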

Long-Term Memory (Persistent)

Vector Database Storage

Store conversation embeddings for semantic retrieval. Use cosine similarity to find relevant past interactions.

Common implementations:

  • Pinecone, Weaviate, Chroma, or pgvector
  • Embedding models: text-embedding-3-small, all-MiniLM-L6-v2

Schema:

{
    "id": "vector_id",
    "embedding": [0.1, 0.2, ...],  # 1536-dimensional vector
    "content": "original message text",
    "metadata": {
        "session_id": "...",
        "timestamp": 1234567890,
        "user_id": "..."
    }
}
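
As one concrete example, the same schema can be persisted in a local Chroma collection (the collection name is arbitrary; other vector databases expose similar upsert APIs):

from typing import Dict, List
import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection("conversation_memory")

def store_message(vector_id: str, embedding: List[float], content: str, metadata: Dict):
    # Chroma stores the embedding, original text, and flat metadata together
    collection.add(
        ids=[vector_id],
        embeddings=[embedding],
        documents=[content],
        metadatas=[metadata]
    )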

Memory Deduplication

Prevent storing redundant or near-duplicate content by checking embedding similarity against existing memories before insertion:

from typing import List
from openai import OpenAI
import numpy as np

client = OpenAI()

def generate_embedding(text: str) -> np.ndarray:
    """Generate embedding using OpenAI API"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def is_duplicate(new_content: str, existing_embeddings: List[np.ndarray], threshold: float = 0.95) -> bool:
    new_embedding = generate_embedding(new_content)
    for existing in existing_embeddings:
        similarity = cosine_similarity(new_embedding, existing)
        if similarity > threshold:
            return True
    return False

Strategy: Compute semantic similarity before insertion. Skip storage if similarity exceeds threshold (typically 0.95-0.98).

Retrieval-Augmented Generation (RAG)

Inject relevant historical context into the prompt based on semantic similarity to the current query.

Process:

  1. Embed current user query
  2. Query vector database for top-k similar messages (typically k=3-10)
  3. Apply hybrid search (combine keyword BM25 with vector similarity)
  4. Re-rank results using cross-encoder for precision
  5. Format retrieved context into the system prompt
  6. Generate response with augmented context

Hybrid Search Implementation:

from typing import List, Dict, Tuple

def vector_search(query: str, k: int = 5) -> List[Dict]:
    """Placeholder: Implement using your vector database (Pinecone, Weaviate, etc.)"""
    embedding = generate_embedding(query)
    # Return list of dicts with 'id' and 'content' keys
    return []

def bm25_search(query: str, k: int = 5) -> List[Dict]:
    """Placeholder: Implement BM25 using rank_bm25 or similar library"""
    # Return list of dicts with 'id' and 'content' keys
    return []

def hybrid_search(query: str, alpha: float = 0.5, k: int = 5) -> List[Tuple[str, float]]:
    vector_results = vector_search(query, k=k)
    keyword_results = bm25_search(query, k=k)
    
    # Weighted reciprocal rank fusion: alpha weights vector ranks, (1 - alpha) weights keyword ranks
    scores = {}
    for rank, doc in enumerate(vector_results, 1):
        scores[doc["id"]] = scores.get(doc["id"], 0) + alpha / rank
    for rank, doc in enumerate(keyword_results, 1):
        scores[doc["id"]] = scores.get(doc["id"], 0) + (1 - alpha) / rank
    
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]

Re-ranking:

from typing import List, Dict
from sentence_transformers import CrossEncoder

# Initialize cross-encoder (load once at application startup)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query: str, candidates: List[Dict], top_n: int = 3) -> List[Dict]:
    pairs = [[query, doc["content"]] for doc in candidates]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

Entity Memory Pattern

Track specific entities (names, preferences, facts) across sessions using structured extraction.

Implementation with Tool Calling:

import json
from typing import List, Dict, Any
from openai import OpenAI

client = OpenAI()

def extract_entities(message: str) -> Dict[str, Any]:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Extract entities from the user message."},
            {"role": "user", "content": message}
        ],
        tools=[{
            "type": "function",
            "function": {
                "name": "extract_entities",
                "description": "Extract key entities from conversation",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "user_name": {"type": "string"},
                        "location": {"type": "string"},
                        "preferences": {"type": "array", "items": {"type": "string"}},
                        "facts": {"type": "array", "items": {"type": "string"}}
                    },
                    "required": []
                }
            }
        }],
        tool_choice={"type": "function", "function": {"name": "extract_entities"}}
    )
    
    tool_call = response.choices[0].message.tool_calls[0]
    return json.loads(tool_call.function.arguments)

entities = {
    "user_name": "Alex",
    "location": "San Francisco",
    "preferences": ["Python", "machine learning"]
}

Inject this structured data into the system prompt on each turn for persistent personalization.
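
A small sketch of that injection step, using a hypothetical build_entity_prompt helper that appends the extracted entities to the base system prompt:

from typing import Any, Dict

def build_entity_prompt(base_prompt: str, entities: Dict[str, Any]) -> str:
    """Append extracted entities to the system prompt for persistent personalization."""
    lines = [f"- {key}: {value}" for key, value in entities.items() if value]
    return base_prompt + "\n\nKnown facts about the user:\n" + "\n".join(lines)

system_content = build_entity_prompt("You are a helpful assistant.", entities)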

Memory Importance Scoring

Replace simple FIFO eviction with importance-based retention:

import time
from typing import Dict, List

def calculate_importance_score(message: Dict, current_time: float) -> float:
    age = current_time - message["timestamp"]
    recency = 1 / (1 + age / 3600)  # Decay over hours
    
    relevance = message.get("relevance_score", 0.5)
    explicit_importance = message.get("importance", 0.5)
    
    return 0.4 * recency + 0.3 * relevance + 0.3 * explicit_importance

def evict_by_importance(messages: List[Dict], target_count: int) -> List[Dict]:
    scored = [(msg, calculate_importance_score(msg, time.time())) for msg in messages]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [msg for msg, _ in scored[:target_count]]

Conversation State Management

Track conversation state beyond raw messages:

from typing import Any, Dict, List, Optional

class ConversationState:
    def __init__(self):
        self.active_goal: Optional[str] = None
        self.user_intent: str = "unknown"
        self.conversation_phase: str = "greeting"
        self.pending_tasks: List[str] = []
        self.context_variables: Dict[str, Any] = {}
    
    def update_state(self, message: str, entities: Dict):
        # Use an LLM to classify intent and update the state fields (see the sketch below)
        pass
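
One hedged way to fill in update_state is a lightweight LLM intent classifier; the label set below is illustrative, and client is the OpenAI client initialized earlier:

INTENTS = ["question", "task_request", "smalltalk", "feedback"]

def classify_intent(message: str) -> str:
    # Ask the model for a single label; fall back to "unknown" on anything unexpected
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Classify the user's intent as one of: {', '.join(INTENTS)}. Reply with the label only."},
            {"role": "user", "content": message}
        ],
        temperature=0,
        max_tokens=5
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in INTENTS else "unknown"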

Memory Consistency and Grounding

Prevent hallucinations by grounding responses in retrieved context:

Strategies:

  • Require citations for factual claims
  • Use "I don't have information about that" when context is insufficient
  • Implement fact-checking against retrieved documents
  • Track confidence scores for retrieved information
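
One way to combine the first two strategies is a grounding prompt template; the wording below is a sketch, not a tested prompt:

from typing import List

GROUNDED_SYSTEM_PROMPT = """You are a helpful assistant. Answer only using the retrieved context below.
Cite the supporting snippet number for every factual claim, e.g. [2].
If the context does not contain the answer, reply: "I don't have information about that."

Retrieved context:
{context}"""

def build_grounded_prompt(context_snippets: List[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {snippet}" for i, snippet in enumerate(context_snippets))
    return GROUNDED_SYSTEM_PROMPT.format(context=numbered)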

Implementation Example

from typing import List, Dict, Optional, Any
from openai import OpenAI
import tiktoken
import numpy as np
import time

class VectorDatabase:
    """
    Abstracted vector database interface.
    
    NOTE: This implementation uses O(n) linear search for demonstration.
    For production use, replace with ANN-indexed databases (Pinecone, Weaviate, Chroma)
    that use HNSW or similar approximate nearest neighbor algorithms for O(log n) performance.
    """
    def __init__(self):
        self.documents: List[Dict] = []
    
    def insert(self, content: str, embedding: np.ndarray, metadata: Dict):
        self.documents.append({
            "content": content,
            "embedding": embedding,
            "metadata": metadata
        })
    
    def search(self, query_embedding: np.ndarray, top_k: int = 3) -> List[Dict]:
        if not self.documents:
            return []
        
        similarities = []
        
        for doc in self.documents:
            sim = np.dot(query_embedding, doc["embedding"]) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc["embedding"])
            )
            similarities.append((doc, sim))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        return [{"content": doc["content"], "score": sim} for doc, sim in similarities[:top_k]]

class ConversationalMemory:
    def __init__(self, max_tokens: int = 4000, model: str = "gpt-4"):
        self.messages: List[Dict] = []
        self.max_tokens = max_tokens
        self.model = model
        self.vector_db = VectorDatabase()
        self.encoding = tiktoken.encoding_for_model(model)
        self.client = OpenAI()
        self.system_prompt = {"role": "system", "content": "You are a helpful assistant."}
        self.messages.append(self.system_prompt.copy())
        
    def _count_tokens(self, messages: Optional[List[Dict]] = None) -> int:
        """
        Count tokens using tiktoken with message format overhead.
        Each message adds ~3-4 tokens for role/name fields in ChatML format.
        """
        msgs = messages if messages is not None else self.messages
        tokens_per_message = 3  # role, name, content delimiters
        tokens_per_name = 1
        
        total = 0
        for msg in msgs:
            total += tokens_per_message
            total += len(self.encoding.encode(msg["content"]))
            if "name" in msg:
                total += tokens_per_name + len(self.encoding.encode(msg["name"]))
        
        total += 3  # Reply priming tokens
        return total
    
    def add_message(self, role: str, content: str, importance: float = 0.5):
        try:
            timestamp = time.time()
            message = {"role": role, "content": content, "timestamp": timestamp, "importance": importance}
            self.messages.append(message)
            
            # Store in vector DB for long-term retrieval
            embedding_response = self.client.embeddings.create(
                model="text-embedding-3-small",
                input=content
            )
            embedding = np.array(embedding_response.data[0].embedding)
            
            self.vector_db.insert(
                content=content,
                embedding=embedding,
                metadata={"role": role, "timestamp": timestamp, "importance": importance}
            )
            
            self._trim_context()
            
        except Exception as e:
            print(f"Error storing message: {e}")
    
    def _trim_context(self):
        """Evict the lowest-importance messages while over the token limit, preserving the system prompt"""
        while self._count_tokens() > self.max_tokens and len(self.messages) > 1:
            # Score every message except the system prompt at index 0
            scored = [(idx, calculate_importance_score(msg, time.time()))
                      for idx, msg in enumerate(self.messages[1:], start=1)]
            
            # Remove the single lowest-importance message, then re-check the budget
            lowest_idx = min(scored, key=lambda x: x[1])[0]
            self.messages.pop(lowest_idx)
    
    def retrieve_relevant_context(self, query: str, k: int = 3) -> str:
        """RAG: Retrieve semantically similar past messages"""
        try:
            embedding_response = self.client.embeddings.create(
                model="text-embedding-3-small",
                input=query
            )
            query_embedding = np.array(embedding_response.data[0].embedding)
            
            results = self.vector_db.search(
                query_embedding=query_embedding,
                top_k=k
            )
            
            return "\n".join([f"{r['content']}" for r in results])
            
        except Exception as e:
            print(f"Error retrieving context: {e}")
            return ""
    
    def build_prompt(self, user_query: str) -> List[Dict]:
        """Build complete prompt with system message, context, and conversation history"""
        relevant_context = self.retrieve_relevant_context(user_query)
        
        # Create new system prompt with injected context (preserves original)
        system_prompt_content = f"""{self.system_prompt['content']}

Relevant context from past conversations:
{relevant_context}"""
        
        prompt_messages = [
            {"role": "system", "content": system_prompt_content}
        ]
        
        # Add conversation history (excluding the stored system prompt)
        for msg in self.messages[1:]:
            prompt_messages.append({
                "role": msg["role"],
                "content": msg["content"]
            })
        
        # Append the current user query as the final message
        prompt_messages.append({"role": "user", "content": user_query})
        
        return prompt_messages
    
    def generate_response(self, user_query: str) -> str:
        """Generate a response using the OpenAI API and record the turn in memory"""
        try:
            prompt_messages = self.build_prompt(user_query)
            
            response = self.client.chat.completions.create(
                model=self.model,
                messages=prompt_messages,
                temperature=0.7,
                max_tokens=500
            )
            
            answer = response.choices[0].message.content
            
            # Persist both sides of the turn so later retrieval and trimming see them
            self.add_message("user", user_query)
            self.add_message("assistant", answer)
            
            return answer
            
        except Exception as e:
            print(f"Error generating response: {e}")
            return "I encountered an error generating a response."

Optimization & Guardrails

Latency Trade-offs

  • Vector search: 50-200ms depending on database size and indexing
  • Embedding generation: 10-50ms per query
  • Memory retrieval: Adds 100-300ms total latency per turn

Mitigation: Cache frequently accessed embeddings and use approximate nearest neighbor (ANN) algorithms like HNSW.

Lost in the Middle Phenomenon

LLMs tend to overlook information in the middle of long contexts. Place critical information at the beginning or end of the prompt.

Strategy: Use the "bookend" pattern—system prompt first, recent messages last, retrieved context in middle.
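
A sketch of the bookend assembly, assuming the retrieved context and recent turns are already available:

from typing import Dict, List

def assemble_bookend_prompt(system_prompt: str, retrieved_context: str,
                            recent_messages: List[Dict]) -> List[Dict]:
    # Critical content at the edges: system prompt first, recent turns last,
    # retrieved context in the middle where omissions hurt least
    messages = [{"role": "system", "content": system_prompt}]
    if retrieved_context:
        messages.append({"role": "system",
                         "content": f"Relevant context from past conversations:\n{retrieved_context}"})
    messages.extend(recent_messages)
    return messages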

Token Budget Allocation

Scale allocation based on model context window size:

Model Size  | System Prompt | Retrieved Context | Recent Conversation | Response Buffer
4K tokens   | 300-500       | 1000-1500         | 1500-2000           | 500-700
8K tokens   | 500-1000      | 2000-3000         | 3000-4000           | 1000-1500
32K tokens  | 1000-2000     | 10000-15000       | 12000-15000         | 3000-5000
128K tokens | 2000-4000     | 40000-60000       | 60000-80000         | 10000-15000
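
A simple programmatic starting point is a proportional split; the fractions below approximate the smaller-window rows in the table (the larger rows cap the system prompt more tightly), so treat them as a starting point to tune per model:

from typing import Dict

# Approximate fractions of the context window per section
BUDGET_FRACTIONS = {
    "system_prompt": 0.10,
    "retrieved_context": 0.35,
    "recent_conversation": 0.40,
    "response_buffer": 0.15,
}

def allocate_budget(context_window: int) -> Dict[str, int]:
    return {section: int(context_window * fraction)
            for section, fraction in BUDGET_FRACTIONS.items()}

# allocate_budget(8192) -> {'system_prompt': 819, 'retrieved_context': 2867, ...}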

Evaluation Metrics

Track memory system performance with these metrics:

Retrieval Quality:

  • Precision@k: Fraction of retrieved documents that are relevant
  • Recall@k: Fraction of all relevant documents retrieved
  • MRR (Mean Reciprocal Rank): Average inverse rank of first relevant result
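
These retrieval metrics are straightforward to compute offline against a labeled set of relevant document IDs, for example:

from typing import Dict, List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    if not relevant:
        return 0.0
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / len(relevant)

def mean_reciprocal_rank(queries: List[Dict]) -> float:
    """Each query dict holds 'retrieved' (ranked ids) and 'relevant' (set of ids)."""
    ranks = []
    for q in queries:
        rank = next((i + 1 for i, doc_id in enumerate(q["retrieved"]) if doc_id in q["relevant"]), None)
        ranks.append(1.0 / rank if rank else 0.0)
    return sum(ranks) / len(ranks)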

Memory Effectiveness:

  • Context utilization rate: Average tokens used / max tokens
  • Eviction rate: Messages removed per 100 turns
  • Entity recall: Percentage of entities correctly retrieved
  • Hallucination rate: Responses containing ungrounded information

Latency:

  • P50, P95, P99 retrieval latency
  • End-to-end response time with memory

Getting Started

  1. Install dependencies:

    pip install openai tiktoken numpy sentence-transformers rank-bm25 chromadb
    
  2. Initialize OpenAI client:

    from openai import OpenAI
    client = OpenAI(api_key="your-api-key")
    
  3. Initialize vector database:

    # For local development
    import chromadb
    chroma_client = chromadb.Client()
    collection = chroma_client.create_collection("conversations")
    
    # Or use managed service (Pinecone v5+)
    from pinecone import Pinecone
    pc = Pinecone(api_key="your-key")
    index = pc.Index("your-index-name")
    
  4. Implement token counting:

    import tiktoken
    encoding = tiktoken.encoding_for_model("gpt-4")
    token_count = len(encoding.encode("Your text here"))
    
  5. Create memory instance:

    memory = ConversationalMemory(max_tokens=4000, model="gpt-4")
    
  6. Add messages with importance:

    memory.add_message("user", "My name is Alex and I live in SF", importance=0.9)
    memory.add_message("assistant", "Hello Alex! How can I help you today?")
    
  7. Generate responses:

    response = memory.generate_response("What's my name?")
    
  8. Configure monitoring:

    • Log token usage per turn
    • Track retrieval latency
    • Measure entity extraction accuracy
    • Monitor API error rates
