What is the Context Crisis?
The Context Crisis is the inflection point where naive context stuffing fails. Despite 200K-2M token windows, retrieval precision degrades past ~70K tokens, latency balloons quadratically, and costs become prohibitive. Capacity is not capability—sophisticated context management must intervene.
Frontier models advertise 200K, 500K, 2 million token context windows. The temptation: treat context as an infinite scratchpad. Dump entire codebases, legal libraries, conversation histories. Let the model figure it out.
The engineering reality contradicts this. Capacity is not synonymous with capability.
As context grows, retrieval precision degrades. Latency balloons. Costs become prohibitive. We're witnessing the emergence of a Context Crisis—the inflection point where naive context stuffing fails and sophisticated management must intervene.
The Physics of Context
The "context window" is not a text buffer. It's a manifestation of the Key-Value (KV) Cache in transformer attention—and managing it is the primary bottleneck in scaling agent behavior.
The Memory Reality
Every token in context requires storing Key and Value vectors at every layer. For a 70B-parameter model with full multi-head attention, a context of 100,000 tokens consumes over 100 gigabytes of VRAM—often exceeding the memory required for the model weights themselves.
In production environments, this creates memory pressure. One user's 128K context displaces capacity to serve ten users with shorter contexts. Context management is KV cache management. The goal: maximize semantic density while minimizing physical footprint.
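Back-of-envelope math makes the scale concrete. The layer and head counts below are illustrative (roughly 70B-class); real numbers vary with quantization and grouped-query attention.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size: a Key and a Value vector per token, per layer, per KV head."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value

# Illustrative 70B-class configs at fp16; real models differ.
full_mha = kv_cache_bytes(100_000, n_layers=80, n_kv_heads=64, head_dim=128)
gqa      = kv_cache_bytes(100_000, n_layers=80, n_kv_heads=8,  head_dim=128)

print(f"100K tokens, full multi-head attention: {full_mha / 1e9:.0f} GB")  # ~262 GB
print(f"100K tokens, grouped-query attention:   {gqa / 1e9:.0f} GB")       # ~33 GB
```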
The Latency Tax
Prefill phase: Before generating a single word, the model processes the entire input—system instructions, conversation history, retrieved documents. This manifests as "Time to First Token" (TTFT).
Stuffing 50 documents into context? Expect 10-30 second TTFT. The illusion of a real-time conversational partner evaporates.
Decode phase: Once generation begins, the model fetches the massive KV cache for every new token. Bloated context slows generation speed, making the agent feel sluggish.
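A rough, compute-bound estimate shows why. The hardware throughput and utilization below are assumptions, and the sketch ignores the quadratic attention term and memory bandwidth; it only captures the order of magnitude.

```python
def prefill_seconds(n_params, prompt_tokens, flops_per_sec, mfu=0.5):
    """Rough compute-bound prefill time: ~2 FLOPs per parameter per prompt token."""
    return (2 * n_params * prompt_tokens) / (flops_per_sec * mfu)

# Illustrative: 70B model, 100K-token prompt, ~1e15 FLOP/s accelerator at 50% utilization.
print(f"TTFT ~ {prefill_seconds(70e9, 100_000, 1e15):.0f} s")  # ~28 s
```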
The 200K window exists. That doesn't mean you should use it.
Lost in the Middle
Even with infinite VRAM and zero latency, semantic limits remain.
Research consistently shows LLM performance is not uniform across the context window. Models excel at retrieving information from the beginning (primacy bias) and end (recency bias) but struggle with information in the middle—the "dead zone."
In a naive RAG setup, the most critical chunk might land at position 10 of 20. Squarely in the dead zone. The model hallucinates or claims ignorance, despite the answer being present.
More context is often worse context. Additional documents that lower the signal-to-noise ratio make the model less likely to find the answer it already has. The solution isn't always "add more."
Taxonomy of Strategies
When the required context exceeds the window—or blows the latency and cost budget—engineers deploy strategies classified by how they reduce information load: selection, compression, or hierarchical paging.
Sliding Windows and Attention Sinks
The primitive approach: keep the last N tokens, discard the rest.
The Problem: When the earliest tokens are evicted, the model destabilizes. Those initial tokens function as "attention sinks"—collecting excess attention mass when other tokens don't find strong relevance elsewhere. Remove them, and output becomes incoherent.
StreamingLLM Solution: Pin the first few tokens permanently. Context becomes:
[Initial Tokens] + [Rolling Recent Window]
This lets models trained on 4K windows generalize to millions of tokens without fine-tuning. Solves fluency. Doesn't solve memory—facts that slide out are gone.
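A toy sketch of the policy (it tracks token ids, not real KV tensors, and is not the actual StreamingLLM implementation):

```python
from collections import deque

class SinkWindowCache:
    """Pin the first n_sink tokens forever; keep only the most recent n_window after that."""

    def __init__(self, n_sink=4, n_window=2048):
        self.sink = []                         # pinned "attention sink" tokens
        self.window = deque(maxlen=n_window)   # rolling recent window
        self.n_sink = n_sink

    def append(self, token_id):
        if len(self.sink) < self.n_sink:
            self.sink.append(token_id)
        else:
            self.window.append(token_id)       # oldest non-sink token falls off automatically

    def context(self):
        return self.sink + list(self.window)
```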
Summarization
Convert high-resolution text into lower-resolution semantic representations. Trade precision for longevity.
Incremental: Trigger when a threshold is reached. Condense the oldest K messages into a summary and prepend it to the remaining raw messages.
Recursive (Hierarchical): Split text into chunks, summarize each, then summarize the summaries. Creates a tree—root is high-level concept, leaves are details. Agent traverses: read summary, drill down as needed.
The Trade-off: "Summary Drift" or the "Telephone Game" effect. As summaries of summaries accumulate, specifics smooth out. Dates, names, error codes—lost in favor of "gist."
When to Trigger:
- Token threshold (simple)
- Semantic shift detection (embed turns, monitor cosine distance—spike indicates topic change)
- User-initiated ("Shall I summarize our progress?")
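The two automatic triggers fit in a few lines. This sketch assumes caller-supplied count_tokens and embed functions standing in for a real tokenizer and embedding model:

```python
import numpy as np

def should_summarize(messages, count_tokens, embed,
                     token_budget=8_000, drift_threshold=0.5):
    """Fire a summarization pass when either trigger hits."""
    # Trigger 1: token threshold.
    if sum(count_tokens(m) for m in messages) > token_budget:
        return True
    # Trigger 2: semantic shift: cosine distance between the last two turns spikes.
    if len(messages) >= 2:
        a, b = embed(messages[-2]), embed(messages[-1])
        cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        if 1.0 - cosine > drift_threshold:
            return True
    return False
```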
Compression
Unlike summarization, compression reduces token count while aiming to preserve the exact semantic meaning. Often extractive, not generative.
LLMLingua: Uses a small model (Llama-7B) to calculate perplexity of each token in the prompt. Low perplexity = expected = low information. High perplexity = signal.
Drop the low-perplexity tokens (stopwords, filler, redundant adjectives) to achieve up to 20x compression with minimal accuracy loss—particularly effective for RAG, where retrieved documents contain high redundancy.
LLMLingua-2: Trained encoder for token classification (keep vs. drop), better preserves named entities and negations.
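A stripped-down sketch of the perplexity idea, using GPT-2 as the scorer; the real LLMLingua uses a stronger small model plus budget controllers and sentence-level filtering:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
scorer = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def compress(prompt: str, keep_ratio: float = 0.5) -> str:
    """Keep the most surprising (high-perplexity) tokens; drop the predictable ones."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = scorer(ids).logits
    # Surprisal of token i given tokens 0..i-1 (the first token has no context, so skip it).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    surprisal = -log_probs.gather(1, ids[0, 1:, None]).squeeze(1)
    k = max(1, int(keep_ratio * surprisal.numel()))
    keep = torch.topk(surprisal, k).indices.sort().values + 1  # +1 offsets the skipped first token
    return tok.decode(torch.cat([ids[0, :1], ids[0][keep]]))
```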
Importance Scoring
Not all context is equal. Weight messages by importance, creating a priority queue.
H2O (Heavy Hitter Oracle): Identifies tokens frequently attended to during generation. Evicts tokens with low accumulated attention scores.
Result: dynamic sliding window shaped by importance, not recency. Critical instruction from 50 turns ago retained because it's a "heavy hitter." Casual greeting from 2 turns ago evicted.
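The bookkeeping behind that, in toy form (indices only; a real serving engine would drop the corresponding KV tensors):

```python
import numpy as np

class HeavyHitterCache:
    """H2O-style eviction sketch: keep tokens with high accumulated attention, plus recent ones."""

    def __init__(self, budget=512, n_recent=64):
        self.budget, self.n_recent = budget, n_recent
        self.scores = np.zeros(0)  # accumulated attention mass per cached token

    def step(self, attn_row):
        """attn_row: attention weights from the newly generated token to all cached tokens (incl. itself)."""
        self.scores = np.append(self.scores, 0.0)  # slot for the new token
        self.scores += attn_row
        if len(self.scores) > self.budget:
            candidates = self.scores[: -self.n_recent]  # the most recent tokens are never evicted
            victim = int(np.argmin(candidates))          # lowest accumulated attention
            self.scores = np.delete(self.scores, victim)
            return victim  # position the runtime would drop from the KV cache
        return None
```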
Production Systems
The taxonomy translates into hybrid architectures in real systems.
MemGPT / Letta: The Operating System
MemGPT reframes the LLM as an OS kernel managing virtual memory. It creates an illusion of infinite memory through paging.
Three-Tier Hierarchy:
- Core Memory (RAM): Always in context. Structured, mutable state—persona block, user facts.
- Recall Memory (Queue): Recent transactions, searchable, eventually evicted.
- Archival Memory (Disk): Vector database with full history.
The Innovation: The LLM manages its own memory via tool calls.
- core_memory_replace: User says "I'm moving to New York." The agent detects the state change and updates the Location field.
- archival_memory_search: User asks about something from last week. The agent pages it in from disk.
Heartbeat Control Flow: User input doesn't force immediate reply. The model chains reasoning steps—search archives, update memory, then respond. Trades latency for coherence.
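Illustrative shapes for those tools. The tool names match MemGPT's, but the bodies, the field contents, and the naive keyword search are stand-ins; Letta backs archival memory with a vector database and wires these to the LLM as tool calls.

```python
class AgentMemory:
    """Toy three-tier memory in the MemGPT style."""

    def __init__(self):
        self.core = {"persona": "Helpful assistant.",
                     "human": "Location: Seattle"}  # always in context
        self.recall = []    # recent transcript, eventually evicted
        self.archive = []   # full history, paged in on demand

    # Tool: rewrite part of core memory when the agent detects a durable state change.
    def core_memory_replace(self, block: str, old: str, new: str) -> None:
        self.core[block] = self.core[block].replace(old, new)

    # Tool: page old facts back in from "disk".
    def archival_memory_search(self, query: str, top_k: int = 3) -> list[str]:
        return [doc for doc in self.archive if query.lower() in doc.lower()][:top_k]

memory = AgentMemory()
memory.core_memory_replace("human", "Location: Seattle", "Location: New York")
```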
ChatGPT Memory: Background Consolidation
Different architecture. No explicit paging.
Dual Layer:
- User Profile: Structured facts ("User prefers concise answers")
- Episodic Memory: Vector store of conversation logs
The Process:
- After a session ends (or asynchronously), a background process extracts durable facts
- Updates User Profile namespace
- At new conversation start, relevant facts inject into system prompt
This explains why ChatGPT "knows" you across sessions without reading full history. It decouples "facts about the user" (small, high-value) from "conversation logs" (massive, noisy).
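A minimal sketch of that loop, with a hypothetical llm(prompt) completion callable standing in for the extraction model:

```python
EXTRACT_PROMPT = "List durable facts about the user from this conversation, one per line."

def consolidate(transcript: str, profile: set[str], llm) -> set[str]:
    """Run after a session (or asynchronously): distill durable facts into the user profile."""
    facts = llm(f"{EXTRACT_PROMPT}\n\n{transcript}")
    return profile | {line.strip() for line in facts.splitlines() if line.strip()}

def build_system_prompt(base: str, profile: set[str]) -> str:
    """At the start of a new conversation, inject the small, high-value profile, not the logs."""
    return base + "\n\nKnown about the user:\n" + "\n".join(sorted(profile))
```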
Cursor: Shadow Workspace
State-of-the-art for code context.
Shadow Workspace: A hidden, headless VS Code instance running in parallel, with full Language Server Protocol (LSP) support.
Context Gathering: AI can "go to definition" in the shadow workspace, read function signatures from other files, pull specific chunks into context. The model sees lints and type definitions before showing code to the user.
Codebase Indexing:
- Chunk by logical blocks (AST parsing—functions/classes, not lines)
- Decompose queries into sub-queries
- Budget allocation: active file (high), imported definitions (medium), related docs (low)
- Rerank: retrieve ~50 chunks, score relevance, discard the noise
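Cursor's allocator isn't public, but the budgeted packing pattern looks roughly like this sketch (the budgets and the caller-supplied count_tokens are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str    # "active_file" | "imported_def" | "related_doc"
    score: float   # reranker relevance, higher is better

BUDGETS = {"active_file": 4_000, "imported_def": 2_000, "related_doc": 1_000}  # tokens

def pack(chunks: list[Chunk], count_tokens) -> list[Chunk]:
    """Fill each source's token budget in relevance order; everything past budget is dropped."""
    spent = {source: 0 for source in BUDGETS}
    kept = []
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count_tokens(chunk.text)
        if spent[chunk.source] + cost <= BUDGETS[chunk.source]:
            kept.append(chunk)
            spent[chunk.source] += cost
    return kept
```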
Glean / Notion: Enterprise Search
Knowledge Graph: Beyond vector similarity. Maps relationships (Person → worked on → Document). Personalized graph weights retrieval by team, manager, viewing history.
Context Sufficiency Scoring: Before answering, evaluate if retrieved context is sufficient. Low confidence → ask clarifying questions rather than hallucinate.
Semantic-Aware Chunking: Align chunk boundaries with document structure (headers, paragraphs), not arbitrary token counts. Prevents splitting concepts in half.
Context Caching (Game Changer)
Claude's prompt caching and Gemini's context caching fundamentally change the economics.
The Mechanism: Provider caches KV pairs for the prompt prefix—system instructions, static documents. Subsequent requests only compute KV for the new query.
The Impact: Up to 90% latency reduction, 50-90% cost reduction on cached portion.
Architecture Shift: This revives context stuffing—for stable prefixes. Instead of aggressive summarization (which loses data), stuff the entire codebase into a cached prefix.
The Pattern:
- Static Block (docs, core rules) → Cached
- Dynamic Block (recent conversation) → Sliding window
- Query Block (user input) → Uncached
Constraint: prefix changes invalidate cache. Dynamic history can't be cached. Design for stability.
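A minimal sketch of the pattern using Anthropic's prompt caching, where a cache_control marker on the static system block tells the provider to reuse its KV prefix. The file path, model name, and conversation contents are illustrative.

```python
import anthropic

client = anthropic.Anthropic()

static_block = open("docs/api_reference.md").read()  # stable prefix: docs, core rules
recent_turns = [
    {"role": "user", "content": "Earlier question..."},
    {"role": "assistant", "content": "Earlier answer..."},
]

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": static_block,
        "cache_control": {"type": "ephemeral"},  # cache the KV for this prefix
    }],
    messages=recent_turns + [{"role": "user", "content": "New question here"}],  # uncached tail
)
print(response.content[0].text)
```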
Context Packing
Once chunks are retrieved, how you arrange them matters.
Lost in the Middle Mitigation
Place the most relevant chunk last (closest to user question). Place the second most relevant first (closest to system prompt). Fill the middle with lower-confidence chunks.
This exploits primacy and recency biases of the attention mechanism.
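A few lines make the ordering concrete (this assumes chunks arrive sorted most-relevant-first from the reranker):

```python
def pack_for_attention(chunks_by_relevance: list[str]) -> list[str]:
    """Place the #1 chunk last (next to the question), #2 first (next to the system prompt),
    and the lower-confidence rest in the middle."""
    if len(chunks_by_relevance) < 3:
        return list(reversed(chunks_by_relevance))  # best chunk still goes last
    best, second, *rest = chunks_by_relevance
    return [second] + rest + [best]
```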
Format Engineering
Use explicit XML tags:
<document title="API Reference">
content here
</document>

This prevents "context bleeding"—the model confusing retrieved document content with its own instructions.
The Boundaries
Reasoning Horizon Limit
Summarization is lossy. Summarize a 50-turn debugging session and the negative constraints disappear: "I tried X and it failed because Y" becomes "Tried X, failed." The agent loses the why and may retry the same failing strategy.
No amount of compression preserves perfect resolution of 100K tokens in 1K tokens.
Hallucination of Continuity
Systems like MemGPT create an illusion of continuity. The model "wakes up" anew at each step, reading reconstructed memory. It doesn't remember—it reads about remembering.
If reconstruction is flawed, persona fractures. This distinction matters for applications requiring consistent psychological modeling.
Cost of Compression
LLMLingua runs a 7B model to compress prompts before the GPT-4o call. That compute adds latency. For real-time voice agents, that's unacceptable.
No free lunch. Pay in tokens, or pay in compute.
The Shift: Context Engineering
The role of the AI engineer is shifting from prompt engineering to context engineering.
The future isn't larger windows. It's smarter orchestration of the windows we have.
The Three Pillars:
- Hierarchical Storage: Fast KV cache (short-term) + vector DB (long-term) + knowledge graph (structured)
- Adaptive Compression: Compress based on query complexity, not fixed rules
- Active Forgetting: Intelligently prune stale/resolved context to keep the window clean for new reasoning
The Context Crisis is a data availability problem. The solution isn't a bigger pipe—it's a smarter filter.