Context as a Resource
In a relational database, storage is cheap and elastic — you can add disks. In a distributed system, capacity can be scaled out by adding servers. But in an AI system, every interaction is bounded by a single, hard constraint: the context window — the fixed amount of text a language model can read and reason about in a single call.
Context is the scarcest resource in an AI system. Unlike compute or storage, you cannot provision more of it at runtime. When the window fills, older content has to be dropped or the request is rejected outright. Managing it well is the difference between an AI application that stays coherent over long sessions and one that degrades silently after a few minutes of conversation.
This tutorial explains what the context window is, what breaks when you mismanage it, and the three strategies production AI systems use to keep long-running sessions within budget.
The Context Window
A language model's context window is its working memory — the total amount of text, measured in tokens, that the model can reference when generating a response. Everything outside the window is invisible to the model. There is no partial visibility, no paging, and no implicit history. If a fact is not in the context window when the model generates a response, the model simply does not know it.
A token is roughly 3–4 characters of English text, or about ¾ of a word. "Hello world" is 2 tokens. "The user asked about machine learning" is about 7 tokens. A 200,000-token context window can hold approximately 150,000 words — roughly 500 pages of prose.
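The rule of thumb above can be turned into a quick estimator. This is only a sketch: it assumes a flat 4 characters per token and ¾ of a word per token, and the function names are illustrative. Real tokenizers differ by model, so short strings in particular can land well away from the estimate.

```python
import math

CHARS_PER_TOKEN = 4      # assumption: upper end of the 3-4 character rule of thumb
WORDS_PER_TOKEN = 0.75   # assumption: about three quarters of a word per token


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count. Not a real tokenizer."""
    if not text:
        return 0
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_words(token_count: int) -> int:
    """Rough number of English words that fit in a given token count."""
    return int(token_count * WORDS_PER_TOKEN)


print(estimate_words(200_000))  # 150000
```

For anything that affects billing or overflow handling, prefer the provider's own token counter over a heuristic like this.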
But the context window is not an empty container waiting to be filled with a conversation. In a real AI application, multiple components compete for that finite budget simultaneously:
What Fills a Context Window
Every token in the context window is a token that cannot be used for something else. In a production AI application, four distinct components compete for the same fixed budget: the system prompt, conversation history, retrieved documents, and the current request. Understanding this competition is the first step to managing context effectively.
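One way to make this competition explicit is to track the four components against a single budget. The sketch below is illustrative only: the `ContextBudget` class and the token figures in the example are assumptions, not values from any particular system.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Tracks how the four components share one fixed window (all values in tokens)."""
    window: int
    system_prompt: int
    history: int
    retrieved_docs: int
    current_request: int

    def used(self) -> int:
        return (self.system_prompt + self.history
                + self.retrieved_docs + self.current_request)

    def remaining(self) -> int:
        return self.window - self.used()

    def fits(self) -> bool:
        return self.remaining() >= 0


# Hypothetical numbers for a single call.
budget = ContextBudget(window=200_000, system_prompt=3_000, history=60_000,
                       retrieved_docs=40_000, current_request=1_000)
print(budget.remaining())  # 96000
```

Seeing the split per component makes it easier to decide which one to trim when a call gets close to the limit.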
Context Window Sizes in Practice
As of 2026, context windows have grown dramatically — but the effective useful range is smaller than headline numbers suggest:
| Model | Advertised Window | Practical Effective Range | Notes |
|---|---|---|---|
| Claude Sonnet / Opus 4.x | 200K tokens | ~100–130K tokens | 1M-token beta available at 2× input pricing for high-usage accounts |
| GPT-5.4 | 1M tokens | ~400–600K tokens | Massive jump from GPT-4o's 128K; one of the largest commercial context windows |
| Gemini 3 Pro | 1M tokens | ~400–600K tokens | Same 1M window as Gemini 2.5 series; practical limits depend on task complexity |
| Llama 4 Scout (open weights) | 10M tokens | ~2–5M tokens | Largest advertised context in this table; practical limits depend on hardware |
| Claude Haiku 4.x | 200K tokens | ~100–130K tokens | Fastest and cheapest; same window size as Sonnet |
Why "effective range" is less than advertised: A common rule of thumb is that LLMs perform reliably only within roughly 50–65% of their advertised context limit. Beyond that threshold, attention becomes less reliable and models begin missing information that is technically within the window. The reasons are explained under "Soft Degradation" below.
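The 50–65% rule of thumb is easy to apply as arithmetic. The helper below is a hypothetical convenience function, and the fractions are the rough figures quoted above rather than measured limits for any model.

```python
def effective_range(advertised_tokens: int,
                    low: float = 0.50,
                    high: float = 0.65) -> tuple[int, int]:
    """Return the (low, high) token range treated as reliably usable."""
    return round(advertised_tokens * low), round(advertised_tokens * high)


print(effective_range(200_000))  # (100000, 130000)
```

Treat the result as a planning target for how much context to send, not as a guarantee of quality inside that range.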
Why the Limit Exists: Quadratic Attention
The context window is bounded by the self-attention mechanism at the core of every transformer model. During inference, every token in the context attends to every other token — comparing itself to every other position to determine relevance. This is an O(n²) operation: if a 1,000-token context requires 1 unit of compute, a 2,000-token context requires ~4 units, and an 8,000-token context requires ~64 units.
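The scaling can be checked with a few lines of arithmetic. This sketch models only the quadratic attention term relative to a baseline length; it ignores every other cost in a forward pass, and the function name is illustrative.

```python
def relative_attention_cost(n_tokens: int, baseline_tokens: int = 1_000) -> float:
    """Attention compute relative to a baseline context, assuming pure O(n^2) growth."""
    return (n_tokens / baseline_tokens) ** 2


print(relative_attention_cost(2_000))               # 4.0
print(relative_attention_cost(8_000))               # 64.0
print(relative_attention_cost(1_000_000, 200_000))  # 25.0
```

The last line shows why a 5× larger window is far more than 5× the attention work.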
Modern implementations like FlashAttention reduce the memory footprint of attention from O(n²) to O(n), which is how today's models support 200K–1M token windows without running out of GPU memory. But the computational cost — the number of operations — remains quadratic. This is why latency and cost continue to scale with context length even as hardware and algorithms improve.
This quadratic growth has three practical consequences for production AI systems:
- Latency increases with context length. Every additional token in the input adds to the time before the model begins generating its response (the "time to first token"). A full 200K-token context call may take several seconds to start responding; a 20K-token call is typically an order of magnitude faster.
- Cost scales with token count. API providers pass the compute cost to users as a per-token charge on input. At the same per-token rate, a 200K-token input costs 20× as much as a 10K-token input.
- Hardware imposes hard ceilings. The maximum context window is ultimately constrained by GPU memory and compute. As hardware improves, windows grow — but the quadratic attention bottleneck means that growing from 200K to 1M tokens requires roughly 25× the attention compute, which must be covered by hardware advances or algorithmic improvements like sparse attention.
What Breaks When You Run Out
There are two ways to hit the context limit: hard overflow and soft degradation. Both matter for production systems, but soft degradation is more dangerous because it is invisible.
Hard Overflow
Hard overflow occurs when you send more tokens than the model accepts. Modern APIs return an explicit error (HTTP 400 validation failure) rather than silently truncating the input. This is intentional: silent truncation is more dangerous than an explicit error, because the model produces a response that appears normal while actually working from incomplete information.
The fix for hard overflow is straightforward: count tokens before sending using the model provider's token-counting API, and apply a context management strategy before the call.
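A pre-flight check might look like the sketch below. The `count_tokens` parameter stands in for whatever token counter your provider offers; `check_budget`, `ContextOverflowError`, and `reserve_for_output` are hypothetical names introduced for illustration.

```python
from typing import Callable


class ContextOverflowError(Exception):
    """Raised locally, before the API call, when the request would not fit."""


def check_budget(messages: list[str],
                 max_tokens: int,
                 count_tokens: Callable[[str], int],
                 reserve_for_output: int = 0) -> int:
    """Return the total input tokens, or raise if the request exceeds the limit."""
    total = sum(count_tokens(m) for m in messages)
    if total + reserve_for_output > max_tokens:
        raise ContextOverflowError(
            f"{total} input tokens + {reserve_for_output} reserved "
            f"exceeds limit of {max_tokens}"
        )
    return total


# Stand-in counter for the example: one token per whitespace-separated word.
def word_count(text: str) -> int:
    return len(text.split())


print(check_budget(["You are a helpful assistant.", "Summarize this."], 100, word_count))  # 7
```

Catching the overflow locally lets the application trim or summarize the context and retry, instead of surfacing a 400 error to the user.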
Soft Degradation: The "Lost in the Middle" Problem
Soft degradation begins well before you hit the hard limit. A landmark 2023 study led by Stanford researchers found that LLM accuracy follows a U-shaped curve across long contexts: models use content at the beginning and end of the context window well, and content in the middle significantly less reliably.
The study found that performance on multi-document question answering dropped by more than 20% for content positioned in the middle of a long context, compared to content at the start or end. In some configurations, performance for middle-positioned content fell below the baseline of providing no documents at all.
The root cause is structural: transformer attention distributes its "attention budget" across all tokens in the context, and as the context grows longer, the relative weight any single token in the middle receives necessarily decreases. This is reinforced by how most modern LLMs encode position. Rotary Position Embedding (RoPE), used by Llama, Mistral, and many other models, encodes the relative distance between tokens, which tends to make attention scores decay as tokens get farther apart. The result is a recency bias that favors the end of the context; combined with the primacy effect observed in the study, the model effectively "reads" the context with stronger attention near its edges and weaker attention toward the center.
Practical consequences:
- Do not place critical instructions in the middle of a long context. Put the most important constraints at the very beginning of the system prompt or at the very end of the most recent turn.
- When using RAG, put the most relevant retrieved document first — not buried in the middle of a large batch.
- Do not assume that because content fits in the context window, the model will use it reliably. Position matters as much as presence.
- The effective useful context is typically 50–65% of the advertised maximum — not the full window.
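The placement advice above can be sketched as a small prompt assembler. It assumes each retrieved document arrives with a relevance score from your retriever; `assemble_prompt` and the "Reminder:" wording are illustrative choices, not a prescribed format.

```python
def assemble_prompt(critical_instructions: str,
                    documents: list[tuple[float, str]],
                    question: str) -> str:
    """Place critical instructions at both edges and the best document first.

    `documents` is a list of (relevance_score, text) pairs; higher is more relevant.
    """
    ranked = sorted(documents, key=lambda d: d[0], reverse=True)
    parts = [critical_instructions]              # start of context
    parts += [text for _, text in ranked]        # most relevant document first
    parts.append(question)
    parts.append("Reminder: " + critical_instructions)  # end of context
    return "\n\n".join(parts)


prompt = assemble_prompt(
    "Answer only from the documents provided.",
    [(0.41, "Doc B: shipping policy"), (0.93, "Doc A: refund policy")],
    "How long do refunds take?",
)
print(prompt)
```

Repeating the instruction costs a few tokens, which is usually a small price for keeping it at both edges of the context.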
| Symptom | Likely Cause | What To Check |
|---|---|---|
| Model forgets instructions given earlier in the conversation | Context overflow or 'lost in the middle' | Count total tokens; move critical instructions to the beginning of the system prompt or the end of the latest user turn |
| Model ignores a retrieved document that clearly contains the answer | Document buried in the middle of a large retrieved batch | Reorder retrieved results so the most relevant document appears first; reduce the total number of retrieved chunks |
| API returns a 400 error mentioning token limits | Hard context overflow | Count tokens before sending; apply sliding window or summarization to trim the context |
| Model responses become vague or repetitive in long sessions | Soft degradation from accumulated history | Summarize or trim history; check total token usage against model limit |
| Responses get slower and more expensive over a long session | Growing context increasing per-call cost and latency | Implement a context management strategy to cap token usage at each turn |
Context Management Strategies
Once you recognize that context is a finite, degradable resource, the design question becomes: how do you keep long-running sessions within a manageable token budget without losing important information?
There are three main strategies, each making a different trade-off between simplicity, memory fidelity, and infrastructure cost.
Strategy 1: Sliding Window
The simplest approach is to keep only the most recent N turns of the conversation in the context, discarding everything older. The "window" slides forward with each new turn: with N = 20, for example, turn 1 is removed when turn 21 arrives.
Sliding Window Context Management
Keep only the most recent N turns in context, discarding older history entirely. Simple to implement with predictable token usage, but older information is permanently lost — the model cannot reference a decision or requirement from 30 turns ago.
A practical refinement — pinned messages: Mark certain turns as "always keep" regardless of age: the initial requirements document, a user's stated constraint, a critical architectural decision. The sliding window prunes only unpinned turns. This preserves the most important long-lived context while still bounding total token usage.
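A sliding window with pinned messages can be sketched in a few lines. The `Turn` dataclass, the `pinned` flag, and the `sliding_window` function are hypothetical names; the sketch bounds the window by turn count, whereas a production version would more likely bound it by tokens.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    content: str
    pinned: bool = False  # pinned turns survive pruning regardless of age


def sliding_window(turns: list[Turn], max_unpinned: int) -> list[Turn]:
    """Keep every pinned turn plus the most recent `max_unpinned` unpinned turns."""
    unpinned = [i for i, t in enumerate(turns) if not t.pinned]
    # Guard the zero case: a slice of [-0:] would keep everything.
    keep = set(unpinned[-max_unpinned:]) if max_unpinned > 0 else set()
    # Original order is preserved so the conversation still reads chronologically.
    return [t for i, t in enumerate(turns) if t.pinned or i in keep]


history = [Turn("user", "Requirements: must run offline", pinned=True)]
history += [Turn("user", f"message {n}") for n in range(2, 26)]
window = sliding_window(history, max_unpinned=20)
print(len(window))  # 21
```

Pinned turns do not count against the window here, so the number of pins needs its own cap if total token usage has to stay bounded.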
Strategy 2: Conversation Summarization
Instead of discarding old turns, compress them. When the conversation grows long, send the oldest portion to the model and ask it to produce a compact summary, then replace those turns with the summary. The summary preserves the semantic meaning of what was discussed without retaining every word verbatim.
Conversation Summarization
Periodically compress old conversation history into a compact summary. The model receives the summary plus recent verbatim turns — preserving the key facts and decisions from early in the conversation without paying the full token cost of retaining every turn.
Rolling summarization: Rather than summarizing a large batch all at once, a more robust approach is to maintain a running summary that is updated as each new batch of turns is added — using the previous summary as a starting point. This avoids large one-time summarization costs and keeps the summary current.
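The rolling update loop can be sketched as follows. The `llm` callable is a stand-in for a real model call, and `RollingSummary`, `batch_size`, and the `{summary}` / `{new_turns}` placeholders are assumptions made for illustration.

```python
from typing import Callable


class RollingSummary:
    """Maintains a running summary that is refreshed every `batch_size` turns."""

    def __init__(self, llm: Callable[[str], str], prompt_template: str,
                 batch_size: int = 4):
        self.llm = llm                          # hypothetical: prompt in, text out
        self.prompt_template = prompt_template  # must contain {summary} and {new_turns}
        self.batch_size = batch_size
        self.summary = ""
        self.pending: list[str] = []            # turns not yet folded into the summary

    def add_turn(self, turn: str) -> None:
        self.pending.append(turn)
        if len(self.pending) >= self.batch_size:
            self._update()

    def _update(self) -> None:
        prompt = self.prompt_template.format(
            summary=self.summary or "(none yet)",
            new_turns="\n".join(self.pending),
        )
        self.summary = self.llm(prompt)  # previous summary is the starting point
        self.pending.clear()

    def context(self) -> list[str]:
        """Summary (if any) followed by the recent turns kept verbatim."""
        head = [f"Summary of earlier conversation: {self.summary}"] if self.summary else []
        return head + list(self.pending)
```

The prompt template is left as a constructor argument so the summarization wording can be tuned without touching the loop.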
The update prompt looks like this:
"Here is the conversation summary so far: [summary]. Here are the new turns since the last update: [new turns]. Update the summary to incorporate the new turns, preserving all key decisions, preferences, and open questions."
Strategy 3: Retrieval-Augmented Memory
For the longest sessions — multi-hour coding projects, persistent user assistants, customer histories spanning weeks — neither discarding nor summarizing old turns is sufficient. The solution is to store the full conversation history externally (in a database or vector store) and retrieve only the semantically relevant portions at inference time.
This applies the same retrieval pattern as RAG (covered in the next tutorial section) to conversation history itself.
Retrieval-Augmented Memory
Store all conversation history in a vector database. At each turn, embed the current query and retrieve only the most semantically similar past turns. Only those relevant snippets are injected into the context — keeping token usage low regardless of how many turns the session has accumulated.
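The retrieval loop can be illustrated without any external services. In the sketch below, a bag-of-words vector and cosine similarity stand in for a real embedding model and vector database; this is a deliberate simplification, and `ConversationMemory` is a hypothetical class name.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class ConversationMemory:
    """Stores every turn; returns only the k most similar to the current query."""

    def __init__(self) -> None:
        self.turns: list[tuple[str, Counter]] = []

    def add(self, turn: str) -> None:
        self.turns.append((turn, embed(turn)))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self.turns, key=lambda t: cosine(q, t[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = ConversationMemory()
memory.add("We decided to use PostgreSQL for the orders database")
memory.add("The user prefers dark mode in the dashboard")
memory.add("Deploy target is Kubernetes on AWS")
print(memory.retrieve("Which database did we choose for orders?", k=1))
```

Word overlap misses paraphrases that a learned embedding would catch, so this shows the shape of the pattern rather than its retrieval quality.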
The hybrid pattern: In practice, most production systems combine all three strategies in layers:
- Recent turns verbatim — always keep the last 3–5 turns unchanged (the immediate conversational thread)
- Retrieval-augmented memory — semantically retrieve relevant older turns from a vector store
- Rolling summary — inject a concise rolling summary of the overall conversation as a fallback for anything retrieval misses
- Sliding window as the ceiling — never let total context exceed the token budget; prune if the combined layers overflow
This layered approach provides strong short-term coherence (verbatim recent turns), good long-term recall (retrieval), and a safety net (the summary) when retrieval misses something important.
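The layers can be combined in a single assembly step. This is a sketch under stated assumptions: recent turns get first claim on the budget, then the summary, then retrieved snippets in relevance order. That priority order, the `build_context` name, and the injected `count_tokens` counter are all illustrative choices rather than a standard.

```python
from typing import Callable


def build_context(summary: str,
                  retrieved: list[str],
                  turns: list[str],
                  budget: int,
                  count_tokens: Callable[[str], int],
                  keep_recent: int = 4) -> list[str]:
    """Assemble summary + retrieved snippets + recent turns within `budget` tokens."""
    recent = list(turns[-keep_recent:]) if keep_recent > 0 else []
    # Ceiling: if even the recent turns overflow, drop the oldest of them first.
    while recent and sum(count_tokens(t) for t in recent) > budget:
        recent.pop(0)
    used = sum(count_tokens(t) for t in recent)

    older: list[str] = []
    if summary and used + count_tokens(summary) <= budget:
        older.append(summary)
        used += count_tokens(summary)
    for snippet in retrieved:       # assumed sorted, most relevant first
        if snippet in recent:       # already present verbatim
            continue
        cost = count_tokens(snippet)
        if used + cost > budget:
            break
        older.append(snippet)
        used += cost
    return older + recent           # older material first, latest turns last
```

Placing the recent turns last keeps the immediate conversational thread at the end of the context, where position effects favor it.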
Choosing a Strategy
| Strategy | Memory Fidelity | Infrastructure | Token Usage | Best For |
|---|---|---|---|---|
| Sliding Window | Low — old turns permanently lost | None | Predictably bounded | Short-horizon tasks, real-time Q&A, cost-sensitive high-volume apps |
| Conversation Summarization | Medium — semantic meaning preserved, exact detail lost | Extra LLM call for summarization | Low, grows slowly | Long task sessions, design discussions, onboarding flows |
| Retrieval-Augmented Memory | High — all turns stored, relevant ones retrieved | Vector database + embedding model | Low and stable (query-driven) | Persistent assistants, multi-session history, customer support |
| Hybrid (all three) | Highest — combines strengths of each | Full stack | Precisely controlled | Production AI applications requiring robust long-term coherence |
Context in Your AI Coding Workflow
Context management is not just an infrastructure concern for the AI applications you build — it directly affects how you work with AI coding agents like Claude Code, Cursor, or GitHub Copilot. Every AI coding agent operates under the same constraints: a finite context window, a "lost in the middle" attention pattern, and degraded performance as the session grows.
Context Management in AI Coding Sessions
An AI coding agent's effectiveness is directly tied to how well the context window is used. The same principles that govern production AI systems — finite budget, position effects, and strategic eviction — apply to every coding session. Engineering the context for an agent session is the same discipline as engineering it for a production system.
A note on CLAUDE.md as a context strategy: The CLAUDE.md pattern is, at its core, a context management technique. By encoding architectural decisions, naming conventions, and non-functional requirements in a persistent instruction file rather than in conversation history, you ensure that critical constraints are always in the context window at zero history cost — they are part of the system prompt, not the conversation. Well-written instruction files are among the highest-leverage investments in a production agentic coding workflow.
Summary
| Concept | Key Point |
|---|---|
| The context window | The model's working memory — finite, non-expandable at runtime. Everything outside the window is invisible to the model |
| Token budget competition | System prompt, conversation history, retrieved documents, and the current request all compete for the same fixed budget |
| Quadratic attention cost | Doubling tokens roughly quadruples attention compute; FlashAttention improves memory efficiency but compute cost remains quadratic — latency and cost still scale with context length |
| Hard overflow | Sending more tokens than the model accepts returns an explicit error; count tokens before sending and apply a strategy proactively |
| Soft degradation | Performance degrades before the hard limit — the 'lost in the middle' effect means models attend poorly to content in the middle of a long context |
| Effective window | Roughly 50–65% of the advertised limit is reliably useful; position matters as much as presence |
| Sliding window | Keep last N turns, discard the rest — zero infrastructure, predictable cost, loses long-term memory permanently |
| Conversation summarization | Compress old history into a compact summary — preserves semantic meaning, requires an extra LLM call, lossy for exact details |
| Retrieval-augmented memory | Store all history in a vector store, retrieve relevant turns at inference time — highest fidelity, highest infrastructure cost |
| Hybrid approach | Combine all three: recent turns verbatim + retrieved relevant history + rolling summary + sliding window as a ceiling |
| AI coding workflow | Use fresh sessions per task, instruction files for persistent constraints, specific file references, and requirements stated at the start and end of each turn |
The engineers who build reliable AI systems do not simply load more context and hope the model figures it out. They design the context window deliberately — the same way they design a database schema or a cache eviction policy — thinking carefully about what goes in, what gets evicted, what gets compressed, and what gets retrieved on demand.
Sources:
- Lost in the Middle: How Language Models Use Long Contexts (Stanford / UC Berkeley, 2023)
- Context Window Overflow in 2026: Fix LLM Errors Fast — Redis
- Context Engineering Best Practices — Redis
- Effective Context Engineering for AI Agents — Anthropic Engineering
- Context Window Management: Strategies for Long-Context AI Agents — Maxim AI
- LLM Token Limits: Every Model's Context Window Compared (2026) — Morph
- Understanding LLM Performance Degradation — Stefano Demiliani
- The 'Lost in the Middle' Problem — DEV Community
- Context Engineering Guide — LlamaIndex