Context as a Resource

In a relational database, storage is cheap and elastic — you can add disks. In a distributed system, capacity can be scaled out by adding servers. But in an AI system, every interaction is bounded by a single, hard constraint: the context window — the fixed amount of text a language model can read and reason about in a single call.

Context is the scarcest resource in an AI system. Unlike compute or storage, you cannot provision more of it at runtime. When the window fills, something has to give: earlier content is dropped and the model effectively forgets it, or the API rejects the request outright. Managing it well is the difference between an AI application that stays coherent over long sessions and one that degrades silently after a few minutes of conversation.

This tutorial explains what the context window is, what breaks when you mismanage it, and the three strategies production AI systems use to keep long-running sessions within budget.

The Context Window

A language model's context window is its working memory — the total amount of text, measured in tokens, that the model can reference when generating a response. Everything outside the window is invisible to the model. There is no partial visibility, no paging, and no implicit history. If a fact is not in the context window when the model generates a response, the model simply does not know it.

A token is roughly 3–4 characters of English text, or about ¾ of a word. "Hello world" is 2 tokens. "The user asked about machine learning" is about 7 tokens. A 200,000-token context window can hold approximately 150,000 words — roughly 500 pages of prose.
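
As a rough illustration of these ratios, the sketch below estimates a token count from character length and converts a token budget into words and pages. The 4-characters-per-token and 300-words-per-page figures are assumptions chosen for illustration; real tokenizers vary by model and language, so use the provider's tokenizer when accuracy matters.

```python
import math

CHARS_PER_TOKEN = 4      # assumption: upper end of the 3-4 character rule of thumb
WORDS_PER_TOKEN = 0.75   # from the "about 3/4 of a word" rule of thumb
WORDS_PER_PAGE = 300     # assumption: a typical page of prose


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count; not a real tokenizer."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def tokens_to_words(tokens: int) -> int:
    """Approximate number of English words that fit in a token budget."""
    return round(tokens * WORDS_PER_TOKEN)


def tokens_to_pages(tokens: int) -> int:
    """Approximate number of prose pages that fit in a token budget."""
    return round(tokens_to_words(tokens) / WORDS_PER_PAGE)


print(tokens_to_words(200_000))  # 150000
print(tokens_to_pages(200_000))  # 500
```

The heuristic is deliberately crude: it estimates 3 tokens for "Hello world", where the count given above is 2. It is good enough for budgeting, not for enforcing a hard limit.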

But the context window is not an empty container waiting to be filled with a conversation. In a real AI application, multiple components compete for that finite budget simultaneously:

What Fills a Context Window

Every token in the context window is a token that cannot be used for something else. In a production AI application, four distinct components compete for the same fixed budget: the system prompt, conversation history, retrieved documents, and the current request. Understanding this competition is the first step to managing context effectively.
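
To make the competition concrete, the sketch below tallies the four components against a single window. The class name, field names, and the sample numbers are all hypothetical. The `reserved_output` field is an added assumption: many APIs count generated tokens against the same window, so it can be useful to hold some headroom for the reply.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Token counts for each component competing for one context window."""
    window: int                 # model's maximum context size in tokens
    system_prompt: int
    history: int
    retrieved_docs: int
    current_request: int
    reserved_output: int = 0    # assumption: headroom kept for the reply

    def used(self) -> int:
        return (self.system_prompt + self.history + self.retrieved_docs
                + self.current_request + self.reserved_output)

    def remaining(self) -> int:
        return self.window - self.used()

    def fits(self) -> bool:
        return self.remaining() >= 0


budget = ContextBudget(window=200_000, system_prompt=3_000, history=120_000,
                       retrieved_docs=60_000, current_request=2_000,
                       reserved_output=8_000)
print(budget.used(), budget.remaining(), budget.fits())  # 193000 7000 True
```

In this made-up example the call still fits, but history and retrieved documents together consume 90% of the window, which is where trimming effort would pay off first.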

Context Window Sizes in Practice

As of 2026, context windows have grown dramatically — but the effective useful range is smaller than headline numbers suggest:

| Model | Advertised Window | Practical Effective Range | Notes |
|---|---|---|---|
| Claude Sonnet / Opus 4.x | 200K tokens | ~100–130K tokens | 1M-token beta available at 2× input pricing for high-usage accounts |
| GPT-5.4 | 1M tokens | ~400–600K tokens | Massive jump from GPT-4o's 128K; one of the largest commercial context windows |
| Gemini 3 Pro | 1M tokens | ~400–600K tokens | Same 1M window as Gemini 2.5 series; practical limits depend on task complexity |
| Llama 4 (open source) | 10M tokens | ~2–5M tokens | Largest available context; practical limits depend on hardware |
| Claude Haiku 4.x | 200K tokens | ~100–130K tokens | Fastest and cheapest; same window size as Sonnet |

Why "effective range" is less than advertised: Long-context evaluations commonly find that LLMs perform reliably over only part of their advertised limit. This tutorial uses roughly 50–65% as a working rule of thumb, and the table's estimates for the largest windows are lower still. Beyond that threshold, attention becomes less reliable and models begin missing information that is technically within the window. The sections below on quadratic attention and soft degradation explain why.
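
A minimal sketch of that rule of thumb, assuming the 50–65% band applies to the model in question. The function name and defaults are illustrative; the right percentages for a given model and task are best found by measuring, not by formula.

```python
def effective_range(advertised_tokens: int, low_pct: int = 50,
                    high_pct: int = 65) -> tuple[int, int]:
    """Return the (low, high) token range treated as reliably usable."""
    return (advertised_tokens * low_pct // 100,
            advertised_tokens * high_pct // 100)


print(effective_range(200_000))  # (100000, 130000)
```

For a 200K window this reproduces the ~100–130K range shown in the table.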

Why the Limit Exists: Quadratic Attention

The context window is bounded by the self-attention mechanism at the core of every transformer model. To process a context, each token is compared against the other tokens in it to determine relevance. This is an O(n²) operation in the context length n: if a 1,000-token context requires 1 unit of attention compute, a 2,000-token context requires ~4 units, and an 8,000-token context requires ~64 units.
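
The scaling can be checked with a few lines of arithmetic. This sketch assumes pure O(n²) growth and ignores the parts of a model that scale linearly with length; the function name and the 1,000-token baseline are illustrative choices.

```python
def relative_attention_cost(tokens: int, baseline_tokens: int = 1_000) -> float:
    """Attention compute relative to a baseline, assuming pure O(n^2) scaling."""
    return (tokens / baseline_tokens) ** 2


for n in (1_000, 2_000, 8_000):
    print(n, relative_attention_cost(n))
# 1000 1.0
# 2000 4.0
# 8000 64.0
```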

Modern implementations like FlashAttention reduce the memory footprint of attention from O(n²) to O(n), which is how today's models support 200K–1M token windows without running out of GPU memory. But the computational cost — the number of operations — remains quadratic. This is why latency and cost continue to scale with context length even as hardware and algorithms improve.
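
To see why the memory saving matters, the sketch below computes the size of the full n × n score matrix that a naive implementation would materialize. It assumes 2-byte values (fp16 or bf16) and counts a single attention head in a single layer; the function name is illustrative. FlashAttention avoids holding this matrix in memory by computing attention in tiles.

```python
def naive_score_matrix_bytes(tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to materialize one full n x n attention score matrix."""
    return tokens * tokens * bytes_per_value


for n in (8_000, 200_000):
    gb = naive_score_matrix_bytes(n) / 1e9
    print(f"{n:>7} tokens -> {gb:,.3f} GB per head, per layer")
```

Under these assumptions an 8,000-token context needs about 0.128 GB for that one matrix, while a 200,000-token context would need about 80 GB, which is more than most single accelerators hold in total.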

This quadratic growth has three practical consequences for production AI systems:

  • Latency increases with context length. Every additional token in the input adds to the time before the model begins generating its response (the "time to first token"). A full 200K-token context call may take several seconds to start responding; a 20K-token call is typically an order of magnitude faster.
  • Cost scales with token count. API providers pass the compute cost to users as a per-token charge on input. Billing is linear in tokens, so at the same per-token rate a 200K-token input costs 20× more than a 10K-token input.
  • Hardware imposes hard ceilings. The maximum context window is ultimately constrained by GPU memory and compute. As hardware improves, windows grow — but the quadratic attention bottleneck means that growing from 200K to 1M tokens requires roughly 25× the attention compute, which must be covered by hardware advances or algorithmic improvements like sparse attention.
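
The cost consequence can be sketched with simple arithmetic. The price below is a hypothetical placeholder, not any provider's actual rate, and `session_input_tokens` assumes every call resends the full history with each turn adding the same number of tokens.

```python
from decimal import Decimal

PRICE_PER_MILLION_INPUT_TOKENS = Decimal("3.00")  # hypothetical USD rate


def input_cost(tokens: int,
               price_per_million: Decimal = PRICE_PER_MILLION_INPUT_TOKENS) -> Decimal:
    """Input-side cost of one call at a flat per-token rate."""
    return price_per_million * tokens / Decimal(1_000_000)


def session_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when every call resends the whole history."""
    return sum(tokens_per_turn * t for t in range(1, turns + 1))


small = input_cost(10_000)
large = input_cost(200_000)
print(large / small)                  # 20
print(session_input_tokens(50, 500))  # 637500
```

The second figure shows why unmanaged sessions get expensive: 50 turns of 500 tokens each hold only 25,000 tokens of conversation, but resending the growing history on every call bills 637,500 input tokens in total.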

What Breaks When You Run Out

There are two ways to hit the context limit: hard overflow and soft degradation. Both matter for production systems, but soft degradation is more dangerous because it is invisible.

Hard Overflow

Hard overflow occurs when you send more tokens than the model accepts. Most current APIs return an explicit error (typically an HTTP 400 validation failure) rather than silently truncating the input. This is intentional: silent truncation is more dangerous than an explicit error, because the model produces a response that appears normal while actually working from incomplete information.

The fix for hard overflow is straightforward: count tokens before sending using the model provider's token-counting API, and apply a context management strategy before the call.
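
A minimal sketch of such a pre-flight check follows. The counter is passed in as a function so that the provider's own token-counting call can be plugged in; `rough_count` is only a placeholder heuristic, and every name here is hypothetical.

```python
from typing import Callable


class ContextOverflowError(ValueError):
    """Raised locally before an oversized request is sent."""


def check_fits(messages: list[str], max_input_tokens: int,
               count_tokens: Callable[[str], int]) -> int:
    """Return the total token count, or raise if it exceeds the limit."""
    total = sum(count_tokens(m) for m in messages)
    if total > max_input_tokens:
        raise ContextOverflowError(
            f"{total} tokens exceeds limit of {max_input_tokens}; "
            "trim or summarize before sending")
    return total


def rough_count(text: str) -> int:
    # Placeholder counter (about 4 characters per token); swap in the
    # provider's token-counting call for real use.
    return max(1, len(text) // 4)


messages = ["You are a helpful assistant.", "Summarize this file."]
print(check_fits(messages, max_input_tokens=1_000, count_tokens=rough_count))
```

Failing locally keeps the decision about what to trim inside the application, instead of discovering the problem from an API error after the request has already been built.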

Soft Degradation: The "Lost in the Middle" Problem

Soft degradation begins well before you hit the hard limit. A widely cited 2023 study led by Stanford researchers ("Lost in the Middle", Liu et al.) found that LLMs show a U-shaped performance curve across long contexts: they use content at the beginning and end of the context window most reliably, and content in the middle significantly less.

The study found that performance on multi-document question answering dropped by more than 20% for content positioned in the middle of a long context, compared to content at the start or end. In some configurations, performance for middle-positioned content fell below the baseline of providing no documents at all.

The root cause is partly structural: softmax attention spreads a fixed budget of weight across all tokens in the context, so as the context grows longer, the share any single token can receive shrinks. Position encoding reinforces this. Rotary Position Embedding (RoPE), used by Llama, Mistral, and many other models, encodes the relative distance between tokens, which tends to make attention scores decay as tokens get farther apart. That favors the most recent tokens. Combined with the disproportionate weight models tend to give the first tokens of a context, the result is that the model effectively "reads" the context with stronger attention near its edges and weaker attention toward the center.

Practical consequences:

  • Do not place critical instructions in the middle of a long context. Put the most important constraints at the very beginning of the system prompt or at the very end of the most recent turn.
  • When using RAG, put the most relevant retrieved document first — not buried in the middle of a large batch.
  • Do not assume that because content fits in the context window, the model will use it reliably. Position matters as much as presence.
  • The effective useful context is typically 50–65% of the advertised maximum, not the full window.

| Symptom | Likely Cause | What To Check |
|---|---|---|
| Model forgets instructions given earlier in the conversation | Context overflow or "lost in the middle" | Count total tokens; move critical instructions to the beginning of the system prompt or the end of the latest user turn |
| Model ignores a retrieved document that clearly contains the answer | Document buried in the middle of a large retrieved batch | Reorder retrieved results so the most relevant document appears first; reduce the total number of retrieved chunks |
| API returns a 400 error mentioning token limits | Hard context overflow | Count tokens before sending; apply sliding window or summarization to trim the context |
| Model responses become vague or repetitive in long sessions | Soft degradation from accumulated history | Summarize or trim history; check total token usage against model limit |
| Responses get slower and more expensive over a long session | Growing context increasing per-call cost and latency | Implement a context management strategy to cap token usage at each turn |
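
One way to act on the positioning advice is to reorder retrieved documents before building the prompt. The sketch below keeps the most relevant document first, as recommended above, and additionally pushes the weakest documents toward the middle. That edge-first interleaving is one possible heuristic and an assumption of this example, not something the study prescribes; the function name and `top_k` parameter are hypothetical.

```python
from typing import Optional


def order_for_edges(docs_by_relevance: list[str],
                    top_k: Optional[int] = None) -> list[str]:
    """Place the strongest documents at the edges, weakest in the middle.

    Input must be sorted most relevant first. Even ranks fill the front and
    odd ranks fill the back in reverse, so rank 0 is first and rank 1 is last.
    `top_k` optionally caps how many documents are kept at all.
    """
    docs = docs_by_relevance[:top_k]
    front = docs[0::2]
    back = docs[1::2]
    return front + back[::-1]


ranked = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]  # doc_a is most relevant
print(order_for_edges(ranked))
# ['doc_a', 'doc_c', 'doc_e', 'doc_d', 'doc_b']
```

Capping with `top_k` addresses the other half of the advice: fewer retrieved chunks means a shorter middle for anything to get lost in.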

Context Management Strategies

Once you recognize that context is a finite, degradable resource, the design question becomes: how do you keep long-running sessions within a manageable token budget without losing important information?

There are three main strategies, each making a different trade-off between simplicity, memory fidelity, and infrastructure cost.

Strategy 1: Sliding Window

The simplest approach is to keep only the most recent N turns of the conversation in the context, discarding everything older. The "window" slides forward with each new turn: with N = 20, when turn 21 arrives, turn 1 is removed.

Sliding Window Context Management

Keep only the most recent N turns in context, discarding older history entirely. Simple to implement with predictable token usage, but older information is permanently lost — the model cannot reference a decision or requirement from 30 turns ago.

A practical refinement — pinned messages: Mark certain turns as "always keep" regardless of age: the initial requirements document, a user's stated constraint, a critical architectural decision. The sliding window prunes only unpinned turns. This preserves the most important long-lived context while still bounding total token usage.
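
A minimal sketch of a sliding window with pinned messages, assuming turns are held in a simple list. The `Turn` class and its `pinned` flag are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    content: str
    pinned: bool = False   # pinned turns survive pruning regardless of age


def sliding_window(turns: list[Turn], max_unpinned: int) -> list[Turn]:
    """Keep every pinned turn plus the most recent `max_unpinned` others."""
    unpinned_idx = [i for i, t in enumerate(turns) if not t.pinned]
    keep = set(unpinned_idx[-max_unpinned:]) if max_unpinned > 0 else set()
    # Preserve the original order so the conversation still reads chronologically.
    return [t for i, t in enumerate(turns) if t.pinned or i in keep]


history = [Turn("user", "Requirement: must run offline", pinned=True)]
history += [Turn("user", f"message {n}") for n in range(1, 6)]
for turn in sliding_window(history, max_unpinned=2):
    print(turn.content)
# Requirement: must run offline
# message 4
# message 5
```

Pinned turns still consume tokens on every call, so the total stays bounded only if pinning is used sparingly.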

Strategy 2: Conversation Summarization

Instead of discarding old turns, compress them. When the conversation grows long, send the oldest portion to the model and ask it to produce a compact summary, then replace those turns with the summary. The summary preserves the semantic meaning of what was discussed without retaining every word verbatim.

Conversation Summarization

Periodically compress old conversation history into a compact summary. The model receives the summary plus recent verbatim turns — preserving the key facts and decisions from early in the conversation without paying the full token cost of retaining every turn.

Rolling summarization: Rather than summarizing a large batch all at once, a more robust approach is to maintain a running summary that is updated as each new batch of turns is added — using the previous summary as a starting point. This avoids large one-time summarization costs and keeps the summary current.

The update prompt looks like this:

"Here is the conversation summary so far: [summary]. Here are the new turns since the last update: [new turns]. Update the summary to incorporate the new turns, preserving all key decisions, preferences, and open questions."
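
The update loop around that prompt might look like the sketch below. The model call is passed in as a plain function so the example runs offline; `fake_complete` is a stand-in, and in a real system it would be replaced by a call to whichever model API is in use. All names are hypothetical.

```python
from typing import Callable

UPDATE_TEMPLATE = (
    "Here is the conversation summary so far: {summary}. "
    "Here are the new turns since the last update: {new_turns}. "
    "Update the summary to incorporate the new turns, preserving all key "
    "decisions, preferences, and open questions."
)


def update_summary(summary: str, new_turns: list[str],
                   complete: Callable[[str], str]) -> str:
    """Fold a batch of new turns into the running summary via one model call."""
    prompt = UPDATE_TEMPLATE.format(summary=summary or "(none yet)",
                                    new_turns="\n".join(new_turns))
    return complete(prompt)


def fake_complete(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return f"summary based on a {len(prompt)}-character prompt"


summary = ""
batches = (["user: use Postgres", "assistant: ok"], ["user: deploy on Fridays"])
for batch in batches:
    summary = update_summary(summary, batch, fake_complete)
print(summary)
```

Because each update starts from the previous summary, details can erode over many rounds; pairing this with pinned turns or retrieval is one way to protect facts that must survive verbatim.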

Strategy 3: Retrieval-Augmented Memory

For the longest sessions — multi-hour coding projects, persistent user assistants, customer histories spanning weeks — neither discarding nor summarizing old turns is sufficient. The solution is to store the full conversation history externally (in a database or vector store) and retrieve only the semantically relevant portions at inference time.

This applies the same retrieval pattern as RAG (covered in the next tutorial section) to conversation history itself.

Retrieval-Augmented Memory

Store all conversation history in a vector database. At each turn, embed the current query and retrieve only the most semantically similar past turns. Only those relevant snippets are injected into the context — keeping token usage low regardless of how many turns the session has accumulated.
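
The store-and-retrieve loop can be sketched without any external services. In this toy version a bag-of-words count stands in for a real embedding model and a Python list stands in for the vector database; both are assumptions made so the example is self-contained, and the class and function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class ConversationMemory:
    """In-memory stand-in for a vector store of past turns."""

    def __init__(self) -> None:
        self._turns: list[tuple[str, Counter]] = []

    def add(self, turn: str) -> None:
        self._turns.append((turn, embed(turn)))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Return the k stored turns most similar to the query."""
        q = embed(query)
        scored = sorted(self._turns, key=lambda item: cosine(q, item[1]),
                        reverse=True)
        return [turn for turn, _ in scored[:k]]


memory = ConversationMemory()
memory.add("We decided to use Postgres for the orders database")
memory.add("The logo should be blue")
memory.add("Deploys happen every Friday afternoon")
print(memory.retrieve("Which database did we pick for orders?", k=1))
# ['We decided to use Postgres for the orders database']
```

Word overlap only matches literal terms, so this toy would miss a paraphrase such as "data store"; that gap is what learned embeddings are meant to close.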

The hybrid pattern: In practice, production systems often combine all three strategies in layers:

  1. Recent turns verbatim — always keep the last 3–5 turns unchanged (the immediate conversational thread)
  2. Retrieval-augmented memory — semantically retrieve relevant older turns from a vector store
  3. Rolling summary — inject a concise rolling summary of the overall conversation as a fallback for anything retrieval misses
  4. Sliding window as the ceiling — never let total context exceed the token budget; prune if the combined layers overflow

This layered approach provides strong short-term coherence (verbatim recent turns), good long-term recall (retrieval), and a safety net (the summary) when retrieval misses something important.
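
The layering above can be sketched as a single assembly function with a hard token ceiling. The priority order used when space runs out (recent turns, then the summary, then retrieved snippets) is an assumption of this example, as are all the names; the word-count tokenizer is a placeholder for a real counter.

```python
from typing import Callable


def build_context(summary: str, retrieved: list[str], recent: list[str],
                  budget: int, count_tokens: Callable[[str], int]) -> list[str]:
    """Assemble layers in priority order, dropping what exceeds the budget.

    Priority: recent turns first, then the rolling summary, then retrieved
    snippets (assumed sorted most relevant first). The returned list is in
    prompt order: summary, retrieved snippets, recent turns.
    """
    remaining = budget
    kept_recent: list[str] = []
    for turn in reversed(recent):            # newest turn has top priority
        cost = count_tokens(turn)
        if cost > remaining:
            break
        kept_recent.insert(0, turn)
        remaining -= cost

    kept_summary: list[str] = []
    if summary and count_tokens(summary) <= remaining:
        kept_summary = [summary]
        remaining -= count_tokens(summary)

    kept_retrieved: list[str] = []
    for snippet in retrieved:
        cost = count_tokens(snippet)
        if cost <= remaining:
            kept_retrieved.append(snippet)
            remaining -= cost

    return kept_summary + kept_retrieved + kept_recent


def count_words(text: str) -> int:
    return len(text.split())   # placeholder counter for the demo


context = build_context(
    summary="Summary: building an offline-first notes app",
    retrieved=["Earlier: sync uses CRDTs",
               "Earlier: the team prefers SQLite for local storage"],
    recent=["user: add search", "assistant: which fields?", "user: title and body"],
    budget=22,
    count_tokens=count_words,
)
print(context)
```

With a budget of 22 words, all three recent turns, the summary, and the first retrieved snippet fit; the longer second snippet is dropped because it would exceed the ceiling.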

Choosing a Strategy

| Strategy | Memory Fidelity | Infrastructure | Token Usage | Best For |
|---|---|---|---|---|
| Sliding Window | Low: old turns permanently lost | None | Predictably bounded | Short-horizon tasks, real-time Q&A, cost-sensitive high-volume apps |
| Conversation Summarization | Medium: semantic meaning preserved, exact detail lost | Extra LLM call for summarization | Low, grows slowly | Long task sessions, design discussions, onboarding flows |
| Retrieval-Augmented Memory | High: all turns stored, relevant ones retrieved | Vector database + embedding model | Low and stable (query-driven) | Persistent assistants, multi-session history, customer support |
| Hybrid (all three) | Highest: combines strengths of each | Full stack | Precisely controlled | Production AI applications requiring robust long-term coherence |

Context in Your AI Coding Workflow

Context management is not just an infrastructure concern for the AI applications you build — it directly affects how you work with AI coding agents like Claude Code, Cursor, or GitHub Copilot. Every AI coding agent operates under the same constraints: a finite context window, a "lost in the middle" attention pattern, and degraded performance as the session grows.

Context Management in AI Coding Sessions

An AI coding agent's effectiveness is directly tied to how well the context window is used. The same principles that govern production AI systems — finite budget, position effects, and strategic eviction — apply to every coding session. Engineering the context for an agent session is the same discipline as engineering it for a production system.

A note on CLAUDE.md as a context strategy: The CLAUDE.md pattern is, at its core, a context management technique. By encoding architectural decisions, naming conventions, and non-functional requirements in a persistent instruction file rather than in conversation history, you ensure that critical constraints are always in the context window at zero history cost: they are part of the system prompt, not the conversation. The file still consumes tokens on every call, so it pays to keep it concise. Well-written instruction files are among the highest-leverage investments in a production agentic coding workflow.

Summary

| Concept | Key Point |
|---|---|
| The context window | The model's working memory: finite and non-expandable at runtime. Everything outside the window is invisible to the model |
| Token budget competition | System prompt, conversation history, retrieved documents, and the current request all compete for the same fixed budget |
| Quadratic attention cost | Doubling tokens roughly quadruples attention compute; FlashAttention improves memory efficiency but compute cost remains quadratic, so latency and cost still scale with context length |
| Hard overflow | Sending more tokens than the model accepts returns an explicit error; count tokens before sending and apply a strategy proactively |
| Soft degradation | Performance degrades before the hard limit; the "lost in the middle" effect means models attend poorly to content in the middle of a long context |
| Effective window | Roughly 50–65% of the advertised limit is reliably useful; position matters as much as presence |
| Sliding window | Keep last N turns, discard the rest: zero infrastructure, predictable cost, loses long-term memory permanently |
| Conversation summarization | Compress old history into a compact summary: preserves semantic meaning, requires an extra LLM call, lossy for exact details |
| Retrieval-augmented memory | Store all history in a vector store, retrieve relevant turns at inference time: highest fidelity, highest infrastructure cost |
| Hybrid approach | Combine all three: recent turns verbatim + retrieved relevant history + rolling summary + sliding window as a ceiling |
| AI coding workflow | Use fresh sessions per task, instruction files for persistent constraints, specific file references, and requirements stated at the start and end of each turn |

The engineers who build reliable AI systems do not simply load more context and hope the model figures it out. They design the context window deliberately — the same way they design a database schema or a cache eviction policy — thinking carefully about what goes in, what gets evicted, what gets compressed, and what gets retrieved on demand.
