Your Context Window is Leaking Money: A Practical Guide to Input Token Management

My previous piece on context engineering made the case that what you load into the model matters more than how you ask. The response was clear: people got the argument, but wanted the specifics. How do you actually manage what goes into the context window? Where are the tokens going? What do you cut, and what do you keep?

This is the practical follow-up. Six techniques, grounded in production data, that determine whether your system is spending tokens on reasoning or burning them on noise.

The problem is bigger than you think

Most development teams squander 40-60% of their token budgets on suboptimal implementations. That is not a typo. In practice, the waste comes from three places: bloated system prompts that include every tool definition on every request, retrieval pipelines that dump entire documents when a paragraph would suffice, and conversation histories that replay the full transcript when only the last few turns matter.

The cost is not just financial. Chroma’s 2025 research tested 18 frontier models — GPT-4.1, Claude 4, Gemini 2.5, Qwen3, among others — and found that every single one performed worse as input length increased. This is not a near-capacity problem. Degradation begins well before you hit the limit. A model with a 1M-token window still exhibits measurable accuracy loss at 50K tokens.

The researchers coined the term “context rot” for this phenomenon: the more you load into the window, the less reliably the model attends to any individual piece. Performance follows a U-shaped curve — high accuracy for information at the start and end of the context, but 30%+ lower accuracy for anything buried in the middle.

The implication is direct. Stuffing more into the context window does not just cost more money. It makes the system worse.

For the rest of us: what are input tokens?

Every time you send something to an AI model, the text gets broken into small pieces called tokens — roughly one token per word, sometimes less. The “context window” is the total number of tokens the model can hold at once: your instructions, any documents or data you include, the conversation history, and the model’s own response.

Input tokens are everything you send in. Output tokens are what the model sends back. You pay for both, but input tokens are where most of the money goes — and most of the waste hides.

Think of it like RAM in a computer. You have a fixed amount. If you fill it with unnecessary programs, the computer slows down. Same thing here: fill the context window with irrelevant information, and the model gets slower, more expensive, and less accurate.

Technique 1: Token budgeting

The first step is treating your context window as an explicit budget rather than an open container.

Fig. 01 · Context window allocation

Four zones, four levers,
most of the savings on the table.

01
Stable prefix 15–25%
- System prompt
- Tool schemas
- Safety guardrails
Cache the prefix 90% read discount
02
Retrieved context 30–45%
- RAG results
- Search snippets
- Tool outputs
Compress & rank 50–70% reducible
03
Conversation + query 20–35%
- Recent turns
- Summarised history
- Current request
Summarise old turns 80%+ compressible
04
Reserved for output 10–20%

Leave headroom. A response that runs out of tokens mid-thought costs more to retry than to budget for in the first place.

40–60%

tokens wasted in typical systems

30%+

accuracy drop from context rot

70–80%

cost reduction with optimisation

Most context budgets are built once and then ignored. The savings sit at the boundary of every zone, untouched.

Allocate specific percentages to each component: 15-25% for the system prompt and tool schemas (the stable prefix), 30-45% for retrieved context (RAG results, search snippets, tool outputs), 20-35% for conversation history and the current query, and 10-20% reserved for the model’s response.

If one component grows beyond its allocation, it gets compressed or truncated before the request goes out. Without this discipline, conversation history alone can consume 5,000-10,000 tokens over a 20-turn session when only 500-1,000 tokens of recent context would typically suffice.

This is bookkeeping, not engineering. But it is the foundation everything else builds on.

Technique 2: Prompt caching

If you are making multiple requests with the same system prompt, tool definitions, and base instructions, you are reprocessing identical tokens on every call. Prompt caching eliminates this.

The mechanism is straightforward. The model generates a key-value (KV) cache — a computed representation of the tokens it has already processed. With caching enabled, that computed state is stored and reused rather than recalculated from scratch on the next request.

The economics are significant. Anthropic’s prompt caching charges 0.1x the base input price for cached reads — a 90% discount. The catch: caches have a 5-minute TTL by default (Anthropic quietly shortened this from 60 minutes in early 2026), so your architecture needs to be designed around stable prefixes that get reused within that window.

The practical rule: structure your prompts so that everything stable comes first (system instructions, tool definitions, persona) and everything that changes comes last (conversation history, current query). This maximises cache hit rates because the expensive, repeated sections are always identical.

Technique 3: Dynamic tool selection

This is the one most teams miss entirely.

A typical agent system defines 30-50 tools with their full JSON schemas. Anthropic’s own data shows 50 MCP tools consume roughly 72K tokens just for the tool definitions alone. That is context window space that cannot be used for reasoning, memory, or output.

The fix: load tools dynamically based on the current request, not statically on every call. A tool search that returns 3-5 relevant tools at roughly 3K tokens drops context usage from 77K to under 9K — an 85% reduction from a single architectural change.

Recent research on Instruction-Tool Retrieval (ITR) takes this further. By treating both system instructions and tool definitions as retrievable resources — indexed and fetched only when relevant to the current step — teams have achieved up to 95% reduction in per-step context tokens while simultaneously improving tool routing accuracy by 32%.

The threshold matters too. Once you expose more than roughly 30 tools simultaneously, tool descriptions begin overlapping semantically and the model struggles to select the right one. Fewer, more relevant tools in context consistently outperforms a large static toolkit.

Technique 4: Retrieval discipline

RAG pipelines are the second-largest source of token waste, after conversation history. The problem is not retrieval itself — it is the lack of discipline in what gets retrieved and how much of it reaches the model.

Teams routinely pass 4-8 full documents into a prompt when only a snippet or paragraph contains the relevant answer. Setting tighter caps — retrieving passages rather than pages, limiting to 3-5 chunks at 256-512 tokens each, re-ranking results by relevance before inclusion — can cut input tokens by more than half with no measurable loss in precision.

Chunking strategy matters more than chunk size. Recursive character splitting that follows natural text boundaries (paragraph breaks, then sentences, then words) consistently outperforms arbitrary fixed-size windows. Semantic chunking — grouping text by meaning rather than structure — takes this further but requires a secondary embedding model.

The positioning of retrieved context also matters. LLMs do not attend to all parts of the input equally. They perform best when critical information appears at the start or end of the context and measurably worse when it is buried in the middle. If you have three retrieved passages and one is clearly most relevant, place it first or last. Never sandwich it.

Technique 5: Conversation compression

A 20-turn conversation replays the entire history on every new request. By turn 20, most of those tokens are redundant — the model already incorporated their information into earlier responses.

The most effective approach is a sliding window with summarisation. Keep the last 3-5 turns verbatim (the model needs recent context for coherence) and replace everything older with a compressed summary of key facts, decisions, and open threads. This preserves conversational continuity while reducing history tokens by 80% or more.

For agent-based systems, dedicated memory stores are more efficient still. Rather than replaying conversation history, agents extract and persist relevant facts in structured storage — a database, a file system, a vector store — and retrieve only what the current task requires. Anthropic’s memory features let agents keep curated facts server-side, replacing tens of thousands of replayed-history tokens with targeted retrievals.

The compression does not need to be perfect. It needs to preserve the information the model actually needs for the next response. In practice, that is far less than the full transcript.

Technique 6: Prompt compression

For contexts that are already well-structured but simply too long, algorithmic compression offers a direct token reduction with minimal quality loss.

LLMLingua, developed by Microsoft Research, uses a small language model to identify which tokens carry the most information and removes the rest. The results are striking: up to 20x compression with only a 1.5-point performance drop on reasoning benchmarks. GPT-4 can recover all 9 reasoning steps from a chain-of-thought prompt that has been compressed to 5% of its original length.

In production, Factory tested three compression approaches on 36,000 messages from real Claude Code coding sessions — the only large-scale evaluation on actual agent workloads rather than academic benchmarks. All three methods achieved 98%+ compression ratios, meaning the models could work with dramatically compressed inputs without meaningful degradation.

This is not a replacement for the other techniques. It is a final pass for contexts that are already optimised but still too large. Compress retrieval results before inclusion. Compress older conversation summaries further. Compress tool output that is verbose by nature.

Putting it together

None of these techniques work in isolation. A well-optimised system applies them in layers:

Budget first. Know where your tokens are going before you optimise anything.
Cache the stable prefix. System prompt, base instructions, persona. 90% savings on every request.
Load tools dynamically. 85% reduction in tool context. Better selection accuracy.
Discipline retrieval. Fewer, shorter, better-ranked passages. Position them at the edges.
Compress conversation. Sliding window plus summary. 80%+ reduction in history tokens.
Compress what remains. Algorithmic compression as a final pass on large contexts.

Teams that systematically apply these techniques report 70-80% cost reductions while maintaining or improving output quality. The improvement in quality is not a paradox — less noise in the context window means the model attends more effectively to what remains.

The context window is not an open container. It is an architecture decision. Manage it like one.

References

Chroma Research, “Context Rot: How Increasing Input Tokens Impacts LLM Performance” (2025)
Microsoft Research, “LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models” (2023)
Factory, Prompt Compression Evaluation on 36,000 Claude Code Messages (2026)
Anthropic, Prompt Caching Documentation (2026)
Obvious Works, “Token Optimization 2026: Saving Up to 80% LLM Costs”
Redis, “LLM Token Optimization: Cut Costs & Latency in 2026”
Lunar.dev, “Dynamic Tool Selection for AI Agents” (2026)
arxiv, “Dynamic System Instructions and Tool Exposure for Efficient Agentic LLMs” (2026)