A visual guide to context windows, phantom tokens, and the overhead you're paying every turn
Every AI model you've ever used (Claude, GPT, Gemini, all of them) doesn't read words. It reads tokens. A token is a chunk of text, roughly 3-4 characters. Some short words are a single token. Longer words get broken into pieces. Code, punctuation, even whitespace: it all gets tokenized.
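You can get a usable estimate without touching a real tokenizer. This is a rough heuristic, not any provider's actual tokenization (each model family uses its own BPE vocabulary), but the ~4-characters-per-token rule is close enough for auditing:

```python
# Back-of-the-envelope token estimate: ~4 characters per token for
# English prose. Real tokenizers vary per model, so treat this as a
# rough audit tool, not an exact count.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

config = "Always run the test suite before committing.\n" * 50
print(estimate_tokens(config))  # prints 562 for this 2,250-character string
```

For exact counts, use the tokenizer that matches your model; the heuristic just gets you in the right order of magnitude.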
The model has a fixed token budget for each conversation. Everything it reads and everything it writes counts against that budget. This is the context window.
Different models have different budgets. Some handle a couple hundred thousand tokens, others over a million. But the math is always the same: everything the model sees in your conversation eats into that number.
Every time you send a message to an AI coding assistant, you're not just sending a message. You're triggering a full assembly. The tool takes your system instructions, your project config, your memory files, the entire conversation history, and your new message, and packs all of it into one massive prompt. Every single turn.
The diagram above is the simple version. Under the hood, a 200K context window at turn 25 looks like this:
Tip: In Claude Code, type /context to see your real-time context window breakdown.
The first five segments are fixed overhead that reloads at full size every turn. The conversation history grows with each exchange. Your message is variable. That distinction is everything. It's why sessions eventually run out of room.
Most people assume the AI remembers your conversation. It doesn't.
Every time you send a message, the entire prompt gets rebuilt from scratch. All those layers from the previous section? Assembled fresh and sent to the model like it's seeing everything for the first time.
That's what "stateless" means. No persistent memory between turns. The only reason it feels continuous is that the full conversation gets replayed every single time.
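A minimal sketch makes the statelessness concrete. The names here are illustrative, not any real tool's internals (real assemblies also interleave assistant replies and tool results), but the shape is the same: fixed overhead plus the full replayed history, rebuilt from scratch every turn:

```python
SYSTEM = "You are a coding assistant."       # reloaded every turn
CONFIG = "Project rules: run tests first."   # reloaded every turn
history = []                                 # the only part that grows

def build_prompt(user_message: str) -> str:
    history.append(("user", user_message))
    # Fixed overhead + the entire conversation so far, assembled fresh:
    lines = [SYSTEM, CONFIG] + [f"{role}: {text}" for role, text in history]
    return "\n".join(lines)

p1 = build_prompt("Fix the login bug")
p2 = build_prompt("Now add a test")
# Turn 2's prompt contains turn 1 verbatim. Nothing was "remembered";
# the continuity is just replay.
assert p2.startswith(p1)
```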
The conversation history is the only thing that actually changes between turns. Your config files, memory files, and system instructions reload at full size every single time. Not loaded once. Reloaded every turn.
So now you can see why conversations eventually hit a wall. Your config files take up a fixed chunk of space every turn. Your conversation history grows on top of that. Eventually, they add up to more than the context window can hold.
In practice, with 5,000 tokens of fixed config and a 200K window:
| Turn | Fixed Overhead | History | Total Used | Remaining |
|---|---|---|---|---|
| 1 | 5,000 | 500 | 5,500 | 194,500 |
| 10 | 5,000 | 15,000 | 20,000 | 180,000 |
| 25 | 5,000 | 80,000 | 85,000 | 115,000 |
| 40 | 5,000 | 160,000 | 165,000 | 35,000 |
| 45 | 5,000 | 190,000 | 195,000 | 5,000 |
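The table's arithmetic is simple enough to sketch. The history figures come from the table above; in a real session, growth depends on how verbose each turn is:

```python
WINDOW = 200_000   # context window from the example above
FIXED = 5_000      # fixed config overhead, reloaded every turn

def remaining(history_tokens: int) -> int:
    """Tokens left after fixed overhead plus conversation history."""
    return WINDOW - (FIXED + history_tokens)

for turn, history in [(1, 500), (10, 15_000), (25, 80_000),
                      (40, 160_000), (45, 190_000)]:
    print(f"turn {turn:>2}: {remaining(history):>7,} tokens remaining")
```

The fixed term never shrinks; only the history term is negotiable, which is exactly what auto-compacting exploits.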
Most AI coding tools have a safety valve called auto-compacting. When the conversation gets too long, the tool summarizes older turns into a shorter version to buy more runway.
The catch: config files are exempt from compacting. System instructions, CLAUDE.md, memory files: they all reload at full size every turn, before and after compacting. Only the conversation history gets compressed. The fixed overhead? Always there.
This is why configuration file size matters more than you'd expect. 5,000 tokens of config, reloaded 25 times, is 125,000 tokens of cumulative overhead. At 40 turns, that's 200,000. The bigger your fixed overhead, the fewer turns you get before the wall.
If you've heard about "prompt caching" from Anthropic or OpenAI, it sounds like it should help with this. It doesn't touch capacity. Same window, same fill rate.
Prompt caching is a cost and speed optimization. When the same prefix (system instructions, config files) appears in consecutive requests, the provider can skip re-processing those tokens. You get faster responses and a lower per-token price. But the tokens are still there. The full prompt still loads. The context window still fills up at the same rate.
Your system instructions, config files, and memory files reload every turn. With caching, the provider recognizes that prefix hasn't changed and serves it from stored KV matrices instead of recomputing from scratch. You still use the same window space, but you pay a fraction of the cost. Honestly, this is a huge part of what makes tools like Claude Code viable; without caching, sending thousands of config tokens every turn would be cost-prohibitive. For a deeper technical look at how this works, ngrok's breakdown covers the mechanics well.
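The cost side is easy to illustrate. The prices below are made-up placeholders, not any provider's real rates, and real pricing also includes a cache-write premium on the first request; the point is that cached prefix tokens cost a fraction of normal input tokens while occupying exactly the same window space:

```python
# Illustrative cost math only -- both numbers below are hypothetical.
PRICE_INPUT = 3.00 / 1_000_000   # assumed $/input token, uncached
CACHE_DISCOUNT = 0.10            # assumed: cached prefix reads at 10%

def turn_cost(prefix_tokens: int, new_tokens: int, cached: bool) -> float:
    prefix_rate = PRICE_INPUT * (CACHE_DISCOUNT if cached else 1.0)
    return prefix_tokens * prefix_rate + new_tokens * PRICE_INPUT

# 5,000-token config prefix over 40 turns, ~500 new tokens per turn:
no_cache = sum(turn_cost(5_000, 500, cached=False) for _ in range(40))
with_cache = sum(turn_cost(5_000, 500, cached=True) for _ in range(40))
print(f"without caching: ${no_cache:.2f}   with caching: ${with_cache:.2f}")
```

Note what the discount does not change: `prefix_tokens` still counts against the 200K window on every turn.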
Caching makes repeated overhead dramatically cheaper, but it doesn't give you more room. Both things matter. This guide focuses on the capacity side because that's the part you can actually control.
Phantom tokens are tokens in your config that load every turn but do nothing useful. Duplicates. Redundant content. Documentation that could be a one-line pointer to a reference file instead of a full copy.
I audited my own Claude Code setup and found about 6,850 tokens of config loading per turn. After cleanup, it was 5,250. A 23% reduction, zero information lost.
The phantom tokens fell into three categories: duplicate file loads, redundant content repeated across files, and documentation copied into config that already existed in source files.
Not all tool integrations cost the same tokens. MCP servers and skills do similar things, but the way they load into your context window is completely different.
MCP servers register every tool definition up front. The full schema for every tool loads into your context on every single turn, whether you use those tools or not. If you've got a server with 20 tools and you only use 2 of them regularly, the other 18 are phantom overhead.
Skills work differently. They register with just a short description (a few tokens). The full prompt only loads when you actually invoke the skill. Same capability, fraction of the per-turn cost.
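A toy comparison shows how fast the difference compounds. Every number here is an assumption for illustration (schema and stub sizes vary widely in practice), not a measurement:

```python
# Assumed sizes, for illustration only:
TOOLS = 20                 # tools registered by the MCP server
SCHEMA_TOKENS = 600        # assumed full definition per tool
SKILL_STUB_TOKENS = 20     # assumed short description per skill
INVOCATIONS_PER_TURN = 2   # you actually use 2 of the 20

# MCP: every schema loads every turn, used or not.
mcp_per_turn = TOOLS * SCHEMA_TOKENS

# Skills: cheap stubs always load; full prompts only when invoked.
skills_per_turn = TOOLS * SKILL_STUB_TOKENS + INVOCATIONS_PER_TURN * SCHEMA_TOKENS

print(mcp_per_turn, skills_per_turn)  # prints 12000 1600
```

Under these assumptions, the same twenty capabilities cost 12,000 tokens per turn one way and 1,600 the other.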
This is worth auditing. Check your /context breakdown; if MCP tool definitions are eating 20%+ of your window, ask yourself how many of those tools you actually use in a typical session. The ones you don't? That's pure phantom overhead, reloading every turn for nothing.
If you're using any AI coding tool with custom configuration, this is how you find and cut phantom tokens.
1. **Count.** Paste your config files into a token counter. Know the baseline cost before you start optimizing.
2. **Deduplicate.** Look for files loading twice. On Windows, path casing differences can cause silent duplication.
3. **Point, don't copy.** Identify content copied into config that already exists in source files. Replace it with one-line pointers.
4. **Verify.** Remove content, run your normal workflow, and check: does output quality actually change?
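The counting step can be scripted. This sketch assumes your config lives in plain-text files you can point it at (the paths below are examples, not a standard layout), and it uses the rough 4-chars-per-token estimate; swap in a real tokenizer for exact counts:

```python
from pathlib import Path

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    return max(1, len(text) // 4)

def audit(paths: list[str]) -> int:
    """Print an estimated per-turn token cost for each config file."""
    total = 0
    for p in sorted(paths):
        f = Path(p)
        if not f.is_file():
            continue
        n = estimate_tokens(f.read_text(encoding="utf-8", errors="replace"))
        total += n
        print(f"{n:>7,}  {f}")
    print(f"{total:>7,}  estimated per-turn overhead")
    return total

audit(["CLAUDE.md", ".claude/settings.json"])  # example paths
```

Run it before and after cleanup; the delta is your phantom-token count.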
- **Capacity:** Fewer fixed tokens means more room for conversation before hitting the wall.
- **Cost:** Input tokens are billed. Fewer tokens reloading per turn means real savings at scale.
- **Speed:** Less to process means quicker time-to-first-token on every turn.
- **Clarity:** A leaner config is easier to maintain and less likely to confuse the model with contradictory instructions.
Where to look, by tool:
All that config you've set up (your CLAUDE.md, memory files, system instructions) reloads from scratch on every single turn. That's the overhead. So the real question is simple: is every token that reloads actually earning its keep?
Duplicate files, verbose reference material that could be a one-line pointer, documentation that just restates what's already in the source code. Those are tokens you're paying for on every message without getting anything back.
It's not dramatic. It's just housekeeping that adds up.
On the cost side, prompt caching has that covered. The providers have made repeated overhead cheap to process. Your job is making sure what's repeating is worth repeating.