~8 min read

The Hidden Token Tax in Your LLM Config

A visual guide to context windows, phantom tokens, and the overhead you're paying every turn


What Are Tokens?

Every AI model you've ever used (Claude, GPT, Gemini, all of them) doesn't read words. It reads tokens. A token is a chunk of text, roughly 3-4 characters. Some short words are a single token. Longer words get broken into pieces. Code, punctuation, even whitespace: it all gets tokenized.

The model has a fixed token budget for each conversation. Everything it reads and everything it writes counts against that budget. This is the context window.

A word ~1 token
A line of code ~10 tokens
A full file ~500-2,000 tokens
A screenshot ~1,000+ tokens

Different models have different budgets. Some handle a couple hundred thousand tokens, others over a million. But the math is always the same: everything the model sees in your conversation eats into that number.

See for yourself: Paste any text into OpenAI's Tokenizer to see exactly how it gets split into tokens. For a deeper look at how models assign probabilities to each token, try the Logprobs Visualizer. Nothing makes token budgets feel real like watching your own config file get chopped into pieces.
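If you just want a ballpark without leaving your editor, the ~4-characters-per-token rule of thumb is easy to script. A minimal Python sketch (this is the heuristic, not a real BPE tokenizer, so treat the numbers as rough estimates):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose.
    Real tokenizers vary, so this is a ballpark only."""
    return max(1, len(text) // 4)

# A hypothetical config file: one instruction repeated 50 times
config = "Always run the test suite before committing.\n" * 50
print(estimate_tokens(config))  # → 562
```

For exact counts, paste the same text into a real tokenizer; the heuristic tends to undercount code and overcount plain prose.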

What Feeds the Context Window?

Every time you send a message to an AI coding assistant, you're not just sending a message. You're triggering a full assembly. The tool takes your system instructions, your project config, your memory files, the entire conversation history, and your new message, and packs all of it into one massive prompt. Every single turn.

System Instructions (fixed)
Project Config (fixed)
Memory / Context (fixed)
Conversation History (grows)
Your Message (variable)

The diagram above is the simple version. Under the hood, a 200K context window at turn 25 looks like this:

Built-in tools: 18K
MCP tool definitions: 48K
System prompt: 8K
CLAUDE.md config: 6K
Memory files: 3K
Conversation history: 70K
Available space: 47K

Tip: In Claude Code, type /context to see your real-time context window breakdown.

The first five segments are fixed overhead that reloads at full size every turn. The conversation history grows with each exchange. Your message is variable. That distinction is everything. It's why sessions eventually run out of room.
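That assembly step can be sketched in a few lines of Python. The layer names and the join format here are illustrative, not any tool's actual wire format; the point is that the fixed parts are included at full size on every call, while only the history changes:

```python
def build_prompt(system: str, config: str, memory: str,
                 history: list[str], message: str) -> str:
    """Rebuild the full prompt from scratch, as the tool does every turn.
    The fixed layers (system, config, memory) are sent at full size each
    time; only `history` differs between turns."""
    return "\n".join([system, config, memory, *history, message])

history: list[str] = []
for turn in range(3):
    prompt = build_prompt("SYSTEM", "CLAUDE.md contents", "MEMORY",
                          history, f"user message {turn}")
    history.append(f"user message {turn}")
    history.append(f"assistant reply {turn}")
    print(len(prompt))  # grows each turn, even though the fixed parts never change
```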

The Stateless Reality

Most people assume the AI remembers your conversation. It doesn't.

Every time you send a message, the entire prompt gets rebuilt from scratch. All those layers from the previous section? Assembled fresh and sent to the model like it's seeing everything for the first time.

That's what "stateless" means. No persistent memory between turns. The only reason it feels continuous is that the full conversation gets replayed every single time.

1. You send a message (text, files, images)
2. The full prompt is assembled from scratch
3. The AI processes everything (reads the full prompt)
4. A response is generated (and added to history)
5. History grows, and the cycle starts over

This repeats every turn.

The conversation history is the only thing that actually changes between turns. Your config files, memory files, and system instructions reload at full size every single time. Not loaded once. Reloaded every turn.

Why Sessions Die Early

So now you can see why conversations eventually hit a wall. Your config files take up a fixed chunk of space every turn. Your conversation history grows on top of that. Eventually, they add up to more than the context window can hold.

In practice, with 5,000 tokens of fixed config and a 200K window:

Turn | Fixed Overhead | History | Total Used | Remaining
1    | 5,000          | 500     | 5,500      | 194,500
10   | 5,000          | 15,000  | 20,000     | 180,000
25   | 5,000          | 80,000  | 85,000     | 115,000
40   | 5,000          | 160,000 | 165,000    | 35,000
45   | 5,000          | 190,000 | 195,000    | 5,000
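You can run the same arithmetic for your own setup. A quick Python sketch, assuming a 200K window, 5,000 tokens of fixed overhead, and a constant ~4,000 tokens of history growth per exchange (real growth varies with message and response size, so treat the turn count as an estimate):

```python
WINDOW = 200_000       # context window size
FIXED = 5_000          # config overhead, reloaded in full every turn
PER_TURN = 4_000       # assumed average history growth per exchange

def remaining(turn: int) -> int:
    """Tokens left after `turn` exchanges: the window minus the fixed
    overhead and the accumulated conversation history."""
    return WINDOW - FIXED - PER_TURN * turn

turns_until_wall = (WINDOW - FIXED) // PER_TURN
print(turns_until_wall)  # → 48 under these assumptions
```

Halving the fixed overhead barely moves this number, but halving per-turn history growth (shorter pastes, fewer huge files) nearly doubles it, which is why compacting targets history first.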

Most AI coding tools have a safety valve called auto-compacting. When the conversation gets too long, the tool summarizes older turns into a shorter version to buy more runway.

The catch: config files are exempt from compacting. System instructions, CLAUDE.md, and memory files all reload at full size every turn, before and after compacting. Only the conversation history gets compressed. The fixed overhead? Always there.

This is why configuration file size matters more than you'd expect. 5,000 tokens of config, reloaded 25 times, is 125,000 tokens of cumulative overhead. At 40 turns, that's 200,000. The bigger your fixed overhead, the fewer turns you get before the wall.

What Caching Actually Does

If you've heard about "prompt caching" from Anthropic or OpenAI, it sounds like it should help with this. It doesn't touch capacity. Same window, same fill rate.

Prompt caching is a cost and speed optimization. When the same prefix (system instructions, config files) appears in consecutive requests, the provider can skip re-processing those tokens. You get faster responses and a lower per-token price. But the tokens are still there. The full prompt still loads. The context window still fills up at the same rate.

What Caching Helps

Cost: cached tokens are charged at a reduced rate
Speed: faster time-to-first-token
Latency: less re-processing per turn

What Caching Does Not Help

Window: the same tokens still consume capacity
History: the conversation still grows each turn
Overhead: config files still reload at full size

Bottom line: Caching makes your overhead cheaper and faster to process. It does not make it smaller. Your context window capacity is unchanged.

Input token cost: 10x cheaper
Latency reduction: up to 85%
Cache duration: 5-10 minutes

Your system instructions, config files, and memory files reload every turn. With caching, the provider recognizes that prefix hasn't changed and serves it from stored KV matrices instead of recomputing from scratch. You still use the same window space, but you pay a fraction of the cost. Honestly, this is a huge part of what makes tools like Claude Code viable; without caching, sending thousands of config tokens every turn would be cost-prohibitive. For a deeper technical look at how this works, ngrok's breakdown covers the mechanics well.
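A toy illustration of the prefix-matching idea: hash the fixed prefix and skip the "expensive" work when the same prefix shows up again. Real KV caching happens inside the model's attention layers, not in application code; this Python sketch only mimics the hit/miss behavior, and the prompt is still full length either way:

```python
import hashlib

kv_cache: dict[str, str] = {}  # prefix hash -> stand-in for stored KV state

def process(prompt: str, prefix_len: int) -> tuple[str, bool]:
    """Toy prompt-cache: reuse stored state when the fixed prefix repeats.
    Caching saves compute and cost, not context window space."""
    prefix = prompt[:prefix_len]
    key = hashlib.sha256(prefix.encode()).hexdigest()
    hit = key in kv_cache
    if not hit:
        kv_cache[key] = f"state:{len(prefix)}"  # "recompute" once, then store
    return kv_cache[key], hit

_, hit1 = process("SYSTEM+CONFIG|" + "turn 1", 14)
_, hit2 = process("SYSTEM+CONFIG|" + "turn 2", 14)
print(hit1, hit2)  # → False True  (the second turn reuses the cached prefix)
```

Note what never changes here: both prompts are the same length. The cache changes how much work turn two costs, not how much room it takes.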

Caching makes repeated overhead dramatically cheaper, but it doesn't give you more room. Both things matter. This guide focuses on the capacity side because that's the part you can actually control.

Phantom Token Consumption

Phantom tokens are tokens in your config that load every turn but do nothing useful. Duplicates. Redundant content. Documentation that could be a one-line pointer to a reference file instead of a full copy.

I audited my own Claude Code setup and found about 6,850 tokens of config loading per turn. After cleanup, it was 5,250. A 23% reduction, zero information lost.

Before Audit (6,850 tokens/turn)

System: 800
Duplicate: 800
Config: 3,700
Memory: 1,550

After Audit (5,250 tokens/turn)

System: 800
Config: 2,950
Memory: 1,500

1,600 tokens saved per turn, a 23% reduction

The phantom tokens fell into three categories:

Duplicates (800 tokens wasted): a Windows path quirk loaded the same file twice.
Inline references (~500 tokens wasted): replaced copied content with one-line pointers.
Over-documentation (~300 tokens wasted): removed what the source files already say.

MCP Servers vs Skills

Not all tool integrations cost the same tokens. MCP servers and skills do similar things, but the way they load into your context window is completely different.

MCP servers register every tool definition up front. The full schema for every tool loads into your context on every single turn, whether you use those tools or not. If you've got a server with 20 tools and you only use 2 of them regularly, the other 18 are phantom overhead.

Skills work differently. They register with just a short description (a few tokens). The full prompt only loads when you actually invoke the skill. Same capability, fraction of the per-turn cost.

MCP Servers

All tool definitions load every turn
Unused tools still consume context space
Can be 20%+ of your total context window

Skills

A short description registers (a few tokens)
The full prompt loads only when invoked
Near-zero idle cost per turn

This is worth auditing. Check your /context breakdown; if MCP tool definitions are eating 20%+ of your window, ask yourself how many of those tools you actually use in a typical session. The ones you don't? That's pure phantom overhead, reloading every turn for nothing.

The takeaway: MCP servers are an all-or-nothing context cost. If a skill can do the same job (or a tool supports deferred loading), you'll cut real per-turn overhead.
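The cost difference is easy to put numbers on. A Python sketch with assumed sizes (~400 tokens per MCP tool schema, ~15 tokens per skill description; your actual schemas will differ):

```python
def mcp_idle_cost(tool_schemas: dict[str, int]) -> int:
    """MCP-style: every tool's full schema loads every turn, used or not."""
    return sum(tool_schemas.values())

def skills_idle_cost(descriptions: dict[str, int]) -> int:
    """Skill-style: only short descriptions load up front; the full
    prompt loads only when a skill is actually invoked."""
    return sum(descriptions.values())

# Hypothetical server: 20 tools with ~400-token schemas each, versus
# the same 20 capabilities exposed as skills at ~15 tokens each.
schemas = {f"tool_{i}": 400 for i in range(20)}
descs = {f"skill_{i}": 15 for i in range(20)}
print(mcp_idle_cost(schemas), skills_idle_cost(descs))  # → 8000 300
```

Under these assumed numbers, the idle cost drops by more than 25x, and that saving is paid back on every single turn.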

How to Audit Your Setup

If you're using any AI coding tool with custom configuration, this is how you find and cut phantom tokens.

Audit Steps

1. Count your preloaded tokens. Paste your config files into a token counter so you know the baseline cost before you start optimizing.

2. Check for duplicates. Look for files loading twice. On Windows, path casing differences can cause silent duplication.

3. Find inline references. Identify content copied into config that already exists in source files, and replace it with one-line pointers.

4. Test removal impact. Remove content, run your normal workflow, and check whether output quality actually changes.
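The counting step can be partly automated. A Python sketch that walks a directory for common config file names and estimates their per-turn cost with the ~4 chars/token heuristic (the file names and the heuristic are assumptions; swap in a real tokenizer for exact counts):

```python
from pathlib import Path

def audit_config(root: str,
                 names: tuple[str, ...] = ("CLAUDE.md", "MEMORY.md")) -> dict[str, int]:
    """Estimate the per-turn token cost of each preloaded config file,
    using the rough ~4 chars/token heuristic."""
    costs: dict[str, int] = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.name in names:
            costs[str(path)] = len(path.read_text(errors="ignore")) // 4
    return costs

# Print the biggest offenders first to establish your baseline:
# for f, t in sorted(audit_config(".").items(), key=lambda kv: -kv[1]):
#     print(f"{t:>6}  {f}")
```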

What You Gain

More Turns Per Session

Fewer fixed tokens means more room for conversation before hitting the wall.

Lower Cost Per Message

Input tokens are billed. Fewer tokens reloading per turn means real savings at scale.

Faster Responses

Less to process means quicker time-to-first-token on every turn.

Cleaner Config

A leaner config is easier to maintain and less likely to confuse the model with contradictory instructions.

Where to look, by tool:

Claude Code
CLAUDE.md files (global + project)
MEMORY.md + topic memory files
MCP server definitions
Cursor
.cursorrules file
Context files and @-references
Project-level instructions
GitHub Copilot
.github/copilot-instructions.md
Custom instructions in settings
Workspace context files
Any AI Tool
System prompt / custom instructions
Attached reference documents
Pinned context or files
The golden rule: If content exists in a source file the AI can read on demand, don't duplicate it in configuration that loads every turn. Point to it instead.

The Takeaway

All that config you've set up (your CLAUDE.md, memory files, system instructions) reloads from scratch on every single turn. That's the overhead. So the real question is simple: is every token that reloads actually earning its keep?

Duplicate files, verbose reference material that could be a one-line pointer, documentation that just restates what's already in the source code. Those are tokens you're paying for on every message without getting anything back.

It's not dramatic. It's just housekeeping that adds up.

On the cost side, prompt caching has that covered. The providers have made repeated overhead cheap to process. Your job is making sure what's repeating is worth repeating.
