~8 min read

The Hidden Token Tax in Your LLM Config

A visual guide to context windows, phantom tokens, and the overhead you're paying every turn


What Are Tokens?

Every AI model you've ever used (Claude, GPT, Gemini, all of them) doesn't read words. It reads tokens. A token is a chunk of text, roughly 3-4 characters. Some short words are a single token. Longer words get broken into pieces. Code, punctuation, even whitespace: it all gets tokenized.

The model has a fixed token budget for each conversation. Everything it reads and everything it writes counts against that budget. This is the context window.

A word ~1 token
A line of code ~10 tokens
A full file ~500-2,000 tokens
A screenshot ~1,000+ tokens

Different models have different budgets. Some handle a couple hundred thousand tokens, others over a million. But the math is always the same: everything the model sees in your conversation eats into that number.

See for yourself: Paste any text into OpenAI's Tokenizer to see exactly how it gets split into tokens. For a deeper look at how models assign probabilities to each token, try the Logprobs Visualizer. Nothing makes token budgets feel real like watching your own config file get chopped into pieces.
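If you just want a ballpark without leaving your editor, the ~4-characters-per-token rule of thumb is easy to script. A minimal Python sketch (this is the heuristic, not a real BPE tokenizer, so treat the numbers as rough estimates):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose.
    Real tokenizers vary, so this is a ballpark only."""
    return max(1, len(text) // 4)

# A hypothetical config file: one instruction repeated 50 times
config = "Always run the test suite before committing.\n" * 50
print(estimate_tokens(config))  # → 562
```

For exact counts, paste the same text into a real tokenizer; the heuristic tends to undercount code and overcount plain prose.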

What Feeds the Context Window?

Every time you send a message to an AI coding assistant, you're not just sending a message. You're triggering a full assembly. The tool takes your system instructions, your project config, your memory files, the entire conversation history, and your new message, and packs all of it into one massive prompt. Every single turn.

System Instructions (fixed)
Project Config (fixed)
Memory / Context (fixed)
Conversation History (grows)
Your Message (variable)

The diagram above is the simple version. Under the hood, a 200K context window at turn 25 looks like this:

Built-in tools: 18K
MCP tool definitions: 48K
System prompt: 8K
CLAUDE.md config: 6K
Memory files: 3K
Conversation history: 70K
Available space: 47K

Tip: In Claude Code, type /context to see your real-time context window breakdown.

The first five segments are fixed overhead that reloads at full size every turn. The conversation history grows with each exchange. Your message is variable. That distinction is everything. It's why sessions eventually run out of room.
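That assembly step can be sketched in a few lines of Python. The layer names and the join format here are illustrative, not any tool's actual wire format; the point is that the fixed parts are included at full size on every call, while only the history changes:

```python
def build_prompt(system: str, config: str, memory: str,
                 history: list[str], message: str) -> str:
    """Rebuild the full prompt from scratch, as the tool does every turn.
    The fixed layers (system, config, memory) are sent at full size each
    time; only `history` differs between turns."""
    return "\n".join([system, config, memory, *history, message])

history: list[str] = []
for turn in range(3):
    prompt = build_prompt("SYSTEM", "CLAUDE.md contents", "MEMORY",
                          history, f"user message {turn}")
    history.append(f"user message {turn}")
    history.append(f"assistant reply {turn}")
    print(len(prompt))  # grows each turn, even though the fixed parts never change
```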

The Stateless Reality

Most people assume the AI remembers your conversation. It doesn't.

Every time you send a message, the entire prompt gets rebuilt from scratch. All those layers from the previous section? Assembled fresh and sent to the model like it's seeing everything for the first time.

That's what "stateless" means. No persistent memory between turns. The only reason it feels continuous is that the full conversation gets replayed every single time.

1. You send a message (text, files, images)
2. The full prompt is assembled from scratch
3. The AI processes everything (reads the full prompt)
4. A response is generated (and added to history)
5. History grows, and the cycle starts over

This repeats every turn.

The conversation history is the only thing that actually changes between turns. Your config files, memory files, and system instructions reload at full size every single time. Not loaded once. Reloaded every turn.

Why Sessions Die Early

So now you can see why conversations eventually hit a wall. Your config files take up a fixed chunk of space every turn. Your conversation history grows on top of that. Eventually, they add up to more than the context window can hold.

In practice, with 5,000 tokens of fixed config and a 200K window:

Turn | Fixed Overhead | History | Total Used | Remaining
1    | 5,000          | 500     | 5,500      | 194,500
10   | 5,000          | 15,000  | 20,000     | 180,000
25   | 5,000          | 80,000  | 85,000     | 115,000
40   | 5,000          | 160,000 | 165,000    | 35,000
45   | 5,000          | 190,000 | 195,000    | 5,000
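You can run the same arithmetic for your own setup. A quick Python sketch, assuming a 200K window, 5,000 tokens of fixed overhead, and a constant ~4,000 tokens of history growth per exchange (real growth varies with message and response size, so treat the turn count as an estimate):

```python
WINDOW = 200_000       # context window size
FIXED = 5_000          # config overhead, reloaded in full every turn
PER_TURN = 4_000       # assumed average history growth per exchange

def remaining(turn: int) -> int:
    """Tokens left after `turn` exchanges: the window minus the fixed
    overhead and the accumulated conversation history."""
    return WINDOW - FIXED - PER_TURN * turn

turns_until_wall = (WINDOW - FIXED) // PER_TURN
print(turns_until_wall)  # → 48 under these assumptions
```

Halving the fixed overhead barely moves this number, but halving per-turn history growth (shorter pastes, fewer huge files) nearly doubles it, which is why compacting targets history first.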

Most AI coding tools have a safety valve called auto-compacting. When the conversation gets too long, the tool summarizes older turns into a shorter version to buy more runway.

The catch: config files are exempt from compacting. System instructions, CLAUDE.md, and memory files all reload at full size every turn, before and after compacting. Only the conversation history gets compressed. The fixed overhead? Always there.

This is why configuration file size matters more than you'd expect. 5,000 tokens of config, reloaded 25 times, is 125,000 tokens of cumulative overhead. At 40 turns, that's 200,000. The bigger your fixed overhead, the fewer turns you get before the wall.

What Caching Actually Does

If you've heard about "prompt caching" from Anthropic or OpenAI, it sounds like it should help with this. It doesn't touch capacity. Same window, same fill rate.

Prompt caching is a cost and speed optimization. When the same prefix (system instructions, config files) appears in consecutive requests, the provider can skip re-processing those tokens. You get faster responses and a lower per-token price. But the tokens are still there. The full prompt still loads. The context window still fills up at the same rate.

What Caching Helps

Cost: cached tokens are charged at a reduced rate
Speed: faster time-to-first-token
Latency: less re-processing per turn

What Caching Does Not Help

Window: the same tokens still consume capacity
History: the conversation still grows each turn
Overhead: config files still reload at full size

Bottom line: Caching makes your overhead cheaper and faster to process. It does not make it smaller. Your context window capacity is unchanged.

Input token cost: 10x cheaper
Latency reduction: up to 85%
Cache duration: 5-10 minutes

Your system instructions, config files, and memory files reload every turn. With caching, the provider recognizes that prefix hasn't changed and serves it from stored KV matrices instead of recomputing from scratch. You still use the same window space, but you pay a fraction of the cost. Honestly, this is a huge part of what makes tools like Claude Code viable; without caching, sending thousands of config tokens every turn would be cost-prohibitive. For a deeper technical look at how this works, ngrok's breakdown covers the mechanics well.
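A toy illustration of the prefix-matching idea: hash the fixed prefix and skip the "expensive" work when the same prefix shows up again. Real KV caching happens inside the model's attention layers, not in application code; this Python sketch only mimics the hit/miss behavior, and the prompt is still full length either way:

```python
import hashlib

kv_cache: dict[str, str] = {}  # prefix hash -> stand-in for stored KV state

def process(prompt: str, prefix_len: int) -> tuple[str, bool]:
    """Toy prompt-cache: reuse stored state when the fixed prefix repeats.
    Caching saves compute and cost, not context window space."""
    prefix = prompt[:prefix_len]
    key = hashlib.sha256(prefix.encode()).hexdigest()
    hit = key in kv_cache
    if not hit:
        kv_cache[key] = f"state:{len(prefix)}"  # "recompute" once, then store
    return kv_cache[key], hit

_, hit1 = process("SYSTEM+CONFIG|" + "turn 1", 14)
_, hit2 = process("SYSTEM+CONFIG|" + "turn 2", 14)
print(hit1, hit2)  # → False True  (the second turn reuses the cached prefix)
```

Note what never changes here: both prompts are the same length. The cache changes how much work turn two costs, not how much room it takes.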

Caching makes repeated overhead dramatically cheaper, but it doesn't give you more room. Both things matter. This guide focuses on the capacity side because that's the part you can actually control.

Phantom Token Consumption

Phantom tokens are tokens in your config that load every turn but do nothing useful. Duplicates. Redundant content. Documentation that could be a one-line pointer to a reference file instead of a full copy.

I audited my own Claude Code setup and found about 6,850 tokens of config loading per turn. After cleanup, it was 5,250. A 23% reduction, zero information lost.

Before Audit (6,850 tokens/turn)

System: 800
Duplicate: 800
Config: 3,700
Memory: 1,550

After Audit (5,250 tokens/turn)

System: 800
Config: 2,950
Memory: 1,500

1,600 tokens saved per turn, a 23% reduction

The phantom tokens fell into three categories:

Duplicates (800 tokens wasted): a Windows path quirk loaded the same file twice.
Inline references (~500 tokens wasted): replaced copied content with one-line pointers.
Over-documentation (~300 tokens wasted): removed what the source files already say.

MCP Servers vs Skills

Not all tool integrations cost the same tokens. MCP servers and skills do similar things, but the way they load into your context window is completely different.

MCP servers register every tool definition up front. The full schema for every tool loads into your context on every single turn, whether you use those tools or not. If you've got a server with 20 tools and you only use 2 of them regularly, the other 18 are phantom overhead.

Skills work differently. They register with just a short description (a few tokens). The full prompt only loads when you actually invoke the skill. Same capability, fraction of the per-turn cost.

MCP Servers

All tool definitions load every turn
Unused tools still consume context space
Can be 20%+ of your total context window

Skills

A short description registers (a few tokens)
The full prompt loads only when invoked
Near-zero idle cost per turn

This is worth auditing. Check your /context breakdown; if MCP tool definitions are eating 20%+ of your window, ask yourself how many of those tools you actually use in a typical session. The ones you don't? That's pure phantom overhead, reloading every turn for nothing.

The takeaway: MCP servers are an all-or-nothing context cost. If a skill can do the same job (or a tool supports deferred loading), you'll cut real per-turn overhead.
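The cost difference is easy to put numbers on. A Python sketch with assumed sizes (~400 tokens per MCP tool schema, ~15 tokens per skill description; your actual schemas will differ):

```python
def mcp_idle_cost(tool_schemas: dict[str, int]) -> int:
    """MCP-style: every tool's full schema loads every turn, used or not."""
    return sum(tool_schemas.values())

def skills_idle_cost(descriptions: dict[str, int]) -> int:
    """Skill-style: only short descriptions load up front; the full
    prompt loads only when a skill is actually invoked."""
    return sum(descriptions.values())

# Hypothetical server: 20 tools with ~400-token schemas each, versus
# the same 20 capabilities exposed as skills at ~15 tokens each.
schemas = {f"tool_{i}": 400 for i in range(20)}
descs = {f"skill_{i}": 15 for i in range(20)}
print(mcp_idle_cost(schemas), skills_idle_cost(descs))  # → 8000 300
```

Under these assumed numbers, the idle cost drops by more than 25x, and that saving is paid back on every single turn.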

How to Audit Your Setup

If you're using any AI coding tool with custom configuration, this is how you find and cut phantom tokens.

Audit Steps

1. Count your preloaded tokens. Paste your config files into a token counter so you know the baseline cost before you start optimizing.

2. Check for duplicates. Look for files loading twice. On Windows, path casing differences can cause silent duplication.

3. Find inline references. Identify content copied into config that already exists in source files, and replace it with one-line pointers.

4. Test removal impact. Remove content, run your normal workflow, and check whether output quality actually changes.
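The counting step can be partly automated. A Python sketch that walks a directory for common config file names and estimates their per-turn cost with the ~4 chars/token heuristic (the file names and the heuristic are assumptions; swap in a real tokenizer for exact counts):

```python
from pathlib import Path

def audit_config(root: str,
                 names: tuple[str, ...] = ("CLAUDE.md", "MEMORY.md")) -> dict[str, int]:
    """Estimate the per-turn token cost of each preloaded config file,
    using the rough ~4 chars/token heuristic."""
    costs: dict[str, int] = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.name in names:
            costs[str(path)] = len(path.read_text(errors="ignore")) // 4
    return costs

# Print the biggest offenders first to establish your baseline:
# for f, t in sorted(audit_config(".").items(), key=lambda kv: -kv[1]):
#     print(f"{t:>6}  {f}")
```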

What You Gain

More Turns Per Session

Fewer fixed tokens means more room for conversation before hitting the wall.

Lower Cost Per Message

Input tokens are billed. Fewer tokens reloading per turn means real savings at scale.

Faster Responses

Less to process means quicker time-to-first-token on every turn.

Cleaner Config

A leaner config is easier to maintain and less likely to confuse the model with contradictory instructions.

Where to look, by tool:

Claude Code
CLAUDE.md files (global + project)
MEMORY.md + topic memory files
MCP server definitions
Cursor
.cursorrules file
Context files and @-references
Project-level instructions
GitHub Copilot
.github/copilot-instructions.md
Custom instructions in settings
Workspace context files
Any AI Tool
System prompt / custom instructions
Attached reference documents
Pinned context or files
The golden rule: If content exists in a source file the AI can read on demand, don't duplicate it in configuration that loads every turn. Point to it instead.

The Takeaway

All that config you've set up (your CLAUDE.md, memory files, system instructions) reloads from scratch on every single turn. That's the overhead. So the real question is simple: is every token that reloads actually earning its keep?

Duplicate files, verbose reference material that could be a one-line pointer, documentation that just restates what's already in the source code. Those are tokens you're paying for on every message without getting anything back.

It's not dramatic. It's just housekeeping that adds up.

On the cost side, prompt caching has that covered. The providers have made repeated overhead cheap to process. Your job is making sure what's repeating is worth repeating.
