Co-authored with Claude. Final editorial judgment and accountability are Alex's.
Pricing as of April 2026. All figures use Opus 4.6 API rates, warm cache unless noted. Subscription plans (Pro, Max) package this spend into usage allocations — the mechanics still apply.
This is the deep reference for the Economics of Claude Code series. For the story, the math, and the fixes, start with the series overview.
The Cost Drivers
Ten things that move the meter, ordered roughly by how much they hurt when nobody's watching. For each one:
- What it is
- Observable / Controllable: whether you can actually see the cost and whether you can actually change it
- How to manage: the levers
- How to monitor: what to watch for
"Observable" means the cost leaves a trace you can see (a token count, a billing line, a session log). "Controllable" means you can actually change your behavior or configuration to move the number. Some things are observable but not very controllable (model list prices), some are controllable but not observable (interaction rhythm), and a few are neither. Those last ones are the ones to be most careful about.
1. Front-loaded context (system prompts, CLAUDE.md, project instructions)
What it is. Everything that gets loaded into the conversation before you send your first real message. Claude's own system prompt, tool definitions, the CLAUDE.md file at the root of the project (and any parent directories), any --append-system-prompt content. This entire prefix gets cached once at a slight write premium, then read cheaply on every subsequent turn.
The obvious strategy is "cram everything in there." But every token you front-load makes every subsequent message more expensive, because the full cached prefix gets reread on every turn. A 50,000-token CLAUDE.md means every message, even a terse "yes, do it," reprocesses all 50,000 tokens at cache-read rates. Cheap per token, but it adds up across 40 turns.
On the other hand, if you leave important context out and Claude has to ask for it mid-conversation (or read a file to figure it out), those clarification rounds add tokens permanently. They get reprocessed on every future message too.
Observable: Yes, directly. It's all content you authored or configured. Controllable: Yes, directly.
How to manage. Front-load what's relevant to most messages in a session; pull in what's relevant to one task on demand. Coding standards, architectural conventions, and workflow preferences belong in the prefix. The contents of a specific file you need to edit once do not. Keep project-level instructions in the low single-digit thousands of tokens unless you have a strong reason. If you notice yourself re-explaining the same thing mid-conversation, promote it to the prefix. If the prefix covers things that only matter one time in ten, demote it to an on-demand file.
How to monitor. Count tokens in your CLAUDE.md and any system-prompt content. Multiply by a typical turn count (say, 30). That's the input-side cost floor for a session before any real work happens. Compare fresh-session startup cost across projects. If one project is markedly more expensive per turn, the prefix is usually where the weight is.
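The "multiply by a typical turn count" arithmetic is worth making concrete. A minimal sketch, with an assumed placeholder dollar rate (not a real Opus 4.6 price) and the roughly 10%-of-base cache-read ratio from Anthropic's published pricing:

```python
# Back-of-envelope cost floor for a front-loaded prefix.
# BASE_INPUT_PER_MTOK is a hypothetical rate, not real pricing;
# cache reads bill at roughly 10% of the base input rate.
BASE_INPUT_PER_MTOK = 5.00
CACHE_READ_PER_MTOK = BASE_INPUT_PER_MTOK * 0.10

def prefix_floor(prefix_tokens: int, turns: int) -> float:
    """Dollars spent rereading the cached prefix alone, before any real work."""
    return prefix_tokens * turns * CACHE_READ_PER_MTOK / 1_000_000

# A 50,000-token CLAUDE.md reread across 30 turns:
print(f"${prefix_floor(50_000, 30):.2f}")  # $0.75 at these illustrative rates
```

Small per message, but it is a floor: every turn pays it before any work happens, and it scales linearly with both prefix size and turn count.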
2. Cache lifespan (the five-minute rule)
What it is. The prompt cache expires after about five minutes of inactivity. The first message after that gap reprocesses the entire conversation at the full input rate plus a cache-write premium, about 12.5× the cost of the same message with a warm cache.
Observable: Indirectly. Your billing will show elevated input costs in aggregate, but there's no live indicator that says "your cache just died." Controllable: Yes. It's directly tied to your presence and pacing.
How to manage. Two good moves when you're about to step away: either send a trivial keep-alive message, or save the session state and start a fresh one when you're ready. Both are cheap. Doing nothing and returning to a cold cache is not. For long gaps (lunch, meetings, end of day), starting fresh is almost always the right call. The cognitive cost of "losing" the context is usually less than the dollar cost of reheating 60 turns of history.
How to monitor. Count cache misses per session. A cache miss shows up as a turn where the cache-read token count is near zero while the base input token count is very high. If you see more than one or two of those in a session, your rhythm has a gap problem.
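Where the 12.5× figure comes from: a warm turn rereads context at the cache-read rate (~0.1× base input on Anthropic's published ratios), while a cold turn reprocesses it at the cache-write rate (~1.25× base). A sketch with a hypothetical dollar rate:

```python
# Warm vs cold cost for the same message. The base rate is a
# placeholder; the 0.10 and 1.25 multipliers are the published
# cache-read and cache-write ratios relative to base input.
BASE = 5.00 / 1_000_000  # hypothetical $ per input token

def turn_cost(context_tokens: int, warm: bool) -> float:
    # Warm: the whole context is reread at the cache-read rate.
    # Cold: it is reprocessed and rewritten to cache.
    rate = 0.10 * BASE if warm else 1.25 * BASE
    return context_tokens * rate

ctx = 120_000  # a mid-sized session's accumulated context
warm, cold = turn_cost(ctx, True), turn_cost(ctx, False)
print(f"warm ${warm:.3f}  cold ${cold:.3f}  ratio {cold / warm:.1f}x")
```

The ratio is fixed at 12.5× regardless of context size; what grows with the session is the absolute dollar gap, which is why a cold start at turn 60 hurts so much more than one at turn 5.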
3. Context accumulation
What it is. Every file Claude reads, every tool result, every code block it writes becomes permanent in the conversation until the session ends. A message at turn 5 is cheap. The same message at turn 50 is expensive because Claude rereads everything that came before on every single turn.
Reading a 2,000-line file adds roughly 20,000 input tokens to every subsequent turn for the rest of the session. Reading it again because you forgot what was in it? Two copies.
Observable: Yes. Input token count per turn rises visibly as the session grows. Controllable: Yes. You choose what to pull into context.
How to manage. Be deliberate about what you read. Prefer surgical reads (specific line ranges) over reading whole files when you only need a section. Avoid re-reading files you've already pulled in. Claude can search the existing conversation. For exploratory work, use a subagent with a narrow brief rather than bloating the main session. When a session gets long (say, past 30-40 turns or 100K+ tokens of accumulated context), write a handoff note and start fresh. A fresh session with 5,000 tokens of well-written handoff context is cheaper per turn than a stale session with 200,000 tokens of history.
How to monitor. Watch the turn-over-turn growth in input tokens. If it's climbing steeply, something is being pulled in that doesn't need to be. A healthy session has input token counts that grow, but slowly, as you progress from task to task.
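The compounding is easy to see in a toy model. This sketch (all token figures illustrative) tracks how one file read at turn 5 raises the input bill of every turn after it:

```python
# How one file read compounds across a session.
# Figures are illustrative, not measured.
def session_input_tokens(base_prefix: int, reads: dict[int, int], turns: int):
    """Yield (turn, input tokens reread that turn), given {turn: tokens_read}."""
    ctx = base_prefix
    for turn in range(1, turns + 1):
        ctx += reads.get(turn, 0)  # anything read this turn is permanent
        yield turn, ctx

# Reading a 20,000-token file at turn 5 taxes every later turn:
for turn, tokens in session_input_tokens(8_000, {5: 20_000}, 8):
    print(turn, tokens)
```

Turns 1-4 reread 8,000 tokens; every turn from 5 onward rereads 28,000. Read the same file twice and the step doubles, which is the "two copies" point above.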
4. Message count and turn churn
What it is. Every round trip costs tokens. Claude rereads the full conversation history, then generates a response. The full context is the dominant cost per turn once the session has any real size.
Three separate messages of "ok," "sounds good," and "yes do it" cost three full context rereads. Batching the same feedback into one message cuts the cost by two-thirds with the same outcome.
Observable: Yes. Message count is right there in the session. Controllable: Yes. It's a habit.
How to manage. Batch feedback. "Yes, do it. Also change the function name to X and add a test for the edge case" is one turn, not three. Use interruptions sparingly. Aborting a response mid-generation doesn't refund the context-read cost; it just replaces the output you were going to get. Save interruptions for when you genuinely need to redirect, not as a casual "wait, never mind."
How to monitor. Watch your messages-per-completed-task ratio. If you're sending ten messages for what should have been two, the churn is burning cache reads on conversation management instead of work.
The cumulative-cost-vs-turns chart above tells the same story from a different angle: every message reprocesses the full context, so message count and context accumulation are two sides of one meter. See the helper: Just One More Turn.
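The two-thirds claim is just multiplication, but it is worth seeing. A sketch with a hypothetical cache-read rate:

```python
# Cost of turn churn: three trivial messages vs one batched message.
# Each turn rereads the full context at the cache-read rate.
# The rate is a placeholder, not real pricing.
CACHE_READ = 0.50 / 1_000_000  # hypothetical $ per cache-read token

def churn_cost(context_tokens: int, messages: int) -> float:
    """Input-side cost of sending `messages` turns against a fixed context."""
    return context_tokens * CACHE_READ * messages

ctx = 150_000  # a long session's accumulated context
print(f"three messages: ${churn_cost(ctx, 3):.3f}")
print(f"one batched:    ${churn_cost(ctx, 1):.3f}")
```

The output of the turn barely matters; the context reread dominates, so "ok" costs nearly as much as a substantive message at the same point in the session.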
5. Agent spawning
What it is. When Claude Code delegates to sub-agents, each sub-agent gets its own conversation context and its own cache. The parent session's cache keeps ticking down while the sub-agents work. Three agents in parallel means four separate caches (parent plus three children), each cold-starting.
If the agents take more than five minutes, the parent's cache expires too. You come back from a parallel run and your first message in the parent is a full re-cache of everything.
Observable: Partially. You can see which agents ran and roughly how long they took. Their individual token usage is less visible in most tools today. Controllable: Yes. You choose whether to delegate, what context each agent gets, and whether to keep the parent warm.
How to manage. Give each agent a focused brief. A security reviewer needs the code and the security policy, not the full conversation about button colors. Pick model tier per agent (see next driver). When agents will take more than a few minutes, keep the parent session alive with small tasks, or save its state and let the agents finish in the background. Parallel critic passes are often worth the spend. Catching a wrong decision before implementation is much cheaper than fixing it after. Parallel implementation on coupled code is usually not worth it because of merge cost.
How to monitor. Track the ratio of agent-time to session-time. If agents are out for long stretches while the parent idles, the parent cache is probably dying repeatedly. If per-agent token usage is high, you're probably over-briefing them.
6. Model tier (Haiku / Sonnet / Opus)
What it is. Claude comes in three tiers. Opus is ~1.7× Sonnet's input cost and 5× Haiku's. The tiers exist for different kinds of work.
A useful analogy: model tiers are billing rates on a consulting team. Haiku is your junior associate: fast, cheap, excellent at well-defined tasks like file lookups and codebase exploration. Sonnet is your senior developer: implementation, debugging, code review, test writing. This is where most spend belongs. Opus is your principal architect: deep analysis, complex decisions, security review, plan critique. Worth it when the task requires that level of reasoning and a bad decision compounds into wasted implementation work.
Observable: Yes. Model choice is explicit and billed per tier. Controllable: Yes. You choose per task.
How to manage. Default to Sonnet for standard work. Use Haiku for lightweight scans and lookups. Reserve Opus for planning, architecture, critical review, and security, the places where cheaper planning produces more expensive implementation if the plan is wrong. If your spend mix is dominated by Opus for day-to-day coding, you're paying principal-engineer rates to search for filenames.
How to monitor. Break spend down by model. A healthy mix for most work is: majority Sonnet, a meaningful minority Haiku on exploration, and a smaller slice of Opus on strategic calls. If Opus is dominating your spend but you're shipping typical CRUD changes, something is miscalibrated.
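Checking the mix is a one-liner once you have token totals per model. A sketch using the relative rates the text cites (Opus ~1.7× Sonnet, 5× Haiku); the dollar figures are hypothetical, not real pricing:

```python
# Spend mix by tier. Rates are illustrative placeholders chosen to
# match the ratios in the text (Opus ~1.7x Sonnet, 5x Haiku).
RATES = {"haiku": 1.00, "sonnet": 3.00, "opus": 5.00}  # $/M input tokens

def spend_mix(mtok_by_model: dict[str, float]) -> dict[str, float]:
    """Fraction of total dollars per model, from millions of tokens used."""
    dollars = {m: mtok * RATES[m] for m, mtok in mtok_by_model.items()}
    total = sum(dollars.values())
    return {m: round(d / total, 2) for m, d in dollars.items()}

# 10M Haiku exploration, 8M Sonnet implementation, 1M Opus review:
print(spend_mix({"haiku": 10, "sonnet": 8, "opus": 1}))
```

That example lands near the healthy shape described above: Sonnet a clear majority of dollars, Haiku a meaningful minority, Opus a small strategic slice, even though Haiku processed the most tokens.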
7. Clarifying questions and interaction rhythm
What it is. A cost driver that doesn't appear in any pricing table: how the conversation flows. Claude shifts between just-doing-the-work mode and stopping-to-ask mode. Each stop is a round trip. Claude reprocesses the full conversation to generate the question. You read, think, respond. Claude reprocesses the conversation again to act. Two full context reads, and the only output was a decision Claude might have been able to make on its own.
The worst case is when a question makes you stop and think for a while. Five minutes of real consideration can kill the cache. Now you've paid for a question, a cache miss, and an answer: two context reads, one of them at cold-cache rates, for one decision.
The cheapest interaction pattern is when Claude has enough context and confidence to just execute. But there's a tension: you don't want Claude charging ahead on the wrong approach for 20 turns. A question that prevents wasted work is worth it. A question that a well-written prefix would have prevented is not.
Observable: Indirectly. You feel it in the pacing, not in any report. Controllable: Yes, through prompt design and through the prefix.
How to manage. When you notice Claude asking the same kind of question across multiple sessions, that's a signal to update the prefix. "What testing framework do you use?" or "should I follow your existing conventions?" are questions a good prefix answers once, forever. Be explicit in your first message about what authority Claude has to proceed. "Fix this and commit" establishes a different rhythm than "take a look and tell me what you'd do."
How to monitor. Count deliberation turns (Claude-asks-question, you-answer) versus execution turns. If the ratio is drifting toward more deliberation, either the task is genuinely ambiguous (and maybe needs a planning step) or the prefix is underspecified for the kind of work you're doing.
8. Long-running blocking tool calls
What it is. Any tool call that takes more than a few minutes while Claude waits. Test suites, integration runs, builds, multi-file static analysis. Claude sits idle while the tool runs. The cache ages. When the result comes back, Claude processes it, and if the gap was long enough, that one message triggers a full re-cache.
This is the single most expensive failure mode once sessions get long. A five-minute test run at turn 50 can be the most expensive message of a session.
Observable: Yes. Tool runtime is measurable. Controllable: Yes. You decide what gets called synchronously versus asynchronously.
How to manage. Move long-running work to infrastructure that doesn't make Claude wait. CI runs the test suite; Claude reads the result artifact when it's done. Background processes with file-based handoff beat synchronous waits. For smaller tool calls (under a minute), synchronous is fine. For anything approaching or exceeding the five-minute cache window, treat it as a hard architectural boundary.
How to monitor. Log tool-call durations. Flag anything that blocks Claude for more than ~3 minutes. Correlate long tool calls with cache-miss counts. That's where the cost concentrates.
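The file-based handoff pattern is simple plumbing. A minimal sketch, with a stand-in command where a real test suite would go; the paths and names are hypothetical:

```python
# Sketch of the non-blocking pattern: launch the slow job, return
# immediately, read the artifact on a later (warm) turn.
# The command here is a stand-in for a real test suite.
import pathlib
import subprocess
import tempfile

def launch_background(cmd: list[str], artifact: pathlib.Path) -> subprocess.Popen:
    """Start a long job whose output lands in a file, without blocking."""
    out = artifact.open("w")
    return subprocess.Popen(cmd, stdout=out, stderr=subprocess.STDOUT)

def result_ready(proc: subprocess.Popen) -> bool:
    """Poll instead of wait: the session keeps working meanwhile."""
    return proc.poll() is not None

art = pathlib.Path(tempfile.gettempdir()) / "slow_job.log"
proc = launch_background(["sh", "-c", "echo done"], art)
proc.wait()                      # in real use, keep working and poll instead
print(art.read_text().strip())   # done
```

The point is the shape, not the code: the turn that launches the job completes immediately and keeps the cache warm, and the turn that reads the artifact is an ordinary cheap read instead of a five-minute synchronous wait.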
9. Interruptions and aborted responses
What it is. Hitting escape mid-generation, canceling a tool call, switching direction before a response completes. Interruption feels free. You didn't "use" the output. But Claude already processed your full context and started generating. You pay for the input read and for whatever output was produced before the cancel.
Observable: Partially. Interrupted generations appear in usage data if you know where to look. Most people don't. Controllable: Yes. It's a habit.
How to manage. Use interruptions when a real course correction is needed. Don't use them as a casual "wait, never mind." Those cost the same as a completed response of the same length. If you interrupt frequently, it's usually a symptom of underspecified prompts.
How to monitor. If your tooling exposes it, track interruption rate per session. Otherwise, notice the pattern in your own behavior. If you're hitting escape more than a handful of times a day, there's prompt-design work to be done upstream of the interruption.
10. Output length
What it is. Output tokens are priced several times higher than input tokens (roughly 5× on every tier). Verbose output (long explanations, summaries of work already visible in diffs, restated context) adds cost that provides little value.
Observable: Yes, directly. Output tokens are billed separately. Controllable: Partially. You can ask for terseness. The model will drift toward longer responses without explicit guidance.
How to manage. Ask for short responses when short is enough. If you want a yes/no or a one-line diagnosis, say so. For coding tasks, ask Claude not to restate what the diff shows. You can read the diff. Put terseness guidance in the prefix if it matters consistently.
How to monitor. Compare output-token totals across similar work. If a standard bug fix produces a 3,000-token response when 300 would do, the terseness instructions aren't landing.
Benchmarking: Is Something Off in the Way You're Working?
The hardest part of managing Claude spend isn't identifying the drivers; it's knowing whether your numbers are normal for the kind of work you do. There's no universal benchmark. A plan-and-critique pass on a complex system should cost more than fixing a typo. A long research session should have a different token profile than a tight bug-fix loop.
That said, here are the rough benchmarks I use to decide whether a session looks healthy or whether something is off. They aren't hard thresholds. They're the numbers that, when they drift, make me stop and look.
Cache hit rate (cache reads / total input-side tokens). On a healthy multi-turn session, this should be comfortably above 80%. If it's below 60%, you have a cache-miss problem: long gaps, excessive front-loading, or a workflow that keeps resetting context. Below 40% on any session longer than a few turns is a red flag that something structural is wrong.
Cache miss count per session. Zero or one on a good session. Two or three on a session where you stepped away once. Anything approaching double digits is a symptom of the wait-and-see rhythm dominating.
Average idle gap between messages. Under two minutes on active work. Four minutes or longer and you're flirting with the TTL on every gap. Over five minutes average and most of your turns are starting cold.
Input token growth per turn. Should grow, but linearly and gently. If the per-turn count is doubling every few turns, something is being pulled in that doesn't need to be there.
Output token ratio. Output should usually be a small fraction of input. If output is climbing to 20% or more of total tokens on routine coding work, the model is being chatty about work that's already visible in the artifacts.
Model spend mix. For most engineering work, Sonnet should be the majority of spend. Haiku a meaningful minority. Opus is strategic, usually less than a third of spend unless you're doing heavy planning or review work. If Opus is dominating your spend on day-to-day coding, the tier choice is miscalibrated.
Cost per completed task. Harder to measure but most informative. Pick a representative task type and track it over time. If the cost per "routine feature" or "typical bug fix" is trending up without the tasks getting harder, something in the workflow is degrading.
Cost per productive message. A rough signal. Divide session cost by the number of messages that actually resulted in code, decisions, or output you kept. If that number is climbing, a lot of your spend is going to conversation management rather than work.
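Several of these checks can be computed mechanically from per-session token totals. A minimal sketch; the thresholds are the ones above, and the function and field names are invented for illustration, not any tool's API:

```python
# Session health check from raw token totals. Thresholds follow the
# benchmarks in the text; names are illustrative, not a real API.
def session_health(cache_read: int, base_input: int, output: int) -> dict:
    """Compute cache hit rate and output ratio, flag obvious problems."""
    input_side = cache_read + base_input
    total = input_side + output
    hit_rate = cache_read / input_side
    out_ratio = output / total
    return {
        "cache_hit_rate": hit_rate,   # want > 0.80 on multi-turn sessions
        "output_ratio": out_ratio,    # want < 0.20 on routine coding work
        "healthy": hit_rate > 0.80 and out_ratio < 0.20,
    }

# A session with repeated cold starts and chatty output fails both checks:
print(session_health(cache_read=400_000, base_input=250_000, output=90_000))
```

Numbers like these won't tell you which of the ten drivers is responsible, but a failed check narrows the search: a low hit rate points at rhythm and blocking tool calls, a high output ratio at terseness.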
When more than one of these signals is off at the same time, that's the moment to look at how you're working, not just what you're working on. The underlying issue is almost always one of the ten drivers above.