Tokenmaxxing: Why Your Company's AI Bill Is About to Destroy Your Budget

An illustration of a man in a business suit walking along a tightrope over a background of lined ledger paper. A trail of metallic coins with microchips on them floats behind him, leading toward a red circular line.

There is a number that is not making headlines the way it should.

Five hundred million dollars. Spent on AI by a single company. In a single month. Not because they planned to. Because nobody set a limit.

And here is the part that makes it genuinely strange: this happened while the price of AI tokens was falling. While the industry was celebrating cheaper, more accessible AI. While every vendor was telling enterprises that AI was now affordable for everyone.

The cheaper AI got, the more money companies spent. The word for what happened (the cultural dynamic, the broken incentive structure, the systematic destruction of budget without corresponding value) is one you need to understand before it happens in your organisation.

It's called tokenmaxxing. And the definition you've probably heard is wrong.

What Tokenmaxxing Actually Is

Most people who have encountered the term think tokenmaxxing means using AI as effectively as possible: maximising the value you extract from every prompt. That's the positive framing. It's not what the term means in the context where it actually matters.

Tokenmaxxing is the deliberate or inadvertent maximisation of AI token consumption without corresponding value creation.

It is what happens when a company declares that AI usage is a KPI. When a manager announces that teams should be "using AI for everything." When a leaderboard ranks employees by how many tokens they consume per week. When developers start feeding the AI pointless tasks (renaming variables, generating boilerplate that takes five seconds to type, asking questions that memory would answer faster) just to hit a metric.

Tokenmaxxing is not a power-user skill. It is a corporate failure mode. And it is costing the industry billions.

The paradox of the Tokenpocalypse: token prices have dropped approximately 90% since 2023. By some metrics, the cost per token fell 280×. And yet total enterprise AI spending exploded by an estimated 320%. This is the Jevons Paradox made operational: when a resource gets cheaper to use, consumption can rise so fast that total cost goes up, not down. The cheaper the tokens, the more tokens get wasted.
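To make the paradox concrete, here is a minimal arithmetic sketch. It assumes that total spend is simply price per token times token volume, and it plugs in the approximate figures quoted above (a 90% price drop and a 320% rise in spend). The function name is made up for illustration.

```python
def implied_volume_ratio(price_drop, spend_growth):
    """Return how many times token volume must have grown.

    Assumes spend = price_per_token * token_volume, so
    volume_ratio = spend_ratio / price_ratio.
    """
    price_ratio = 1.0 - price_drop      # new price / old price
    spend_ratio = 1.0 + spend_growth    # new spend / old spend
    return spend_ratio / price_ratio


# Approximate figures from the text: prices down ~90%, spend up ~320%.
ratio = implied_volume_ratio(price_drop=0.90, spend_growth=3.20)
print(f"Implied growth in token volume: about {ratio:.0f}x")
```

Under that simple assumption, paying 4.2 times as much at one tenth of the price means consuming roughly 42 times as many tokens.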

The Leaderboard That Broke Everything

Amazon built exactly this system. They called it KiroRank.

KiroRank tracked token consumption by employee and published an internal leaderboard. The target: 80%+ weekly AI usage. The assumption behind it was that more AI use equals more productivity. That if you quantify AI engagement, you incentivise AI adoption, and that accelerates transformation.

What actually happened is a textbook case of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.

Developers figured out very quickly that the leaderboard rewarded AI interaction, not useful AI interaction. So they fed the AI trivial, pointless requests, thousands of them, purely to inflate their score. No value was created. Enormous resources were consumed. Amazon had to take KiroRank offline.

This is not an isolated story. It is the pattern.

Uber: Burned its entire annual AI token budget in four months. 84% of developers classified as "agentic users," generating costs that dwarfed projections.
Anonymous enterprise client: $500 million in AI costs in a single month, because no licence limits had been set for employee usage.
Meta: Employees consumed an estimated 60 trillion Claude tokens in 30 days.
NVIDIA: Compute costs for AI inference now exceed the team's personnel costs.
Disney: Warned its engineers explicitly not to maximise AI prompt usage, calling it "expensive procrastination."

Andrew Macdonald, COO at Uber, put it plainly: "If you can't draw a direct line to how many useful features and functionalities you're delivering to users, that trade becomes harder to justify. That connection isn't there yet."

The Hidden Losses Nobody Put in the Pitch Deck

The AI productivity promise was always about what gets created faster. Nobody put the downstream costs in the slide.

A study of 2,444 companies reveals what actually happens behind every dollar of AI token spend:

Cost Driver	Per $1 of AI Spend	What Causes It
Bug fixing	$0.44	Correcting errors introduced by AI-generated code
Code rewriting	$0.27	Manually rewriting inefficient or unstable AI output
Review delays	$0.11	AI code floods pipelines, slowing review and merge cycles
Total hidden losses	$0.82	For every $1 spent on tokens, $0.82 follows in hidden costs
Durable value created	$0.18	Only 18 cents of every AI dollar produces lasting output
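The rows of the table can be checked with a few lines of arithmetic. This is only a sketch of the bookkeeping, using the per-dollar figures quoted in the table; the names and the example budget of $1,000,000 are illustrative assumptions, not figures from the study.

```python
# Hidden costs per $1 of token spend, in cents (figures from the table above).
HIDDEN_COSTS_CENTS = {
    "bug fixing": 44,
    "code rewriting": 27,
    "review delays": 11,
}


def hidden_loss_cents(costs=HIDDEN_COSTS_CENTS):
    """Total hidden loss per dollar of token spend, in cents."""
    return sum(costs.values())


def durable_value_cents(costs=HIDDEN_COSTS_CENTS):
    """What remains of each dollar (100 cents) after hidden losses."""
    return 100 - hidden_loss_cents(costs)


def durable_value_of_budget(budget_dollars):
    """Scale the per-dollar figure to a whole token budget."""
    return budget_dollars * durable_value_cents() / 100


print(f"Hidden losses per $1: {hidden_loss_cents()} cents")
print(f"Durable value per $1: {durable_value_cents()} cents")
print(f"Durable value of a $1,000,000 budget: ${durable_value_of_budget(1_000_000):,.0f}")
```

Working in whole cents avoids floating-point rounding when the three cost drivers are added up.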

Faros AI found that Code Churn, the ratio of deleted to added code, increased by 861% with AI tool adoption. GitClear found that the revision effort is 2.2× greater than the productivity gain. The real acceptance rate of AI code, once you count revisions in the weeks following initial acceptance, drops to 10–30%, not the 80–90% figure that shows up in management dashboards.

The AI writes more, faster. Then the engineers rewrite most of it.

Rocket Fuel in a Lawnmower

There are two distinct kinds of AI cost problem. One is tokenmaxxing: using AI for everything, regardless of necessity. The other is model misallocation: using the wrong type of model for the task.

Reasoning models like OpenAI's o3 and Claude Opus are built for deliberative, multi-step logic. They generate internal "reasoning tokens" as they think, spending seconds or minutes working through a problem before producing an answer. For genuinely complex work, this is invaluable. For renaming a variable, it is, as Firat Elbey (Principal Product Manager) put it: "rocket fuel in a lawnmower."

Unnecessary reasoning cycles generate latency, drive up infrastructure costs, and consume energy with zero marginal benefit. Analysts estimate that prompt verbosity alone (including reasoning models used where lightweight models would suffice) costs enterprises tens of billions in excess compute annually.

The Jellyfish study makes the non-linearity vivid: a 10× token budget produces only a 2× output improvement. Tokens behave like rocket fuel: a modest gain in velocity demands a disproportionate increase in fuel.

And then there is the Agentic Loop Multiplier. Autonomous AI agents work in loops: Plan → Execute → Reflect → repeat. At every loop iteration, the agent must re-read the entire accumulated context from all previous steps. Cumulative token consumption therefore grows much faster than linearly: if each cycle adds a similar amount of context, the total grows roughly with the square of the number of cycles. Goldman Sachs projects this will produce a 24× increase in global token consumption by 2030, reaching 120 quadrillion tokens per month.
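The re-reading effect can be sketched with a toy model. It assumes every iteration adds a fixed number of tokens to the context and that the agent reads the whole context again on each iteration, with no caching or summarisation. Real agents differ, and the function name and the 1,000-token step size are purely illustrative.

```python
def cumulative_input_tokens(steps, tokens_per_step):
    """Total tokens read when each step re-reads all context so far.

    Toy model: context grows by a fixed amount per step, and the full
    context is read again at every step (no caching, no summarising).
    """
    total_read = 0
    context_size = 0
    for _ in range(steps):
        context_size += tokens_per_step  # new plan or tool output is appended
        total_read += context_size       # the whole context is read again
    return total_read


for steps in (5, 10, 20, 40):
    total = cumulative_input_tokens(steps, tokens_per_step=1_000)
    print(f"{steps:>2} iterations -> {total:>7,} tokens read")
```

In this simplified model, doubling the number of iterations roughly quadruples the tokens read, so the total grows with the square of the loop count rather than in step with it.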

The New Economics — and the Only Way Out

The market is already restructuring around the Tokenpocalypse. Subscription billing is dying.

Cursor, Vercel, Replit, and Lovable all switched to token-based usage billing in Spring 2025, passing consumption overruns directly to enterprises. The logic is simple: as agents take over tasks that once needed a human user, fewer seats are needed, while token consumption per remaining user climbs. SAP CEO Christian Klein called it "foolish" to continue seat-based subscriptions when AI automation devalues the per-user model. Microsoft pushed its own employees back toward GitHub Copilot; industry analysts read this as cost control, not capability consolidation.

Meanwhile, the price war is brutal. DeepSeek charges $3.48 per million output tokens. OpenAI and Anthropic charge $25–30 for comparable tasks. That roughly 7–9× gap is forcing a multi-model strategy on every serious enterprise: cheap models for standard work, premium models only where complexity demands it.
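Dividing the quoted prices gives a gap of roughly 7.2× to 8.6×. The sketch below does that division and also estimates a blended price for a multi-model setup. The 80% routing share is an assumption chosen purely for illustration, and the function names are hypothetical.

```python
# Prices in USD per million output tokens, as quoted in the text above.
BUDGET_PRICE = 3.48
PREMIUM_PRICE_RANGE = (25.0, 30.0)


def price_multiple(premium_price, budget_price=BUDGET_PRICE):
    """How many times more expensive the premium model is."""
    return premium_price / budget_price


def blended_price(share_on_budget_model, premium_price, budget_price=BUDGET_PRICE):
    """Average price per million tokens when a share of the output
    tokens is routed to the budget model and the rest to the premium one."""
    premium_share = 1.0 - share_on_budget_model
    return share_on_budget_model * budget_price + premium_share * premium_price


low, high = (price_multiple(p) for p in PREMIUM_PRICE_RANGE)
print(f"Premium models cost about {low:.1f}x to {high:.1f}x the budget price")

# Assumption for illustration: 80% of output tokens go to the budget model.
print(f"Blended price: ${blended_price(0.8, 30.0):.2f} per million output tokens")
```

The blended figure only covers token prices; it says nothing about quality differences or the governance risks of each provider.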

The era of unlimited AI consumption is over. The companies that win the next phase will not be those that used AI the most. They will be the ones that built governance around AI use and required a direct, measurable line between token spend and user-facing value. Tokenmaxxing was the experiment. The bill just arrived. Knowing when to engage deep processing is the new alpha.

After the AI Hype

The tokenmaxxing crisis is the operational version of the AI hype problem. What happens when the bubble meets the balance sheet, and who was right all along.

Ford Fired the Humans, Trusted the Machines

Ford's AI quality push is the manufacturing equivalent of tokenmaxxing. What happens when you deploy AI before you understand its limits.

The Internet Is Drowning in AI Slop

The content equivalent of tokenmaxxing: volume without value, at industrial scale. How AI slop is poisoning the information ecosystem.

The DeepSeek Sputnik Moment

The $6 million model that rewrote AI economics, and whose efficiency story is the direct answer to the tokenmaxxing crisis.

Tokenmaxxing: FAQ

What is tokenmaxxing?

Tokenmaxxing is the deliberate or inadvertent maximisation of AI token consumption without corresponding value creation. It occurs when employees or organisations use AI for every trivial task (renaming variables, writing one-line functions, generating answers to questions that human memory would answer faster) simply because the AI is available and cheap per token. When companies mistakenly equate token consumption with productivity, or track AI usage as a KPI, they create incentive structures that reward volume of AI use over quality of output. The result: budgets burned, code rewritten, and no measurable improvement in what actually gets delivered to users.

What is the Tokenpocalypse?

Tokenpocalypse is the industry term for the current crisis in which companies are burning through AI budgets at catastrophic speed despite, or because of, falling token prices. Token prices have dropped approximately 90% since 2023 (by some metrics up to 280×), yet total enterprise AI spending has exploded by an estimated 320%. This follows the Jevons Paradox: when a resource becomes cheaper, consumption increases disproportionately, often to the point where total spend is far higher than before the price drop. Uber burned its entire annual AI budget in four months. An anonymous enterprise client spent $500 million in a single month. Meta employees consumed an estimated 60 trillion Claude tokens in 30 days.

What is the Jevons Paradox and how does it apply to AI tokens?

The Jevons Paradox, first described by economist William Stanley Jevons in 1865, states that when the efficiency of using a resource increases (making it cheaper per unit), total consumption of that resource can rise rather than fall, because the lower price triggers far greater demand. In AI, tokens cost roughly 90% less per unit than they did in 2023. But because they are so cheap, organisations now use AI for tasks they would never have automated before. The barrier to use vanishes. The individual cost per interaction is trivial. The aggregate cost is catastrophic.

What is Goodhart's Law and how did it destroy Amazon's KiroRank?

Goodhart's Law states: "When a measure becomes a target, it ceases to be a good measure." Amazon implemented an internal leaderboard called KiroRank that tracked and ranked employees by AI token consumption, setting targets of over 80% weekly usage. Developers began feeding AI trivial, pointless requests purely to inflate their score. The leaderboard rewarded volume of AI interaction, not value of output. Amazon was forced to take KiroRank offline. The episode is now a canonical case study in how measuring AI activity as a proxy for AI productivity produces the opposite of the intended result.

What are the hidden costs behind every dollar of AI token spend?

A study of 2,444 companies reveals that behind every dollar spent on AI tokens, $0.82 in hidden losses follows. $0.44 goes to fixing bugs introduced by AI-generated code. $0.27 goes to rewriting AI-produced code that was inefficient, unstable, or wrong. $0.11 is lost to review and merge delays as AI-generated code floods engineering pipelines. Only $0.18 of every dollar invested in AI tokens generates durable, lasting value. The nominal productivity gains from AI code generation are more than consumed by the downstream maintenance and correction costs. Researchers call this phenomenon "technical debt acceleration."

What is Code Churn and how has AI caused it to explode?

Code Churn is the ratio of deleted or modified lines of code to newly added lines: a measure of how much code needs to be revised after it is written. Research from Faros AI shows that Code Churn has increased by 861% with the adoption of AI coding tools. GitClear found that the revision effort is 2.2 times greater than the productivity gain from AI code generation. While management sees AI acceptance rates of 80–90%, the real acceptance rate, accounting for the code that gets rewritten in the weeks after initial acceptance, is only 10–30%. AI writes more code faster. But much of it needs to be corrected, rewritten, or discarded almost immediately.

What is the Agentic Loop Multiplier?

The Agentic Loop Multiplier is the compounding cost amplifier that occurs when autonomous AI agents work in iterative loops. Unlike a simple chat interaction, an agentic AI works in cycles: Plan → Execute → Reflect → repeat. At every loop step, the agent must re-read the entire accumulated context from previous steps before deciding what to do next. Cumulative token consumption therefore grows much faster than linearly with the number of steps: roughly with its square, if each step adds a similar amount of context. Goldman Sachs projects that the rise of agentic AI will produce a 24× increase in global token consumption by 2030, reaching 120 quadrillion tokens per month.

Why are reasoning models wasteful for simple tasks?

Reasoning models (OpenAI o3, Claude Opus) generate internal "reasoning tokens" as they think through problems step by step, spending seconds or minutes before producing an answer. This is enormously valuable for genuinely complex, multi-step problems. But these models are routinely deployed for trivial tasks that require no deliberation. Using a reasoning model for "1+1" is, as Firat Elbey (Principal Product Manager) put it, "rocket fuel in a lawnmower." The unnecessary reasoning cycles generate latency, drive up infrastructure costs, and consume energy with zero marginal benefit. Analysts estimate unnecessary prompt verbosity costs enterprises tens of billions of dollars annually in excess compute.

What is the solution to tokenmaxxing?

The solution is adaptive resource allocation: matching model capability to task complexity. Use lightweight, fast, inexpensive models for routine tasks: boilerplate code, documentation, standard refactoring, simple queries. Reserve reasoning-capable, expensive models for genuinely complex problems requiring multi-step deliberation. Implement token consumption governance: set budgets, monitor usage, require direct accountability between AI spend and user-facing feature delivery. Andrew Macdonald (COO, Uber): "If you can't draw a direct line to how many useful features and functionalities you're delivering to users, that trade becomes harder to justify." Knowing when to engage deep processing is the new competitive alpha.
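As a rough illustration of what "match the model to the task, and cap the spend" could look like in code, here is a minimal sketch. Everything in it is hypothetical: the tier names, the task categories, the prices (borrowed from the per-million output token figures quoted in this article) and the budget limit. A real system would classify tasks far more carefully.

```python
# Hypothetical tiers with prices in USD per million output tokens.
MODEL_PRICES = {"light": 3.48, "reasoning": 30.0}

# Hypothetical task categories considered routine enough for the light tier.
ROUTINE_TASKS = {"boilerplate", "documentation", "refactor", "simple_query"}


class BudgetExceeded(Exception):
    """Raised when a request would push spend past the configured limit."""


def choose_model(task_type):
    """Route routine work to the light tier, everything else to reasoning."""
    return "light" if task_type in ROUTINE_TASKS else "reasoning"


class TokenBudget:
    """Tracks spend against a hard limit (a deliberately simple policy)."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, model, output_tokens):
        cost = MODEL_PRICES[model] * output_tokens / 1_000_000
        if self.spent_usd + cost > self.limit_usd:
            raise BudgetExceeded(f"{model} request would exceed ${self.limit_usd:.2f}")
        self.spent_usd += cost
        return cost


budget = TokenBudget(limit_usd=5.00)
for task in ("documentation", "architecture_review"):
    model = choose_model(task)
    cost = budget.charge(model, output_tokens=50_000)
    print(f"{task}: routed to {model}, cost ${cost:.3f}")
```

A hard stop is the bluntest possible policy; alerting, per-team budgets or approval steps are obvious alternatives, and tying each charge to a delivered feature is the part no code sketch can do for you.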

What role do Chinese AI models play in the token cost war?

Chinese AI providers, led by DeepSeek, have introduced extreme price competition that is reshaping global AI economics. DeepSeek charges approximately $3.48 per million output tokens. OpenAI and Anthropic charge $25–$30 per million output tokens for comparable tasks. This roughly 7–9× price gap is forcing enterprise buyers to adopt multi-model strategies: use cheap models for standard, low-risk tasks; reserve premium US models for business-critical, compliance-sensitive, or security-constrained work. The price arbitrage is real, but so is the strategic risk: Chinese models are subject to different data governance, export control implications, and ideological constraints that must be assessed per use case.

Jans Bock-Schroeder

Publisher & Founder of AI Angst

Coming from the world of art, photography, and the luxury market, Jans launched AI Angst in 2025 to explore the cultural, ethical, and psychological impacts of artificial intelligence. His work bridges creative vision with critical technology analysis, offering clarity in an era of rapid technological change.

Sources and Citations

This article is based on the following primary sources, research studies, and industry reports:

Faros AI: "Engineering Efficiency Report: Code Churn and AI Tools" (2024–2025)
Primary source for the 861% Code Churn increase figure, cited alongside Waydev and GitClear research on AI-generated code revision rates.
https://www.faros.ai/
GitClear: "Coding on Copilot" Research Report (2024)
Source for the finding that AI code revision effort is 2.2× greater than the productivity gain, and for real acceptance rate data of 10–30%.
https://www.gitclear.com/
Goldman Sachs: "Generative AI: Too Much Spend, Too Little Benefit?" and AI token consumption projections (2025)
Source for the 24× token consumption increase projection by 2030 and the 120 quadrillion tokens/month agentic AI forecast.
https://www.goldmansachs.com/insights/
Jellyfish: "Engineering Benchmarks: AI Token Budget and Output Scaling" (2025)
Source for the non-linear scaling finding: a 10× token budget produces only a 2× output improvement.
https://jellyfish.co/
Andrew Macdonald (President/COO, Uber): Public statements on AI ROI and token governance (2025)
Source for the direct quote on connecting AI spend to user-facing feature delivery.
https://www.uber.com/newsroom/
AI Angst: "Die Tokenmaxxing-Krise" Research Briefing (2025/2026)
Internal editorial research briefing (German language): "Die Tokenmaxxing-Krise: Kosten, Ineffizienz und der Wandel der KI-Wirtschaft." Source for KiroRank details, Disney warning, Meta/Uber/NVIDIA budget figures, and Jevons Paradox framing.
Internal research briefing: AI Angst editorial archives.

Last verified: July 5, 2026.
