There is a slide that has been appearing in decks lately. It has a bar chart. The X-axis is team names or business units. The Y-axis is tokens consumed. There is a goal line. Some bars are red. The slide is usually called something like “AI Adoption Dashboard” or “Generative AI Utilization” or my personal favourite, “AI Maturity Index.”
Tokenmaxxing
This is tokenmaxxing: the practice of consuming AI tokens not to produce useful output, but to satisfy a metric that somebody decided was a proxy for AI adoption. It’s been reported at Meta, where internal leaderboards track token usage and engineers have quietly learned that burning tokens on low-value tasks keeps their position looking healthy. It’s been reported at Amazon, where an internal tool called MeshClaw has been deployed under explicit management pressure, and where employees told the Financial Times, with a candour that must have been cathartic, that “some people are just using it to maximise token usage.”
Charlie Munger said it more concisely than I can:
“Show me the incentive and I’ll show you the outcome.”
The specific cruelty of this particular law
Goodhart’s Law is patient. (I said this in the benchmark article. I’ll probably still be saying it in 2032.) When a measure becomes a target, it ceases to be a good measure. But there’s a specific cruelty to tokenmaxxing that goes beyond the standard Goodhart problem.
When you set a revenue target and salespeople game it by pulling deals forward, you at least end up with revenue. When you set a token target and engineers game it by padding prompts, making the model re-explain things it already explained, and generating outputs that get immediately discarded, you end up with API bills. That is the complete list of outputs.
This is measuring steering wheel turns to assess whether your drivers are getting to their destinations. Some of them are. Some of them have learned to be very efficient at turning the wheel.
The $200 billion question
Amazon is spending $200 billion on AI infrastructure this year. Some portion of the utilization that justifies that number will be people asking a model to summarize a document it already summarized, because the quarter ends on Friday.
I want to be fair to the managers who commissioned the bar chart. They have a real problem: they need to demonstrate to their leadership that the AI investment is being used. “Are we getting value from this?” is a reasonable question. Tokens are a visible number. Visibility is appealing when the alternative is admitting that value is difficult to measure and you haven’t yet figured out how.
The mistake is assuming that tokens consumed answer the question. Tokens tell you whether the API endpoint was called. They tell you nothing about whether the work changed. Whether the decision was better. Whether the customer interaction went differently. You might as well measure keystrokes to evaluate developer productivity. (Some organizations have tried this. It did not end well for anyone, least of all the developers.)
What fills a vacuum
I’ve noticed that the organizations most enthusiastic about token metrics tend to be the ones that haven’t yet decided what they want AI to do. The metric is filling a vacuum. It looks like progress because it goes up. It measures nothing that matters because nothing that matters has been defined yet.
The organizations that will eventually fix this are the ones that insist on measuring downstream. Not tokens consumed, but decisions influenced. Not API calls made, but problems the model actually touched end to end. That is harder to put in a bar chart. It requires understanding your own workflows well enough to know where the model adds value, which requires doing the unglamorous work of actually mapping those workflows rather than deploying a tool and pointing at the billing dashboard.
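The difference between the two measurements can be sketched in a few lines. Everything here is invented for illustration: the field names, the idea that you could cleanly log whether an output changed a decision. In practice, producing that second label is exactly the unglamorous workflow-mapping work described above.

```python
# Hypothetical sketch: why a token metric and a downstream metric
# can tell opposite stories about the same usage log.
from dataclasses import dataclass

@dataclass
class Interaction:
    tokens_used: int        # what the billing dashboard sees
    output_was_used: bool   # did the result survive into real work?
    decision_changed: bool  # did it alter what a human actually did?

def token_metric(log: list[Interaction]) -> int:
    # The bar-chart number: goes up whenever the endpoint is called.
    return sum(i.tokens_used for i in log)

def downstream_metric(log: list[Interaction]) -> float:
    # Fraction of calls whose output actually influenced a decision.
    useful = sum(1 for i in log if i.output_was_used and i.decision_changed)
    return useful / len(log) if log else 0.0

# Two teams with identical token consumption, very different utility.
padding_team = [Interaction(5000, False, False)] * 20
focused_team = [Interaction(5000, True, True)] * 20

assert token_metric(padding_team) == token_metric(focused_team)
assert downstream_metric(padding_team) < downstream_metric(focused_team)
```

On the slide, these two teams produce the same bar. The divergence only becomes visible once someone defines and logs the downstream signal, which is the part that cannot be read off the billing dashboard.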
The benchmark article was about scores replacing evaluation. This is the same problem from the other direction: measuring adoption instead of utility. Both arrive at the same place. Nobody has to understand what the thing is actually doing. The number goes up and the slide looks fine and the quarterly review moves on.
Goodhart’s Law collects either way.
Listening to while writing: Unwound, Leaves Turn Inside You. Which is also about something expanding until it’s unrecognizable.