The Benchmark was never about you

Someone showed me a model comparison last month. Clean slide. Six models, seven benchmark categories, colour-coded scores. It had that consultant polish. They’d done the work. They had a winner.

I asked what the model did when it hit our legacy integration layer. You know the one. The one that sits on top of the mainframe. The one with the weird non-standard date formatting and the one-minute response timeout that nobody has ever documented. Long pause:

“We’d test that in the pilot.”

This is the leaderboard problem in a nutshell: they answer a question nobody was actually asking.

The leaderboard was built for someone else

Token benchmarks (MMLU, HumanEval, reasoning suites, the Chatbot Arena Elo rankings) exist to compare models against each other under controlled conditions. They are useful for the people who build models. They tell researchers something meaningful about capability gaps between architectures.

They tell you almost nothing about whether a model is right for your context.

Your context has non-normalised data of questionable quality, undocumented edge cases, latency constraints, integration patterns that predate the model’s training cutoff, and users who will find creative ways to break things the benchmark never imagined. A model that scores 91 on reasoning benchmarks is not inherently more useful than one that scores 84 when the gap between them disappears the moment either one touches your CRM’s export format.

The problem is not that leaderboards exist. The problem is that they’ve become procurement criteria. CTOs are selecting foundation models the way people used to select enterprise software: by checking which row has the highest number in the column that sounds important. It is almost as if the benchmark has replaced the evaluation and the score has replaced the question.

Goodhart’s Law is patient. It always collects.

Tools nobody understands

Vibe coding, for the unfamiliar, is the practice of generating functional software through AI prompting without deeply understanding what you’ve built. It works. The demo is often impressive. Except that the architecture underneath is a liability that won’t announce itself until something breaks in production at a time that is extremely inconvenient for everyone.

Now companies are running leaderboards on who uses the most tokens. Incentivising what, exactly? A Meta employee independently created a leaderboard tracking how many tokens (the basic units of data that AI models process) the company’s more than 85,000 employees used, The Information reported on Monday. Called “Claudeonomics”, after Anthropic’s AI model, the leaderboard showed the top 250 token users and awarded employees titles such as “Token Legend” and “Cache Wizard”.

The leaderboard encouraged “tokenmaxxing”, a growing phenomenon in Silicon Valley that treats token usage as a measure of productivity.

Thus, the things being built with and on top of these models are accumulating at a speed that has quietly outpaced comprehension.

I wrote about the Developer who knows Why things work earlier this year. The vibe coding problem is that same issue scaled from individual contributors to entire product teams. It is no longer one developer who can’t explain their pull request. It is organisations shipping AI-powered features where nobody in the room can give a coherent account of the system’s behaviour under load, at edge cases, or when the underlying model gets updated. Never mind deprecated. They know what it does in the demo. They don’t know what it does in the wild.

It almost looks as if understanding is becoming optional. And maybe even as if the optional part has been quietly dropped.

The AI that watches the AI

It gets even wilder. When these tools behave unexpectedly, which they do, the market has developed an answer: AI observability platforms. AI governance layers. AI auditing tools. Dashboards that aggregate model outputs, flag anomalies, and generate reports on system behaviour.

Yes. I hope that is as funny to you as it is to me. But then again, I have a dark sense of humour. In essence, it’s more vibe-coded AI, watching and reporting on the vibe-coded AI. Producing summaries that another AI will eventually be asked to interpret.

I want to be precise about what this is and what it isn’t. Monitoring tools are not useless. Knowing that your model’s output distribution has shifted, or that latency has spiked, or that a particular class of inputs is generating refusals at an unexpected rate, is operationally useful information. None of that is the problem.
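To make that distinction concrete: a drift check like the sketch below is the kind of thing these monitoring tools do, and it is genuinely useful. It just tells you that something changed, not why. Everything here (the refusal heuristic, the threshold, the function names) is an illustrative assumption, not any real platform’s API.

```python
def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses that look like refusals (a crude heuristic;
    a hypothetical stand-in for a real classifier)."""
    markers = ("i can't", "i cannot", "i'm unable")
    hits = sum(1 for r in responses if r.lower().startswith(markers))
    return hits / len(responses) if responses else 0.0

def drift_alert(baseline: list[str], recent: list[str],
                max_delta: float = 0.05) -> bool:
    """Flag when the recent refusal rate drifts past the baseline by more
    than max_delta. Operationally useful information; not understanding."""
    return abs(refusal_rate(recent) - refusal_rate(baseline)) > max_delta

# Baseline window: 2% refusals. Recent window: 20% refusals.
baseline = ["Sure, here is the report."] * 98 + ["I can't help with that."] * 2
recent = ["I can't help with that."] * 10 + ["Sure, here is the report."] * 40

print(drift_alert(baseline, recent))  # True: the rate shifted, cause unknown
```

The alert fires, and that is where the tool’s knowledge ends. Deciding whether the shift is a model update, a prompt regression, or a new class of user input still requires the human the next section is about.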

The problem is when monitoring becomes a substitute for understanding rather than a supplement to it. When the answer to “do we understand what our full stack is doing?” is “we have a dashboard”. When the governance review consists of running a report and noting that no red flags appeared. We are reading outputs we couldn’t interrogate if we tried, produced by tools we couldn’t explain, built on models selected by benchmarks that were never designed for our use case.

At that point you do not have AI governance. You have the idea of AI governance. Yes, that is a sneer. It looks identical until something goes wrong, at which point it looks nothing like governance at all.

What understanding actually requires

There is no monitoring tool that replaces a human being who actually knows how this specific model, with these tuned parameters, behaves in your specific context. Who has tested the edge cases your users will inevitably find. Who can read an anomaly report and tell you whether it matters. Who can answer the question: “Is this system doing what we think it’s doing, and how would we know if it stopped?”

That person is not a product. They are an engineer, or an architect, or a CTO who has done the unglamorous work of actually learning the system rather than delegating the learning to the system itself. Most organisations don’t have this person. The tooling has just made the gap harder to see.

The benchmark told you the model was capable. The vibe coding shipped it fast. The monitoring tool told you it was healthy. At no point did any of this require anyone to understand what was actually happening.

That is not a technology problem. It is a decision about what counts as good enough.

Listening to while writing: Shellac, At Action Park. For the same reason.
