By Our Measure

One of the things I keep coming back to with AI is how quickly we fall back on human comparisons.

As these systems improve, we naturally ask versions of the same question over and over: can it do the kinds of things people do? Can it reason well, write well, code well, make decisions well? Can it fit into the institutions and workflows we already have? A lot of the conversation, even now, still assumes that progress in AI will mostly show up as better performance on things we already recognize and already know how to value.

That makes sense up to a point. Human intelligence is the only intelligence we actually know from the inside. Human work is what our economy is organized around. Human institutions are the structures any new technology has to enter if it’s going to matter in practice. So of course those become the reference points.

Still, I think there’s a way this can quietly distort how we see what’s in front of us. What feels natural to us can start to feel objective. We forget that our benchmarks are coming from somewhere — from our own habits of mind, our own institutions, our own sense of what useful intelligence is supposed to look like.

The AlphaGo match against Lee Sedol (DeepMind, AlphaGo research overview) is still maybe the cleanest example of this. In 2016, AlphaGo played a move that professional commentators initially read as a mistake. It just didn’t look like the sort of move a great human player would make. Later it became clear that this reaction had a lot more to do with the limits of human expectation than with the quality of the move itself.

I don’t think the lesson there is that every weird machine output is secretly brilliant. Most weird machine output is just bad. The more interesting lesson is narrower than that. Sometimes a system is doing something genuinely effective, but because it doesn’t arrive in a form we’re used to recognizing, our first instinct is to misread it.

Something similar happened with AlphaFold, though in a very different domain. In the Nature paper on its CASP14 performance (Jumper et al., “Highly accurate protein structure prediction with AlphaFold”, Nature, 2021), AlphaFold reached a median backbone accuracy of 0.96 angstroms, compared with 2.8 angstroms for the next-best method. Its database later expanded to more than 200 million predicted protein structures. What seems important to me here is not whether AlphaFold “understands biology” in a way that resembles a strong human biologist. It’s that it generated scientific usefulness at a scale no human researcher or research team could have matched directly. Biology, for lack of a better way to put it, was responsive to what it produced.

That’s part of why these examples feel connected to me. In both cases, the system wasn’t just doing a familiar task a bit better. It was producing a kind of competence that didn’t fit especially neatly into human categories for judging competence in the first place. And when that happens, we tend to reach for the same fallback questions: what would a strong human do here? what would good reasoning look like to us? what shape should real understanding take?

I think some version of that is still all over how we talk about AI now, especially in business settings.

Once AI becomes something companies are actually buying and deploying, the language around it gets very practical very quickly. It becomes a copilot, an assistant, an agent, a support tool, a coding tool, a productivity layer. The framing is usually pretty straightforward: how much human work does this help with? Does it save time? Does it reduce labor? Does it improve output on tasks we already care about?

Again, there’s nothing crazy about looking at it this way. It’s probably the default way any organization would look at it. But it also nudges attention in a certain direction. It makes AI easiest to value when it looks like labor — when it writes the email, summarizes the meeting, handles the support ticket, drafts the memo, suggests the code. Those things map pretty cleanly onto jobs, budgets, teams, and metrics. They fit the managerial imagination.

And to be fair, that framing does capture something real. In McKinsey’s 2025 global survey (The State of AI: How organizations are rewiring to capture value, 2025), 78 percent of respondents said their organizations were using AI in at least one business function, and 71 percent said they were regularly using generative AI. At the same time, only 21 percent said their organizations had redesigned even some workflows around it, and more than 80 percent said generative AI still had no tangible impact on enterprise-level EBIT. So adoption is moving quickly, but deeper organizational change seems to be moving much more slowly.

I don’t think that’s just a lag. I think it also says something about what companies know how to see.

When AI looks like a worker, or at least like software standing in for a worker, it’s relatively easy to evaluate. You can ask whether it saved time, reduced costs, improved output, or helped less experienced people perform more like the best people on the team. In a field study of more than 5,000 customer-support agents at a Fortune 500 software company (Brynjolfsson, Li, and Raymond, “Generative AI at Work”, NBER Working Paper 31161, 2023), access to a generative AI assistant raised productivity by about 15 percent on average, with the biggest gains going to newer and lower-skilled workers. That’s a meaningful result. In settings like that, AI looks less like a wholesale replacement for labor and more like a way of spreading the tacit knowledge of top performers across a broader workforce.

But this way of looking at things also trains attention pretty narrowly. It encourages us to ask whether AI improves existing tasks, inside existing roles, using existing standards of performance. That’s a perfectly reasonable question. It’s just not the only one, and maybe not the most important one.

Some of the more revealing cases are the ones that don’t fit that frame very well.

In a pre-registered study of 758 BCG consultants (Dell’Acqua et al., “Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality”, Harvard Business School Working Paper 24-013, 2023), participants using GPT-4 completed more tasks, finished them faster, and produced higher-quality work — but only on tasks that fell within the model’s capability frontier. On tasks outside that frontier, the same participants actually did worse than those without AI access. In related work, GPT-4 improved certain kinds of creative product-innovation performance while degrading performance on some business problem-solving tasks.

People sometimes describe this as a jagged technological frontier, which seems right to me. But a lot of the ways we evaluate these systems flatten the picture. They can make us miss both how capable the systems are in some domains and how sharply they fall off in others. A system can seem broadly competent in aggregate while still being unexpectedly strong in some places and surprisingly fragile in others, and those boundaries may not be especially visible to the people using it.

The 2025 METR study of experienced open-source developers (“Early 2025 AI tools slow down experienced open source developers”, METR, 2025) makes a similar point from a different angle. Before the experiment, both the developers and outside experts expected AI tools to speed the work up substantially. Instead, early-2025 frontier tools made experienced developers working on their own mature repositories about 19 percent slower.

That result is interesting, but not simply because it’s negative. What it suggests is that even a benchmark as straightforward-sounding as “time to complete the task” can hide a lot. Those developers were not just typing code into a box. They were moving through years of accumulated context, conventions, tradeoffs, local architecture, and background knowledge. The work only looks like a clean task if you abstract away most of what makes it real.

That broader issue comes up a lot with AI. Our evaluative frameworks often make the work look simpler than it is before the system has even been judged. They flatten complex activity into something legible, then measure success inside that simplified frame.

History gives a larger version of the same pattern. Electrification transformed industry, but the biggest effects didn’t come from swapping out steam power for electric power while leaving everything else basically intact. The larger gains came later, once factories reorganized around the new technology. Paul David’s well-known work on electrification (“The Dynamo and the Computer: An Historical Perspective on the Modern Productivity Paradox”, American Economic Review, 1990) makes exactly this point: widespread diffusion came first, and the real productivity surge followed much later, once production itself had been restructured. More recent studies of U.S. manufacturing find that electricity changed not only productivity, but capital intensity, workforce composition, and the complexity of production itself.

That feels relevant here. New technologies often arrive by fitting themselves into the old order. Their deeper effects only become visible once the order starts rearranging around them.

AI may turn out to follow a similar path.

Right now, the easiest way to understand it is as labor-like software. That’s how markets buy it, how managers deploy it, and how workers feel its pressure. But the most important effects may not come from AI becoming a somewhat better writer, analyst, coder, or support agent. They may come from changing what counts as a workflow in the first place, how decisions get made, how knowledge is structured, what kinds of coordination are still necessary, and which parts of an organization continue to deserve the shape they currently have.

That’s a different kind of claim from “AI will automate jobs,” even though the two are obviously related. One is about substitution. The other is about reorganization. We have much better language for the first than for the second.

That’s part of why AlphaFold still feels like the right sort of example to me. The interesting measure of its impact is not whether its biological reasoning resembles that of a brilliant PhD student. It’s whether entire classes of scientific questions became easier to ask, easier to test, and easier to integrate into ordinary research practice. How much of the protein-structure landscape became newly accessible? What moved from heroic to routine? What became queryable that had previously been out of reach?

Those are structural questions. They aren’t really about whether the machine performed a familiar task well. They’re about whether the possibility space itself changed.

And in most domains, we still don’t have especially good language for that. We have productivity metrics, time-on-task metrics, quality scores, benchmark suites, and head-to-head comparisons with human performance. What we have much less of is a durable way to notice when the structure of work has shifted beneath us — when AI hasn’t simply done the old task faster, but has altered which tasks matter, which roles cohere, which processes are worth preserving, or which questions can be asked in the first place.

Part of understanding AI clearly will require building better language for that sort of change.

The problem here isn’t only economic. It’s cognitive too. We tend to trust intelligence when it arrives in a human register. We like systems that can narrate their steps, explain themselves in familiar terms, and produce outputs that feel like competent thought as we already know it. But a more powerful nonhuman intelligence may not become more intuitively satisfying as it becomes more capable. In some domains, it may become less so. It may look stranger, not more reassuring.

That, to me, is the recurring pattern underneath all of this. With AlphaGo, with AlphaFold, with the BCG experiments, and even with the METR result, the problem wasn’t simply that people were wrong. It was that they were using standards that had mostly been built for comparing humans to other humans. Those standards came out of our biology, our institutions, our labor markets, and our social life. They aren’t useless. But they aren’t neutral either, and they aren’t guaranteed to hold up when what we’re evaluating stops looking very much like us.

The challenge ahead is not just that AI may become more capable than we are in some domains. It’s that we may keep misunderstanding where the important changes are because we keep looking for competence in the forms we already know how to reward. We may overvalue the systems that imitate human work most cleanly and miss the ones that are changing the structure of the work itself.