By Our Measure

One of the things I keep coming back to with AI is how quickly we fall back on human comparisons.

As these systems improve, we naturally ask versions of the same question over and over: can it do the kinds of things people do? Can it reason well, write well, code well, make decisions well? Can it fit into the institutions and workflows we already have? A lot of the conversation, even now, still assumes that progress in AI will mostly show up as better performance on things we already recognize and already know how to value.

That makes sense up to a point. Human intelligence is the only intelligence we actually know from the inside. Human work is what our economy is organized around. Human institutions are the structures any new technology has to enter if it’s going to matter in practice. So of course those become the reference points.

Still, I think there’s a way this can quietly distort how we see what’s in front of us. What feels natural to us can start to feel objective. We forget that our benchmarks are coming from somewhere — from our own habits of mind, our own institutions, our own sense of what useful intelligence is supposed to look like.

The AlphaGo match against Lee Sedol (DeepMind, AlphaGo research overview) is still maybe the cleanest example of this. In 2016, AlphaGo played a move that professional commentators initially read as a mistake. It just didn’t look like the sort of move a great human player would make. Later it became clear that this reaction had a lot more to do with the limits of human expectation than with the quality of the move itself.

I don’t think the lesson there is that every weird machine output is secretly brilliant. Most weird machine output is just bad. The more interesting lesson is narrower than that. Sometimes a system is doing something genuinely effective, but because it doesn’t arrive in a form we’re used to recognizing, our first instinct is to misread it.

Something similar happened with AlphaFold, though in a very different domain. In the Nature paper on its CASP14 performance (Jumper et al., “Highly accurate protein structure prediction with AlphaFold”, Nature, 2021), AlphaFold reached a median backbone accuracy of 0.96 angstroms, compared with 2.8 angstroms for the next-best method. Its database later expanded to more than 200 million predicted protein structures. What seems important to me here is not whether AlphaFold “understands biology” in a way that resembles a strong human biologist. It’s that it generated scientific usefulness at a scale no human researcher or research team could have matched directly. Biology, for lack of a better way to put it, was responsive to what it produced.

That’s part of why these examples feel connected to me. In both cases, the system wasn’t just doing a familiar task a bit better. It was producing a kind of competence that didn’t fit especially neatly into human categories for judging competence in the first place. And when that happens, we tend to reach for the same fallback questions: what would a strong human do here? what would good reasoning look like to us? what shape should real understanding take?

I think some version of that is still all over how we talk about AI now, especially in business settings.

Once AI becomes something companies are actually buying and deploying, the language around it gets very practical very quickly. It becomes a copilot, an assistant, an agent, a support tool, a coding tool, a productivity layer. The framing is usually pretty straightforward: how much human work does this help with? Does it save time? Does it reduce labor? Does it improve output on tasks we already care about?

Again, there’s nothing crazy about looking at it this way. It’s probably the default way any organization would look at it. But it also nudges attention in a certain direction. It makes AI easiest to value when it looks like labor — when it writes the email, summarizes the meeting, handles the support ticket, drafts the memo, suggests the code. Those things map pretty cleanly onto jobs, budgets, teams, and metrics. They fit the managerial imagination.

And to be fair, that framing does capture something real. In McKinsey’s 2025 global survey (The State of AI: How organizations are rewiring to capture value, 2025), 78 percent of respondents said their organizations were using AI in at least one business function, and 71 percent said they were regularly using generative AI. At the same time, only 21 percent said their organizations had redesigned even some workflows around it, and more than 80 percent said generative AI still had no tangible impact on enterprise-level EBIT. So adoption is moving quickly, but deeper organizational change seems to be moving much more slowly.

I don’t think that’s just a lag. I think it also says something about what companies know how to see.

When AI looks like a worker, or at least like software standing in for a worker, it’s relatively easy to evaluate. You can ask whether it saved time, reduced costs, improved output, or helped less experienced people perform more like the best people on the team. In a field study of more than 5,000 customer-support agents at a Fortune 500 software company (Brynjolfsson, Li, and Raymond, “Generative AI at Work”, NBER Working Paper 31161, 2023), access to a generative AI assistant raised productivity by about 15 percent on average, with the biggest gains going to newer and lower-skilled workers. That’s a meaningful result. In settings like that, AI looks less like a wholesale replacement for labor and more like a way of spreading the tacit knowledge of top performers across a broader workforce.

But this way of looking at things also trains attention pretty narrowly. It encourages us to ask whether AI improves existing tasks, inside existing roles, using existing standards of performance. That’s a perfectly reasonable question. It’s just not the only one, and maybe not the most important one.

Some of the more revealing cases are the ones that don’t fit that frame very well.

In a pre-registered study of 758 BCG consultants (Dell’Acqua et al., “Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality”, Harvard Business School Working Paper 24-013, 2023), participants using GPT-4 completed more tasks, finished them faster, and produced higher-quality work — but only on tasks that fell within the model’s capability frontier. On tasks outside that frontier, the same participants actually did worse than those without AI access. In related work, GPT-4 improved certain kinds of creative product-innovation performance while degrading performance on some business problem-solving tasks.

People sometimes describe this as a jagged technological frontier, which seems right to me. But a lot of the ways we evaluate these systems flatten the picture. They can make us miss both how capable the systems are in some domains and how sharply they fall off in others. A system can seem broadly competent in aggregate while still being unexpectedly strong in some places and surprisingly fragile in others, and those boundaries may not be especially visible to the people using it.

The 2025 METR study of experienced open-source developers (“Early 2025 AI tools slow down experienced open source developers”, METR, 2025) makes a similar point from a different angle. Before the experiment, both the developers and outside experts expected AI tools to speed the work up substantially. Instead, early-2025 frontier tools made experienced developers working on their own mature repositories about 19 percent slower.

That result is interesting, but not simply because it’s negative. What it suggests is that even a benchmark as straightforward-sounding as “time to complete the task” can hide a lot. Those developers were not just typing code into a box. They were moving through years of accumulated context, conventions, tradeoffs, local architecture, and background knowledge. The work only looks like a clean task if you abstract away most of what makes it real.

That broader issue comes up a lot with AI. Our evaluative frameworks often make the work look simpler than it is before the system has even been judged. They flatten complex activity into something legible, then measure success inside that simplified frame.

History gives a larger version of the same pattern. Electrification transformed industry, but the biggest effects didn’t come from swapping out steam power for electric power while leaving everything else basically intact. The larger gains came later, once factories reorganized around the new technology. Paul David’s well-known work on electrification (“The Dynamo and the Computer: An Historical Perspective on the Modern Productivity Paradox”, American Economic Review, 1990) makes exactly this point: widespread diffusion came first, and the real productivity surge followed much later, once production itself had been restructured. More recent studies of U.S. manufacturing find that electricity changed not only productivity, but capital intensity, workforce composition, and the complexity of production itself.

That feels relevant here. New technologies often arrive by fitting themselves into the old order. Their deeper effects only become visible once the order starts rearranging around them.

AI may turn out to follow a similar path.

Right now, the easiest way to understand it is as labor-like software. That’s how markets buy it, how managers deploy it, and how workers feel its pressure. But the most important effects may not come from AI becoming a somewhat better writer, analyst, coder, or support agent. They may come from changing what counts as a workflow in the first place, how decisions get made, how knowledge is structured, what kinds of coordination are still necessary, and which parts of an organization continue to deserve the shape they currently have.

That’s a different kind of claim from “AI will automate jobs,” even though the two are obviously related. One is about substitution. The other is about reorganization. We have much better language for the first than for the second.

That’s part of why AlphaFold still feels like the right sort of example to me. The interesting measure of its impact is not whether its biological reasoning resembles that of a brilliant PhD student. It’s whether entire classes of scientific questions became easier to ask, easier to test, and easier to integrate into ordinary research practice. How much of the protein-structure landscape became newly accessible? What moved from heroic to routine? What became queryable that had previously been out of reach?

Those are structural questions. They aren’t really about whether the machine performed a familiar task well. They’re about whether the possibility space itself changed.

And in most domains, we still don’t have especially good language for that. We have productivity metrics, time-on-task metrics, quality scores, benchmark suites, and head-to-head comparisons with human performance. What we have much less of is a durable way to notice when the structure of work has shifted beneath us — when AI hasn’t simply done the old task faster, but has altered which tasks matter, which roles cohere, which processes are worth preserving, or which questions can be asked in the first place.

Part of understanding AI clearly will require building better language for that sort of change.

The problem here isn’t only economic. It’s cognitive too. We tend to trust intelligence when it arrives in a human register. We like systems that can narrate their steps, explain themselves in familiar terms, and produce outputs that feel like competent thought as we already know it. But a more powerful nonhuman intelligence may not become more intuitively satisfying as it becomes more capable. In some domains, it may become less so. It may look stranger, not more reassuring.

That, to me, is the recurring pattern underneath all of this. With AlphaGo, with AlphaFold, with the BCG experiments, and even with the METR result, the problem wasn’t simply that people were wrong. It was that they were using standards that had mostly been built for comparing humans to other humans. Those standards came out of our biology, our institutions, our labor markets, and our social life. They aren’t useless. But they aren’t neutral either, and they aren’t guaranteed to hold up when what we’re evaluating stops looking very much like us.

The challenge ahead is not just that AI may become more capable than we are in some domains. It’s that we may keep misunderstanding where the important changes are because we keep looking for competence in the forms we already know how to reward. We may overvalue the systems that imitate human work most cleanly and miss the ones that are changing the structure of the work itself.