By Our Measure
As AI systems improve, we keep coming back to the same question: can they do the things humans do? Can they reason, write, code, make decisions well enough to fit into the institutions and workflows we already have? Most of the conversation assumes progress will show up as better performance on tasks we already know how to value.
Human intelligence is the only intelligence we know from the inside, and the economy is built around human work. So those become the reference points. But that doesn’t make them neutral. They reflect how we already think, what our institutions reward, and what we count as useful, not intelligence in some general sense.
The AlphaGo match against Lee Sedol is still the clearest example (DeepMind, AlphaGo research overview). In 2016, AlphaGo played a move that professional commentators initially called a mistake. It didn’t look like something a strong human would do. The move turned out to be excellent, and the misreading said more about human expectations than about the move. Most unusual machine output really is bad, and treating it that way is usually right. But sometimes a system is doing something effective in a form we don’t know how to read, and the first instinct is to call it broken.
AlphaFold makes the same point in biology. In its Nature paper on CASP14 (Jumper et al., “Highly accurate protein structure prediction with AlphaFold”, Nature, 2021), AlphaFold reached a median backbone accuracy of 0.96 angstroms compared with 2.8 for the next-best method. Its database later grew to more than 200 million predicted structures. Whether AlphaFold “understands biology” the way a biologist does is the wrong question. What matters is that it produced scientific output at a scale no single researcher could match. Whole classes of structural questions that used to require heroic individual effort became cheap and routine to ask.
In both cases, the system was effective in a way human observers weren’t prepared to recognize, and the mismatch made the results hard to read. The default response is a familiar one: what would a strong human do here?
Business language does the same thing. Once companies are buying AI, the vocabulary turns practical: copilot, assistant, agent, support tool, productivity layer. The questions are about labor: how much time it saves, how much work it offloads. That makes AI easiest to value when it looks like a worker, when it writes the email, summarizes the meeting, handles the ticket. Those map cleanly onto jobs, budgets, and metrics.
The framing isn’t wrong. In McKinsey’s 2025 global survey (McKinsey, The State of AI: How organizations are rewiring to capture value, 2025), 78 percent of respondents said their organizations were using AI in at least one business function, and 71 percent were regularly using generative AI. Only 21 percent had redesigned any workflows around it. More than 80 percent said generative AI still had no tangible impact on enterprise-level EBIT. Companies have adopted the tools faster than they’ve rebuilt around them.
Beyond normal organizational inertia, companies are simply better equipped to measure AI’s effect on existing tasks than to notice when it calls for different ones.
When AI looks like a worker, it’s easy to evaluate. In a field study of more than 5,000 customer-support agents at a Fortune 500 software company (Brynjolfsson, Li, and Raymond, “Generative AI at Work”, NBER Working Paper 31161, 2023), access to a generative AI assistant raised productivity by about 15 percent on average, with the largest gains going to newer and less experienced workers. AI looked less like a replacement for labor and more like a way to spread the tacit knowledge of top performers. The result matters, but it also points evaluators at a narrow question: does AI improve current tasks, inside current roles, by current standards? That question misses the cases where the problem is the job itself being the wrong frame, not whether AI is good at the job.
In a pre-registered study of 758 BCG consultants (Dell’Acqua et al., “Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality”, Harvard Business School Working Paper 24-013, 2023), participants using GPT-4 completed more tasks, finished faster, and produced higher-quality work, but only on tasks inside the model’s capability frontier. On a task outside it, consultants using AI did worse than those working without it. In related work, GPT-4 improved some kinds of creative product-innovation performance while degrading performance on certain business problem-solving tasks. What looks like aggregate competence is actually a jagged frontier: the same system can help on one task and hurt on a neighboring one, and users often can’t tell which side they’re on. Most evaluation frameworks smooth that unevenness out.
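A small arithmetic sketch shows how that smoothing happens. The task names and score changes below are invented for illustration and are not taken from the BCG study or any other source; the only point is that a single average can report a gain while one task underneath it gets worse.

```python
from statistics import mean

# Hypothetical per-task score changes (AI-assisted minus unassisted),
# in percentage points. Invented numbers, not data from any study.
score_change_by_task = {
    "draft product ideas": 25.0,
    "summarize interviews": 20.0,
    "write launch memo": 15.0,
    "diagnose business case": -20.0,
}


def aggregate_effect(changes):
    """Average change across tasks: the single number a summary reports."""
    return mean(changes.values())


def tasks_hurt(changes):
    """Tasks where assistance lowered the score, which the average hides."""
    return sorted(task for task, delta in changes.items() if delta < 0)


overall = aggregate_effect(score_change_by_task)
hurt = tasks_hurt(score_change_by_task)

print(f"aggregate change: {overall:+.1f} points")
print(f"tasks made worse: {hurt}")
```

With these made-up numbers the aggregate is a 10-point gain, even though one of the four tasks drops by 20 points. A per-task breakdown keeps the jagged edge visible; the average alone does not.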
The 2025 METR study of experienced open-source developers (METR, “Early 2025 AI tools slow down experienced open source developers”, 2025) makes a related point. Going in, both the developers and outside experts expected the AI tools to speed the work up substantially. Instead, early-2025 frontier tools made experienced developers working on their own mature repositories about 19 percent slower.
The 19 percent matters, but what the result implies about measurement matters more. Even something as simple-sounding as “time to complete the task” can hide a lot. Those developers were moving through years of accumulated context, conventions, tradeoffs, and local architecture. The work only looks like a clean task if you abstract away most of what makes it real. This is the general problem with AI evaluation: the framework simplifies the work first, then judges the system inside the simpler version.
Electrification shows the same problem at industrial scale. Paul David’s work on it (Paul A. David, “The Dynamo and the Computer: An Historical Perspective on the Modern Productivity Paradox”, American Economic Review, 1990) argues that the biggest effects of electricity didn’t come from swapping out steam power for electric and leaving the rest in place. Diffusion came first; the productivity surge came much later, after factories had reorganized production around the new technology. More recent studies of U.S. manufacturing find that electricity changed capital intensity, workforce composition, and the complexity of production as much as it changed output.
AI may have the same kind of lag. For now, it’s easiest to buy, sell, and manage as labor-like software: something that writes, analyzes, codes, or handles support. AI becoming a better version of any of those is real and ongoing. But the larger effects, if they come, will probably be organizational: changes to how work flows, how decisions get made, how knowledge is structured, and whether certain roles and handoffs are still worth keeping.
That’s a different claim from “AI will automate jobs,” even if the two are related. Substitution is the part we already have language for; reorganization is harder to talk about, and we’re behind on it.
The same bias works at a more basic level. We trust intelligence more when it arrives in a human register, when it narrates its steps, explains itself in familiar language, produces output that looks like thought as we already know it. A more capable nonhuman intelligence may not feel more satisfying as it gets more capable, and in some domains it may feel less so.
That’s the pattern across the examples. With AlphaGo, AlphaFold, the BCG experiments, and the METR result, the people doing the evaluating were using standards built to compare humans to other humans, and those standards shaped what they could see. The standards came out of human biology, institutions, labor markets, and social life. They work in most contexts, but they carry assumptions about what intelligence looks like that aren’t guaranteed to hold when the thing being evaluated stops looking much like us.
We have productivity metrics, time-on-task metrics, quality scores, benchmark suites, and head-to-head comparisons with human performance. We have much less for noticing when AI has changed the work itself: which tasks matter, which roles still hold together, which processes are worth preserving, which questions can now be asked at all.