One of the most frustrating things about the potential AI bubble, for me, was a very smart researcher being incredibly bullish on AI on Twitter because, if you extrapolate the graphs measuring AI's ability to complete long-duration tasks (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...) or other benchmarks, then by 2026 or 2027 you've basically invented AGI.
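To make the extrapolation concrete, here's roughly the arithmetic being run. This is a sketch of the argument, not anyone's official forecast: the ~7-month doubling time and the ~1-hour starting horizon are illustrative assumptions on my part, not figures I'm asserting from the post.

    # Back-of-the-envelope version of the "straight line to AGI" argument.
    # Assumptions (mine, illustrative): the task horizon models can handle at
    # 50% success doubles every ~7 months, and is ~1 hour at the reference date.
    doubling_months = 7
    start_year, start_month = 2025, 3
    horizon_hours = 1.0

    for months_ahead in range(0, 37, 6):
        hours = horizon_hours * 2 ** (months_ahead / doubling_months)
        year = start_year + (start_month - 1 + months_ahead) // 12
        month = (start_month - 1 + months_ahead) % 12 + 1
        print(f"{year}-{month:02d}: ~{hours:.0f} hour tasks at 50% success")

    # Keep extending the line and you eventually reach week- and month-long
    # tasks; whether that counts as "basically AGI" by 2026 or 2027 depends
    # entirely on the assumed doubling time and on the benchmark behind the
    # curve being a good proxy for real capability.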
I'm going to take his statements at face value and assume that he really does have faith in his own predictions and isn't trying to fleece us.
My gripe is that this prediction is based on proxies for capability that aren't particularly reliable. To elaborate: the latest frontier models score something like 65% on SWE-bench, but I don't think they're as capable as a human who also scored 65%. That isn't to say they're incapable, just that they aren't as capable as an equivalent human. I think there's a very real chance that a model absolutely crushes SWE-bench and still isn't ready to function as an independent software engineering agent.
So a lot of this bullishness basically comes down to extrapolating some line on a graph into the future until, by next year or the year after, all white-collar work can be automated. Terrifying as that is, it all hinges on the idea that these graphs, these benchmarks, are good proxies. And if they aren't, oh wow.
There's a huge disconnect between what the benchmarks show and the day-to-day experience of those of us actually using LLMs. According to SWE-bench, I should be able to outsource a lot of tasks to LLMs by now; practically speaking, I can't get them to reliably do even the most basic ones. Benchmaxxing is a real phenomenon. Internal private assessments are the most accurate source of information we have, and those seem to be quite mixed for the most recent models.
How ironic that these LLMs appear to be overfitting to the benchmark scores. Presumably these researchers deal with overfitting every day, but they can't recognize it right in front of them.
I'm sure they all know it's happening. But the incentives are all misaligned. They get promotions and raises for pushing the frontier which means showing SOTA performance on benchmarks.
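To be clear about the mechanism: even without anyone being dishonest, just trying many checkpoints and prompt tweaks and reporting the best one on a fixed benchmark inflates the headline number. A toy simulation of that selection effect, where every parameter below is a made-up assumption for illustration rather than anyone's actual process:

    # Toy model of "benchmaxxing": every variant has the same true capability,
    # but picking the best of many noisy benchmark runs inflates the reported
    # score. All numbers here are illustrative assumptions.
    import random

    random.seed(0)
    TRUE_PASS_RATE = 0.50   # assumed real capability of every variant
    N_TASKS = 500           # assumed benchmark size
    N_VARIANTS = 200        # assumed checkpoints / prompt tweaks tried

    def benchmark_score(pass_rate, n_tasks):
        # Fraction of tasks "solved" in one noisy evaluation run.
        return sum(random.random() < pass_rate for _ in range(n_tasks)) / n_tasks

    scores = [benchmark_score(TRUE_PASS_RATE, N_TASKS) for _ in range(N_VARIANTS)]
    print(f"true capability:       {TRUE_PASS_RATE:.1%}")
    print(f"average variant score: {sum(scores) / len(scores):.1%}")
    print(f"best variant reported: {max(scores):.1%}")

Selection pressure on the metric does the inflating on its own; no one has to cheat.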
>> by next year or the year after all white-collar work can be automated
Work generates work. If you remove the need for 50% of the work then a significant amount of the remaining work never needs to be done. It just doesn't appear.
The software people use in their jobs will no longer be needed if those people aren't hired to do those jobs. There goes Slack, Teams, GitHub, Zoom, PowerPoint, Excel, whatever... And if the software isn't needed, it doesn't need to be written, by either a person or an AI. So any need for AI coders shrinks considerably.