This piece originally appeared at Hyperdimensional.
We have a pivotal year ahead of us on many different fronts. With that in mind, I’d like to tell you about the developments I expect in both AI and AI policy, as well as some of the things I will be monitoring even if I do not expect them to come to fruition in 2025. Let’s get to it.
Frontier AI in 2025
For a long time, it was fair to dispute whether large language models “really” reasoned. Are these statistical artifacts, these compressions of the internet, “thinking”? Or are they just predicting the next word? Do they “understand” anything deep about the world, or do they simply reflect statistical patterns in data?
It was always, in my view, a muddled debate. A raw, pretrained language model (with no reinforcement-learning-based post-training), trained to predict the next token of the entire internet, has to answer a simple but profound question: why is it that all the words ever written were written the way they were? “What does it mean to predict the next token well enough?” asked then–OpenAI chief scientist Ilya Sutskever in a 2023 interview. “It’s actually a much deeper question than it seems. Predicting the next token well means that you understand the underlying reality that led to the creation of that token…. In order to understand those statistics … you need to understand: What is it about the world that creates this set of statistics?” Of course, this process produced models that could sometimes reason.
But this understanding, and any associated reasoning, was profoundly uneven. The pretrained language model is, as many have observed, a kind of homunculus, picking up on untold numbers of human concepts and the relationships between them, and likely on fractal patterns in human communication that not even we notice. In retrospect, it is probably more appropriate to see the models we have been using for the last two years as a kind of feedstock of genuine intelligence rather than the thing itself. For the last year or so, my question has been: can we cultivate that feedstock and, somehow, turn it into something more useful? And to what extent?
Enter OpenAI o1 (and now, o3). These models, trained with reinforcement learning, blow away other models on most math and coding benchmarks, reaching levels of performance that surprised even many deep learning bulls. In my subjective experience, they are also among the very best at answering questions unrelated to math and coding, including pure humanities questions. Perhaps most intriguingly to me, o1-preview shows superior knowledge calibration to other models (knowing what it does know, and knowing what it does not know) on OpenAI’s recent SimpleQA benchmark, which is designed to test models for their propensity to hallucinate.
The o-series models, then, are among the first to demonstrate convincingly that such cultivation is possible. They will not be the last. Already, the jump from o1 to o3 (previewed, but not released, by OpenAI in late December) shows the significant performance gains to be had by scaling this reinforcement-learning-based approach.