Creativity, World Models and the Old Brain

Since the first bizarre convolutional neural network images began pumping out of Google DeepDream a decade ago, I’ve been interested in whether we’re on a path to AI being truly creative. With LLMs it felt like we were getting close. The jump from early GPT models to ChatGPT was astounding, and it felt like, if we kept accelerating, truly creative AI was inevitable.
2025 has seen some impressive advances. It feels appropriate that the year is bookended by DeepSeek-R1 at the opening and Opus 4.5 at the close. But I’m increasingly convinced that the dream of AI-driven creativity is going to remain a dream no matter how well we build LLMs.
However, I also think there’s a path to something far more interesting, potentially through world models. As a bit of a final wrap of the year, I thought it might be valuable to pull together various threads around AI research, AI reality, human creativity, and where the opportunity lies for people working in the creative world.
The map and the territory
An LLM predicts tokens. It learns statistical patterns in language, in text about the world. LLMs are remarkably good at this, good enough that the outputs can feel like understanding. A world model is a different approach to building AI. It’s an internal representation that simulates and predicts the world itself. Not descriptions of things, but the things themselves.
Jeff Hawkins’ recent book A Thousand Brains makes this distinction vivid and understandable. His theory is that the neocortex is essentially a collection of world models – thousands of them, each learning the structure of objects and concepts through spatial reference frames. When you understand a coffee cup, you don’t just have a description. You have a model of how it feels from different angles, how it behaves when tilted, where the handle is relative to the rim. You can rotate it in your mind.
And these models aren’t limited to physical objects. Your neocortex has a model of a chair, but it also has a model of democracy, of a specific song and a good joke. Concepts get the same treatment as coffee cups: structured representations that you can manipulate, test, and reconfigure.
LLMs learn the map. World models learn the territory.
Tokens (the basis of today’s LLMs) are symbols. Arbitrary signifiers that point at meanings but don’t contain meaning. When you read “the ball rolled under the table,” you understand it because you have a grounded model of balls, rolling, tables, and spatial relations, built from years of experience.
An LLM processes that sentence without any of that grounding. It’s learning patterns in how people talk about balls rolling under tables. The map, never the territory.
For a chatbot, just knowing the map is fine most of the time. Take a bunch of input tokens, calculate the next one. For creativity, though, you need to understand the territory.
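To make the distinction concrete, here’s a minimal toy sketch (illustrative code only, not any real system): one side picks the next word purely from co-occurrence statistics in text, the other steps an explicit state forward and reads a description off it.

```python
from collections import Counter, defaultdict

# The map: next-word prediction from text statistics alone.
corpus = "the ball rolled under the table . the ball rolled off the table .".split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most frequent continuation seen in the corpus. No ball exists here."""
    return bigrams[token].most_common(1)[0][0]

# The territory: an explicit (toy) state that can be simulated and then described.
def simulate_ball(x: float, vx: float, table_x: float, steps: int) -> str:
    """Roll a ball along a line and report where it ends up relative to the table."""
    for _ in range(steps):
        x += vx
    return "under the table" if x >= table_x else "short of the table"

print(predict_next("ball"))             # 'rolled' – a pattern in the words
print(simulate_ball(0.0, 0.5, 2.0, 5))  # 'under the table' – a fact about the modelled state
```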
There’s a counter-argument to this “word calculator” view that’s worth taking seriously: language isn’t just arbitrary symbols. It’s a lossy compression of human experience. Billions of people describing physical interactions, spatial relationships, causal chains. The structure of reality leaves fingerprints in how we talk about it. Maybe if you train on enough text, you learn the latent structure that generated it?
The honest answer is that we don’t fully know. But I’d argue that outputs looking similar doesn’t mean the process is similar. And this isn’t a new debate; it’s been kicking around AI research for decades.
Enter video models
In 2025, Sora and its successors have reignited this debate. OpenAI even called Sora a “world simulator.” To generate coherent video, these models must track object permanence, physics, spatial consistency, causality. Things don’t just disappear when occluded. Gravity works. Camera movements reveal coherent geometry. The justification for calling this a world model is that you can’t generate plausible fifteen-second videos without learning something about how the world works.
The case against these models having a true concept of our world is pretty strong, though. Even in the most advanced video models you’ll see objects morphing or duplicating, gravity working inconsistently, and impossible spatial transformations. These failures reveal that Sora and friends have learned what videos look like, but not what happens in the world. It’s pattern-matching on visual trajectories, not reasoning about physics.
When you catch a ball, you’re not predicting pixels. You’re tracking an object with position and velocity through space, with implicit models of gravity and your own motor capabilities. The representation is abstract. Object, trajectory, force. Not low-level sensory reconstruction.
This is what’s missing from video models. There’s no object representation. No explicit sense that “there is a ball, it has velocity, gravity acts on it.” Just patterns of pixels that tend to flow in certain ways.
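Here’s what that abstract representation looks like in toy form – an explicit object with position, velocity and gravity that you can roll forward to decide where to put your hand. The names and numbers are illustrative only; no video model works this way, which is exactly the point.

```python
from dataclasses import dataclass

GRAVITY = -9.81  # m/s^2

@dataclass
class Ball:
    x: float   # horizontal position (m)
    y: float   # height above your hand (m)
    vx: float  # horizontal velocity (m/s)
    vy: float  # vertical velocity (m/s)

def step(ball: Ball, dt: float) -> Ball:
    """Advance the explicit state one timestep under gravity."""
    return Ball(
        x=ball.x + ball.vx * dt,
        y=ball.y + ball.vy * dt + 0.5 * GRAVITY * dt**2,
        vx=ball.vx,
        vy=ball.vy + GRAVITY * dt,
    )

def predict_catch_point(ball: Ball, dt: float = 0.01) -> float:
    """Roll the state forward until the ball comes down to hand height (y = 0)."""
    while ball.y > 0:
        ball = step(ball, dt)
    return ball.x  # where to put your hand

print(predict_catch_point(Ball(x=0.0, y=2.0, vx=3.0, vy=4.0)))  # roughly 3.5 metres ahead
```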
The creativity question
LLMs can produce novel combinations. Token sequences that have never appeared together before. This happens constantly. Statistically, almost everything an LLM generates is technically novel. But that’s not what we mean by creativity.
Real creative insight requires something else entirely. Creativity is about recognising that something doesn’t fit existing models, constructing a new conceptual frame that resolves it, and then knowing you’ve done something important.
The first step – noticing anomalies – LLMs might approximate through pattern irregularity. The second step is where world models matter. You need structured representations you can reconfigure, not just token sequences you can shuffle. An LLM is randomly assembling tiles. A world model could, in principle, redesign the mosaic.
But even world models might not get you to that third step of recognition. Recognising that you’ve constructed something genuinely new – that this idea matters and that one doesn’t – requires something beyond modelling. It requires a sense of significance. A felt sense that the anomaly has been resolved in a way that matters.
This might be where the architecture of intelligence runs into a harder problem: even world models won’t have the drive to create, because ultimately, and despite all the post-training in the world, AI models lack true motivation. They have no drive.
The old brain
Back to the Thousand Brains theory. Hawkins claims that the neocortex doesn’t decide to model the world. The limbic system, brainstem and hypothalamus – evolutionarily older structures – generate the drives. Hunger, fear, curiosity, social bonding. The neocortex is in service of those drives from the old brain.
AI has no equivalent to the old brain. The “goal” is next-token prediction because that’s the training objective, not because the system wants anything. There’s no felt dissatisfaction when it’s wrong. No curiosity pulling it toward understanding.
There is an argument that we can engineer motivation through reward functions. But a reward function creates compliance, not drive. A directed system pursues insights when told to, stops when told to, and has no preference either way. It wouldn’t return to a problem unbidden. There’s no 3am nagging sense that something doesn’t fit, that a solution could be more elegant.
For creative work, this matters. Genuine scientific and artistic breakthroughs often come from caring about an anomaly, being bothered by something that doesn’t fit, and returning to it repeatedly. You can direct an AI to explore a domain. But you can’t make it care.
The diversity problem
Even with world models and engineered motivation, there’s another gap: judgment. Good judgment on open-ended problems requires diverse world models. Different experiences, cultures, values, histories. When we debate democracy or talk about a song, we’re not converging on truth. We’re negotiating between genuinely different perspectives grounded in genuinely different lives.
A population of AIs trained on similar data, by similar teams, with similar architectures, has correlated blind spots. Agreement that looks like consensus but is actually homogeneity.
There’s been plenty of debate this year about whether AI can have taste. And considered through this lens it’s a pointless question. A token-based AI can’t even represent the concept of taste. It has no model of aesthetic value, no accumulated experience of what works and what doesn’t, no felt sense of quality, and no true drive to create. In the very human mental model of modern art as “my five-year-old could do that”, an LLM isn’t even at the cognitive level of your five-year-old.
We could, in theory, build diverse AI through varied training. In practice it’s expensive, and not aligned to the goals of Silicon Valley, Middle Eastern sovereign wealth and the AI hype machine. The people building these systems are culturally homogeneous, with incentives toward uniformity, not diversity. Even then, what experiences would you give an AI to develop diverse judgment? It doesn’t have a body, a childhood, a death to anticipate. The experiences that shape human diversity might be constitutively unavailable to AI.
The limits of building
From working with and building dozens of AI experiments over the past few years, I’ve encountered two classes of AI problems.
The first is closed-loop and verifiable challenges, where reality gives immediate feedback that allows us to evaluate AI performance. Protein folding, robotics, chess. You can check whether outputs are right.
Text conversation is also in this category. When I use an LLM to draft an email or summarise text, I can verify whether it works. There’s a feedback loop. That’s why LLMs are useful for (but not replacing) a lot of knowledge work – not because they understand, but because the human in the loop can verify (at least when they actually do).
The second class is open-loop and contested challenges. Human judgment is required. Democracy design, ethics, policy, aesthetics. “Better” is debated, not measured. There’s no external referee. And even where there is a feedback loop, it’s happening over timeframes that are measured in hours or years, not the microseconds that reinforcement learning demands.
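The difference is easy to see in toy code. For a closed-loop problem you can hand the system a verifier and let reality referee; for an open-loop one there is simply no check function to pass in. Everything below is illustrative, not a real pipeline.

```python
import random
from typing import Callable, Optional

def closed_loop(generate: Callable[[], list[int]],
                verify: Callable[[list[int]], bool],
                attempts: int = 1000) -> Optional[list[int]]:
    """Keep proposing candidates until the verifier accepts one. Reality is the referee."""
    for _ in range(attempts):
        candidate = generate()
        if verify(candidate):
            return candidate
    return None

# Closed-loop: "is this permutation sorted?" is mechanically checkable, instantly.
items = [3, 1, 2]
propose = lambda: random.sample(items, k=len(items))
is_sorted = lambda xs: xs == sorted(xs)
print(closed_loop(propose, is_sorted))  # [1, 2, 3]

# Open-loop: there is no is_good_policy() to pass in. "Better" is argued, not computed,
# and the feedback arrives over months or years – far too slow and contested for RL.
```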
World models can generate and predict within closed-loop domains. Novel insights, even, if the feedback exists. But for open-loop problems, where “better” is contested rather than measurable, human arbitration remains essential. And that might be a stable division of labour, not a temporary one.
There’s a deeper question underneath all of this. We don’t have good theories of how creativity (or consciousness, felt experience, and motivation) arises in brains, let alone how to engineer it. Despite the hype, we’re not at “we know how but it’s hard.” We’re at “we don’t know if this is the kind of thing that can be built.”
Into 2026
The last few years have seen an unbelievable amount of energy and money spent chasing ghosts in the machine. Alignment researchers worrying about systems that don’t exist. Believers waiting for magic that isn’t coming. Meanwhile, the technology that actually exists, and which is insanely impressive, gets lost in the noise.
As 2025 closes out, I’ve lost confidence that LLMs can be creative in any true sense. The architecture doesn’t support it. Tokens aren’t concepts. Pattern matching isn’t insight. This isn’t a dismissal of the technology, but a clarification of what it actually is. The value is in learning to use LLMs as tools around creativity. Scaffolding, reflection, high-bandwidth sparring partners for your own thinking.
Going into 2026, I think world models may be where the interesting work happens. There may be a path where truly novel ideas can emerge from machines built with world models. It’s a long way between building a mental model of a chair and building one of a good joke. But the path may exist, and following it will be far more compelling than the magical thinking that’s flooded the discourse this year.
- December 2025