A ground-up introduction: why next-token prediction is not enough, what a world model actually is, and why learning to predict the future of an environment may be the next step toward grounded, agentic intelligence.
TL;DR. A language model learns the statistics of text. A world model learns the dynamics of an environment — given where you are and what you do, what happens next. This post starts from the familiar next-token objective, shows precisely where it stops being enough for agents that must act, and builds up the idea of a world model from first principles. It is Part 0 of a step-by-step series that will go from these basics down into the contemporary research literature.
I’m writing this series the way I wish someone had written it for me: starting at the very beginning and going deep, one step at a time. Each part is a standalone Markdown post — I’ll fold in formal definitions, intuition, figures, and, where relevant, my own proposals and notes. Everything is properly referenced, with full citations written plainly in the References at the end of each page, so that if any of this helps your own paper you can cite the primary sources directly (and this post too — see How to cite this post).
If you prefer to watch an excellent framing of the same shift, from large language models toward joint-embedding world models, the talk “From LLM to JEPA” is a great companion to this post [1].
A modern large language model (LLM) is, at heart, an autoregressive next-token predictor [2,3]. Given a sequence of tokens \(x_{1}, \dots, x_{t-1}\), it models the probability of the next one,
\[p_\theta(x_t \mid x_{1:t-1}),\]

and the probability of an entire sequence factorizes by the chain rule:
\[p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}).\]

Training minimizes the negative log-likelihood (equivalently, cross-entropy) of the next token,
\[\mathcal{L}_{\text{LM}}(\theta) = - \, \mathbb{E}_{x \sim \mathcal{D}} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).\]

This single objective, scaled up, produces astonishing competence. But notice what is being modeled: the distribution of human-generated text. An LLM is a model of what a person is likely to write next, not a model of what the world will do next. The two coincide only to the extent that text faithfully describes reality, which it often does loosely and sometimes not at all.
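To make the objective concrete, here is a minimal sketch in plain Python that evaluates the negative log-likelihood of one token sequence. The two toy "models" (`uniform_logits`, `copy_logits`) and the four-token vocabulary are assumptions made up for illustration; a real LLM would supply the logits from a trained network.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_nll(tokens, next_token_logits):
    """Sum over t of -log p(x_t | x_{1:t-1}) for one sequence.

    next_token_logits(prefix) is a hypothetical callable that returns
    one logit per vocabulary entry for the token following prefix.
    """
    nll = 0.0
    for t, token in enumerate(tokens):
        probs = softmax(next_token_logits(tokens[:t]))
        nll -= math.log(probs[token])
    return nll

VOCAB_SIZE = 4  # assumed toy vocabulary: token ids 0..3

def uniform_logits(prefix):
    # Toy model that ignores the prefix: every token equally likely.
    return [0.0] * VOCAB_SIZE

def copy_logits(prefix):
    # Toy model that strongly expects the previous token to repeat.
    logits = [0.0] * VOCAB_SIZE
    if prefix:
        logits[prefix[-1]] = 5.0
    return logits

sequence = [2, 2, 2, 2]
nll_uniform = sequence_nll(sequence, uniform_logits)
nll_copy = sequence_nll(sequence, copy_logits)
print(f"uniform: {nll_uniform:.3f}  copy: {nll_copy:.3f}")
```

The copy model gets a lower loss on this repetitive sequence because it puts more probability on what was actually written. Nothing in the computation refers to an environment, a state, or an action.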
For an agent that must perceive, decide, and act, three gaps open up:
This is the ceiling. To cross it we need a model whose variables are states and actions, not just words.
Informally, a world model is an internal, predictive model of how an environment evolves: a learned simulator you can query with “if the state is this and I do that, what happens?” The idea is old: self-supervised recurrent predictors of the environment [4] and model-based planning in reinforcement learning [5,6] are its direct ancestors.
Formally, most agentic settings are partially observed: the agent never sees the true state \(s_t\), only observations \(o_t\) (pixels, sensors). A world model introduces a compact latent state \(z_t\) and learns three pieces:
\[\underbrace{z_t \sim q_\phi(z_t \mid z_{t-1}, a_{t-1}, o_t)}_{\text{(1) encoder / inference}}, \qquad \underbrace{z_{t} \sim p_\theta(z_t \mid z_{t-1}, a_{t-1})}_{\text{(2) latent transition (the "dynamics")}},\]

\[\underbrace{\hat{o}_t \sim p_\theta(o_t \mid z_t), \quad \hat{r}_t \sim p_\theta(r_t \mid z_t)}_{\text{(3) decoders: reconstruct observation \& reward}}.\]

The contrast with an LLM is now sharp and worth stating side by side:
| | Language model | World model |
|---|---|---|
| Predicts | next token \(x_t\) | next state/observation \(z_{t+1}, o_{t+1}\) |
| Conditioned on | past tokens \(x_{<t}\) | past state and action \(z_t, a_t\) |
| Models the statistics of | human text | environment dynamics |
| “Reasoning” = | more token generation | rolling out futures in latent space |
| Native to | dialogue, code, retrieval | perception, planning, control |
The key new ingredient is the action \(a_t\): a world model is conditional on what the agent does. That is exactly the variable a language model lacks, and exactly the variable an agent needs.
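The three pieces above can be sketched as a toy, scalar-latent model in plain Python. This is an illustration under loud assumptions: the class name `ToyWorldModel`, the hand-set coefficients (0.9, 2.0, the 50/50 posterior blend), and the reward shape are all invented here and nothing is trained. The point is only the interface: the transition needs an action but no observation, while the encoder uses both.

```python
import random

class ToyWorldModel:
    """Scalar latent world model with hand-set (untrained) parameters."""

    def __init__(self, noise_std=0.0, seed=0):
        self.noise_std = noise_std
        self.rng = random.Random(seed)

    def _sample(self, mean):
        # Gaussian sample around the mean; noise_std=0 makes it deterministic.
        return mean + self.rng.gauss(0.0, self.noise_std)

    def transition(self, z_prev, a_prev):
        # (2) prior p(z_t | z_{t-1}, a_{t-1}): predict without seeing o_t.
        return self._sample(0.9 * z_prev + a_prev)

    def encode(self, z_prev, a_prev, obs):
        # (1) posterior q(z_t | z_{t-1}, a_{t-1}, o_t): blend the prior mean
        # with the latent implied by the observation (obs / 2 in this toy).
        prior_mean = 0.9 * z_prev + a_prev
        return self._sample(0.5 * prior_mean + 0.5 * (obs / 2.0))

    def decode_obs(self, z):
        # (3) mean of the observation decoder p(o_t | z_t).
        return 2.0 * z

    def decode_reward(self, z):
        # (3) mean of the reward decoder p(r_t | z_t); peaks at z = 1.
        return -(z - 1.0) ** 2

# Roll the model forward on actions alone: no observations are consumed.
model = ToyWorldModel(noise_std=0.0)
z = 0.0
imagined = []
for action in [0.5, 0.5, 0.0]:
    z = model.transition(z, action)
    imagined.append((z, model.decode_obs(z), model.decode_reward(z)))
print(imagined)
```

When a real observation does arrive, `encode` would replace `transition` for that step, pulling the latent back toward what was actually seen.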
The cleanest starting blueprint is Ha & Schmidhuber’s V–M–C decomposition [7]:
That last point is the magic: a trained dynamics model lets you generate experience in imagination, decoupling learning from expensive real-world interaction.
1. Sample efficiency: learning in a dream. If \(p_\theta(z_{t+1}\mid z_t,a_t)\) is accurate, the controller can be optimized on imagined rollouts instead of real ones. This is exactly the Dreamer line of work, which learns behaviors “by latent imagination” and scales to more than 150 tasks with a single configuration [9,10,11]. Concretely, the controller maximizes imagined return
\[\max_{\pi}\; \mathbb{E}_{p_\theta,\,\pi}\!\left[\sum_{\tau=t}^{t+H} \gamma^{\,\tau-t}\, \hat{r}_\tau \right],\]

with the whole rollout synthesized by the model over a horizon \(H\).
2. Planning & reasoning that is grounded. With a simulator inside its head, an agent can search over action sequences, compare outcomes, and choose — model-predictive control, rather than reflex.
3. Generalization & grounding. Forcing a network to predict consequences pressures it to discover the causal structure of its environment (objects, physics, agency) — representations that transfer.
4. A candidate path to autonomous machine intelligence. This is the thesis behind LeCun’s proposal for a modular, predictive, world-model-centric architecture as a route past the limits of pure text prediction [12].
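Points 1 and 2 above can be illustrated with a small sketch in plain Python. Everything here is a stand-in chosen for illustration: `dynamics` and `reward` play the role of a learned transition and reward head (deterministic, hand-written), the action set is a coarse grid, and planning is brute-force search rather than any particular published planner. The real environment is assumed to match the model exactly, which a learned model would not guarantee.

```python
from itertools import product

def dynamics(z, a):
    # Hypothetical learned transition mean; here simply z + a.
    return z + a

def reward(z):
    # Hypothetical learned reward head; peaks at z = 1.
    return -(z - 1.0) ** 2

def imagined_return(z0, policy, horizon, gamma=0.9):
    """Discounted sum of predicted rewards for tau = t, ..., t + H."""
    z, total = z0, 0.0
    for k in range(horizon + 1):
        total += (gamma ** k) * reward(z)
        z = dynamics(z, policy(z))
    return total

ACTIONS = (-0.5, 0.0, 0.5)  # assumed coarse action grid

def plan(z0, horizon=3):
    """Exhaustive search over short action sequences inside the model."""
    best_score, best_plan = float("-inf"), None
    for candidate in product(ACTIONS, repeat=horizon):
        z, score = z0, 0.0
        for a in candidate:
            z = dynamics(z, a)
            score += reward(z)
        if score > best_score:
            best_score, best_plan = score, candidate
    return best_plan, best_score

# Point 1: compare two controllers purely in imagination.
return_idle = imagined_return(0.0, lambda z: 0.0, horizon=3)
return_seek = imagined_return(0.0, lambda z: 0.5 * (1.0 - z), horizon=3)

# Point 2: model-predictive control, replanning after every executed action.
z = 0.0
trajectory = [z]
for _ in range(4):
    best_plan, _ = plan(z)
    z = dynamics(z, best_plan[0])  # real environment assumed equal to the model
    trajectory.append(z)
print(return_idle, return_seek, trajectory)
```

Only the first action of each plan is executed before replanning, which is what separates model-predictive control from committing to a whole open-loop sequence.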
There is a subtle but crucial design choice hiding in decoder (3) above. Do we predict the raw future observation (every pixel), or only a representation of it?
so the model is rewarded for capturing predictable structure and free to discard unpredictable noise [12,13]. This is the “From LLM to JEPA” shift in one equation: stop predicting every token/pixel, start predicting the abstract state that actually matters for acting.
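As a sketch of what the two options can look like side by side, compare a pixel-space loss with a latent-space one. The symbols \(f_\phi\), \(f_{\bar\phi}\), \(g_\psi\), the stop-gradient \(\mathrm{sg}\), and the squared-error distance are assumptions chosen for illustration; they are one common choice among several, not a fixed definition.

\[\mathcal{L}_{\text{pixel}} = \mathbb{E}\,\big\| \hat{o}_{t+1} - o_{t+1} \big\|_2^2 \qquad \text{vs.} \qquad \mathcal{L}_{\text{latent}} = \mathbb{E}\,\big\| g_\psi\big(f_\phi(o_t),\, a_t\big) - \mathrm{sg}\big[f_{\bar\phi}(o_{t+1})\big] \big\|_2^2 .\]

Here \(f_\phi\) encodes the current observation, \(g_\psi\) predicts the next embedding given the action, and \(f_{\bar\phi}\) is a target encoder (for example, a slowly updated copy of \(f_\phi\)) whose output is treated as a constant. Nothing in the second loss asks the model to reproduce detail that the target encoder has already discarded.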
A non-exhaustive map of where the field is, to orient the rest of the series:
The contemporary literature here is vast and moving fast; throughout the series I’ll treat recent papers as the best raw material for generating new research ideas. For a continuously-updated, annotated guide, see the companion World Models — Reading Map and the structured Deep Dives.
This was Part 0 — the why and the vocabulary. Upcoming parts (added one by one) will go deep, with derivations and code, into: latent-variable models and the VAE/ELBO; recurrent vs. Transformer dynamics; the full Dreamer objective; JEPA and energy-based learning; generative interactive worlds (Genie / Sora); evaluation; and open problems and proposals. Each will live in its own post under the World Models section of the blog.
Comments are open at the bottom of every post — feedback, corrections, and pointers to papers I should cover are very welcome.
If this series is useful for your work, please cite the primary sources listed plainly in the References below. To reference this post itself:
@misc{haque2026worldmodels0,
author = {Md Rezwanul Haque},
title = {World Models --- Part 0: From Language Models to World Models},
year = {2026},
howpublished = {\url{https://rezwan.xyz/blog/2026/world-models-introduction/}},
note = {Blog post, CPAMI Lab, University of Waterloo}
}