World Models — Part 0: From Language Models to World Models

A ground-up introduction: why next-token prediction is not enough, what a world model actually is, and why learning to predict the future of an environment may be the next step toward grounded, agentic intelligence.

TL;DR. A language model learns the statistics of text. A world model learns the dynamics of an environment — given where you are and what you do, what happens next. This post starts from the familiar next-token objective, shows precisely where it stops being enough for agents that must act, and builds up the idea of a world model from first principles. It is Part 0 of a step-by-step series that will go from these basics down into the contemporary research literature.


Why this series

I’m writing this series the way I wish someone had written it for me: starting at the very beginning and going deep, one step at a time. Each part is a standalone Markdown post — I’ll fold in formal definitions, intuition, figures, and, where relevant, my own proposals and notes. Everything is properly referenced, with full citations written plainly in the References at the end of each page, so that if any of this helps your own paper you can cite the primary sources directly (and this post too — see How to cite this post).

If you prefer to watch an excellent framing of the same shift, from large language models toward joint-embedding world models, the talk “From LLM to JEPA” is a great companion to this post [1].


The triumph (and the ceiling) of language models

A modern large language model (LLM) is, at heart, an autoregressive next-token predictor [2, 3]. Given a sequence of tokens \(x_{1}, \dots, x_{t-1}\), it models the probability of the next one,

\[p_\theta(x_t \mid x_{1:t-1}),\]

and the probability of an entire sequence factorizes by the chain rule:

\[p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}).\]

Training minimizes the negative log-likelihood (equivalently, cross-entropy) of the next token,

\[\mathcal{L}_{\text{LM}}(\theta) = - \, \mathbb{E}_{x \sim \mathcal{D}} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).\]
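To make the chain rule and the loss concrete, here is a minimal sketch with a hand-written bigram table. The table, the `<s>` start marker, and the function names are all assumptions made for illustration; a bigram model conditions only on the previous token, so it is a crude stand-in for \(p_\theta(x_t \mid x_{1:t-1})\), not how an LLM is implemented.

```python
import math

# Hypothetical toy bigram table: p(next | previous). "<s>" marks sequence start.
BIGRAM = {
    "<s>": {"the": 0.5, "a": 0.5},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.25, "dog": 0.75},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
}

def sequence_log_prob(tokens):
    """Chain rule: log p(x_1..x_T) = sum over t of log p(x_t | x_{t-1})."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(BIGRAM[prev][tok])
        prev = tok
    return log_p

def nll(dataset):
    """Sample estimate of L_LM: mean over sequences of the summed token NLL."""
    return -sum(sequence_log_prob(seq) for seq in dataset) / len(dataset)

seq = ["the", "cat", "sat"]
prob = math.exp(sequence_log_prob(seq))   # 0.5 * 0.5 * 1.0
loss = nll([seq, ["a", "dog", "sat"]])
print(prob, loss)
```

Nothing in this computation refers to a state or an action: the only variables are tokens, which is the point the rest of this section builds on.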

This single objective, scaled up, produces astonishing competence. But notice what is being modeled: the distribution of human-generated text. An LLM is a model of what a person is likely to write next, not a model of what the world will do next. The two coincide only to the extent that text faithfully describes reality, which it often does only loosely and sometimes not at all.

For an agent that must perceive, decide, and act, three gaps open up:

  1. No grounded dynamics. Text rarely encodes the precise physical consequences of actions. “I pushed the glass” does not contain the trajectory, the friction, or the shatter.
  2. No native notion of action. The LM conditions on past tokens, not on an agent’s actions \(a_t\) and their effect on a state.
  3. Planning is implicit and ungrounded. “Reasoning” emerges as more token generation, with no internal simulator to roll out and compare candidate futures.

This is the ceiling. To cross it we need a model whose variables are states and actions, not just words.


So what is a world model?

Informally, a world model is an internal, predictive model of how an environment evolves: a learned simulator you can query with “if the state is this and I do that, what happens?” The idea is old: self-supervised recurrent predictors of the environment [4] and model-based planning in reinforcement learning [5, 6] are its direct ancestors.

Formally, most agentic settings are partially observed: the agent never sees the true state \(s_t\), only observations \(o_t\) (pixels, sensors). A world model introduces a compact latent state \(z_t\) and learns three pieces:

\[\underbrace{z_t \sim q_\phi(z_t \mid z_{t-1}, a_{t-1}, o_t)}_{\text{(1) encoder / inference}}, \qquad \underbrace{z_{t} \sim p_\theta(z_t \mid z_{t-1}, a_{t-1})}_{\text{(2) latent transition (the "dynamics")}},\] \[\underbrace{\hat{o}_t \sim p_\theta(o_t \mid z_t), \quad \hat{r}_t \sim p_\theta(r_t \mid z_t)}_{\text{(3) decoders: reconstruct observation \& reward}}.\]
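The three pieces can be sketched as a tiny hand-written 1-D model. Everything here is an assumption made for illustration (the class name, the `gain`, `prior_std` and `goal` parameters, and the linear update rules); in a real world model each piece is a learned network, and the encoder would return a distribution rather than just a mean.

```python
import random
from dataclasses import dataclass

@dataclass
class ToyWorldModel:
    """Hand-written 1-D stand-in for the three learned pieces (all hypothetical)."""
    gain: float = 0.5        # how much the encoder trusts the new observation
    prior_std: float = 0.1   # noise of the latent transition
    goal: float = 1.0        # predicted reward peaks when the latent is here

    def prior_mean(self, z_prev, a_prev):
        # (2) latent transition: the action shifts the latent state
        return z_prev + a_prev

    def transition(self, z_prev, a_prev, rng):
        # sample z_t ~ p(z_t | z_{t-1}, a_{t-1}); no observation needed
        return rng.gauss(self.prior_mean(z_prev, a_prev), self.prior_std)

    def encode(self, z_prev, a_prev, obs):
        # (1) inference q(z_t | z_{t-1}, a_{t-1}, o_t): correct the prior with o_t
        # (returns the posterior mean only, to keep the sketch short)
        prior = self.prior_mean(z_prev, a_prev)
        return prior + self.gain * (obs - prior)

    def decode(self, z):
        # (3) decoders: predicted observation and predicted reward
        obs_hat = z
        reward_hat = -abs(z - self.goal)
        return obs_hat, reward_hat

model = ToyWorldModel()
rng = random.Random(0)

z_prev, a_prev, obs = 0.0, 1.0, 1.2
z_post = model.encode(z_prev, a_prev, obs)        # uses the observation
z_prior = model.transition(z_prev, a_prev, rng)   # imagination only
obs_hat, reward_hat = model.decode(z_post)
print(z_post, z_prior, obs_hat, reward_hat)
```

The difference between `encode` and `transition` is the one that matters: the first needs a real observation, the second does not, which is what makes rollouts without the environment possible.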

The contrast with an LLM is now sharp and worth stating side by side:

| | Language model | World model |
|---|---|---|
| Predicts | next token \(x_t\) | next state/observation \(z_{t+1}, o_{t+1}\) |
| Conditioned on | past tokens \(x_{<t}\) | past state and action \(z_t, a_t\) |
| Models | the statistics of human text | environment dynamics |
| “Reasoning” = | more token generation | rolling out futures in latent space |
| Native to | dialogue, code, retrieval | perception, planning, control |

The key new ingredient is the action \(a_t\): a world model is conditional on what the agent does. That is exactly the variable a language model lacks, and exactly the variable an agent needs.


The anatomy of a world model

The cleanest starting blueprint is Ha & Schmidhuber’s V–M–C decomposition [7]:

[Figure: Environment → observation oₜ → Encoder (V): oₜ → zₜ → Dynamics (M): zₜ, aₜ → zₜ₊₁ → Controller (C): π(aₜ|zₜ) → action aₜ, which closes the loop. The loop can also be rolled out purely in imagination (zₜ → zₜ₊₁ → ⋯).]
Figure caption: The perceive → encode → predict → act loop. Once Dynamics (M) is trained, the agent can roll the loop forward without touching the real environment, "dreaming" trajectories to plan or to train the controller.

That last point is the magic: a trained dynamics model lets you generate experience in imagination, decoupling learning from expensive real-world interaction.
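A minimal sketch of that decoupling, assuming a toy 1-D environment and an already-trained, perfectly accurate dynamics function. `CountingEnv`, `encoder`, `dynamics`, and `controller` are hypothetical names and hand-written rules, not the components of the original paper. The environment counts how often it is stepped, so the cost of a real rollout and an imagined one can be compared directly.

```python
class CountingEnv:
    """Hypothetical 1-D environment that counts how often it is queried."""
    def __init__(self):
        self.state = 0.0
        self.steps = 0

    def step(self, action):
        self.steps += 1
        self.state += action
        return self.state  # observation o_t (here: the state itself)

def encoder(obs):              # V: o_t -> z_t (identity in this toy)
    return obs

def dynamics(z, a):            # M: (z_t, a_t) -> z_{t+1}, assumed already trained
    return z + a

def controller(z, goal=3.0):   # C: choose a_t from z_t, stepping toward a goal
    return max(-1.0, min(1.0, goal - z))

def real_rollout(env, horizon):
    z = encoder(env.state)
    latents = [z]
    for _ in range(horizon):
        a = controller(z)
        z = encoder(env.step(a))   # touches the environment
        latents.append(z)
    return latents

def imagined_rollout(z0, horizon):
    z = z0
    latents = [z]
    for _ in range(horizon):
        a = controller(z)
        z = dynamics(z, a)         # model only: no environment call
        latents.append(z)
    return latents

env = CountingEnv()
dream = imagined_rollout(encoder(env.state), horizon=5)
steps_after_dream = env.steps      # still 0: dreaming cost no real interaction
real = real_rollout(env, horizon=5)
print(dream, steps_after_dream, env.steps)
```

Here the dream matches reality only because the toy dynamics are exact; with a learned model the two trajectories drift apart as the horizon grows.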


Why world models matter

1. Sample efficiency: learning in a dream. If \(p_\theta(z_{t+1}\mid z_t,a_t)\) is accurate, the controller can be optimized on imagined rollouts instead of real ones. This is the Dreamer line of work, which learns behaviors “by latent imagination” [9, 10] and, in DreamerV3, handles more than 150 tasks with a single configuration [11]. Concretely, the controller maximizes imagined return

\[\max_{\pi}\; \mathbb{E}_{p_\theta,\,\pi}\!\left[\sum_{\tau=t}^{t+H} \gamma^{\,\tau-t}\, \hat{r}_\tau \right],\]

with the whole rollout synthesized by the model over a horizon \(H\).

2. Planning & reasoning that is grounded. With a simulator inside its head, an agent can search over action sequences, compare outcomes, and choose — model-predictive control, rather than reflex.

3. Generalization & grounding. Forcing a network to predict consequences pressures it to discover the causal structure of its environment (objects, physics, agency) — representations that transfer.

4. A candidate path to autonomous machine intelligence. This is the thesis behind LeCun’s proposal for a modular, predictive, world-model-centric architecture as a route past the limits of pure text prediction [12].
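Points 1 and 2 can be sketched together: score candidate action sequences by their discounted imagined return, then act on the best one. This is a toy under stated assumptions. `dynamics` and `reward` are hand-written stand-ins for a learned transition and reward decoder, the action set is a made-up discrete one, and exhaustive search replaces the learned controller or sampling-based planner a real system would use.

```python
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)   # hypothetical discrete action set
GAMMA = 0.5                  # discount factor

def dynamics(z, a):
    # Hypothetical stand-in for a learned transition p(z_{t+1} | z_t, a_t)
    return z + a

def reward(z, goal=2.0):
    # Hypothetical stand-in for the reward decoder
    return -abs(z - goal)

def imagined_return(z0, actions, gamma=GAMMA):
    """Sum over tau = t..t+H of gamma^(tau - t) * r_hat_tau for one rollout."""
    z = z0
    total = reward(z)                      # tau = t term, discount gamma**0
    for k, a in enumerate(actions, start=1):
        z = dynamics(z, a)                 # imagined next latent
        total += gamma ** k * reward(z)    # tau = t + k term
    return total

def plan(z0, horizon):
    """Score every action sequence of length `horizon` inside the model."""
    best_seq, best_ret = None, float("-inf")
    for seq in product(ACTIONS, repeat=horizon):
        ret = imagined_return(z0, seq)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq, best_ret

def mpc_action(z0, horizon=3):
    """Model-predictive control: plan a sequence, execute only its first action."""
    seq, _ = plan(z0, horizon)
    return seq[0]

z = 0.0
trajectory = [z]
for _ in range(4):
    z = dynamics(z, mpc_action(z))   # stand-in for acting in the real environment
    trajectory.append(z)
print(plan(0.0, 3), trajectory)
```

Replanning at every step is what separates this from executing one fixed plan: errors in the model only have to be tolerable over the short horizon \(H\). The expectation in the objective is dropped here because the toy model is deterministic.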


Predicting pixels vs. predicting representations

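There is a subtle but crucial design choice hiding in decoder (3) above. Do we predict the raw future observation (every pixel), or only a representation of it?

A joint-embedding predictive architecture (JEPA) takes the second route. An encoder maps the context \(x\) and the target \(y\) to representations \(s_x\) and \(s_y\), a predictor \(P\) guesses \(s_y\) from \(s_x\) and an additional variable \(v\) that conditions the prediction, and the error is measured in representation space rather than pixel space:

\[E(x, y) = \big\| \, s_y - P(s_x, v) \, \big\|^2 ,\]

so the model is rewarded for capturing predictable structure and free to discard unpredictable noise [12, 13]. This is the “From LLM to JEPA” shift in one equation: stop predicting every token/pixel, start predicting the abstract state that actually matters for acting.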
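A toy illustration of why the choice matters, under loud assumptions: each observation is a pair (signal, noise), the encoder is hand-written to keep only the signal, and \(v\) is treated as a known shift applied to it. None of this is taken from the JEPA papers, where the encoder and predictor are learned networks; it only shows how the same prediction scores differently in the two spaces.

```python
import random

def encode(obs):
    # Hypothetical encoder: keeps the predictable "signal" coordinate, drops the noise
    signal, _noise = obs
    return signal

def predictor(s_x, v):
    # Hypothetical predictor P(s_x, v): v is treated here as a known shift
    return s_x + v

def jepa_energy(x, y, v):
    # E(x, y) = || s_y - P(s_x, v) ||^2, with scalar representations
    return (encode(y) - predictor(encode(x), v)) ** 2

def pixel_error(x, y, v):
    # Pixel-space counterpart: must also guess the noise coordinate (best guess: 0)
    pred = (x[0] + v, 0.0)
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred))

rng = random.Random(0)
v = 1.0
x = (2.0, rng.gauss(0.0, 1.0))
y = (x[0] + v, rng.gauss(0.0, 1.0))   # signal moves predictably, noise is fresh

e_repr = jepa_energy(x, y, v)    # the predictable part is captured exactly
e_pixel = pixel_error(x, y, v)   # what remains is the squared noise in y
print(e_repr, e_pixel)
```

The pixel-space error can never reach zero here, however good the predictor, because part of the target is unpredictable by construction; the representation-space energy simply does not ask about that part.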


A quick tour of the landscape

A non-exhaustive map of where the field is, to orient the rest of the series:

The contemporary literature here is vast and moving fast; throughout the series I’ll treat recent papers as the best raw material for generating new research ideas. For a continuously-updated, annotated guide, see the companion World Models — Reading Map and the structured Deep Dives.


Where this series is going

This was Part 0 — the why and the vocabulary. Upcoming parts (added one by one) will go deep, with derivations and code, into: latent-variable models and the VAE/ELBO; recurrent vs. Transformer dynamics; the full Dreamer objective; JEPA and energy-based learning; generative interactive worlds (Genie / Sora); evaluation; and open problems and proposals. Each will live in its own post under the World Models section of the blog.

Comments are open at the bottom of every post — feedback, corrections, and pointers to papers I should cover are very welcome.


How to cite this post

If this series is useful for your work, please cite the primary sources listed plainly in the References below. To reference this post itself:

@misc{haque2026worldmodels0,
  author = {Md Rezwanul Haque},
  title  = {World Models --- Part 0: From Language Models to World Models},
  year   = {2026},
  howpublished = {\url{https://rezwan.xyz/blog/2026/world-models-introduction/}},
  note   = {Blog post, CPAMI Lab, University of Waterloo}
}

References

  1. AMI Labs. [JEPA, EBM, World Models] AMI Labs and the Architecture of Actionable World Models: From LLM to JEPA. YouTube, 2026. youtube.com/watch?v=UaHwJeCMzso
  2. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention Is All You Need. NeurIPS, 2017.
  3. T. B. Brown et al. Language Models are Few-Shot Learners. NeurIPS, 2020.
  4. J. Schmidhuber. Making the World Differentiable: On Using Self-Supervised Fully Recurrent Neural Networks for Dynamic Reinforcement Learning and Planning. Tech. Report FKI-126-90, TU Munich, 1990.
  5. R. S. Sutton. Dyna, an Integrated Architecture for Learning, Planning, and Reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.
  6. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction (2nd ed.). MIT Press, 2018.
  7. D. Ha and J. Schmidhuber. World Models / Recurrent World Models Facilitate Policy Evolution. NeurIPS, 2018. arXiv:1803.10122
  8. D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. ICLR, 2014. arXiv:1312.6114
  9. D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to Control: Learning Behaviors by Latent Imagination. ICLR, 2020. arXiv:1912.01603
  10. D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. Mastering Atari with Discrete World Models (DreamerV2). ICLR, 2021. arXiv:2010.02193
  11. D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering Diverse Domains through World Models (DreamerV3). 2023. arXiv:2301.04104
  12. Y. LeCun. A Path Towards Autonomous Machine Intelligence. OpenReview (v0.9.2), 2022. openreview.net
  13. M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA). CVPR, 2023. arXiv:2301.08243
  14. J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, et al. Genie: Generative Interactive Environments. ICML, 2024. arXiv:2402.15391
  15. OpenAI. Video Generation Models as World Simulators. 2024. openai.com
  16. P. Fung et al. Embodied AI Agents: Modeling the World. Meta, 2025. arXiv:2506.22355
  17. G. Savva, O. Michel, …, S. Xie. Solaris: Building a Multiplayer Video World Model in Minecraft. NYU, 2026. arXiv:2602.22208
  18. J. Zhang et al. World-in-World: World Models in a Closed-Loop World. ICLR 2026 (Oral). arXiv:2510.18135
  19. D. Chen, W. Chung, Y. Bang, Z. Ji, and P. Fung. WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning. Meta, 2025. arXiv:2506.04363