World Models — Part 0: From Language Models to World Models

A ground-up introduction: why next-token prediction is not enough, what a world model actually is, and why learning to predict the future of an environment may be the next step toward grounded, agentic intelligence.


TL;DR. A language model learns the statistics of text. A world model learns the dynamics of an environment — given where you are and what you do, what happens next. This post starts from the familiar next-token objective, shows precisely where it stops being enough for agents that must act, and builds up the idea of a world model from first principles. It is Part 0 of a step-by-step series that will go from these basics down into the contemporary research literature.

Why this series

I’m writing this series the way I wish someone had written it for me: starting at the very beginning and going deep, one step at a time. Each part is a standalone Markdown post — I’ll fold in formal definitions, intuition, figures, and, where relevant, my own proposals and notes. Everything is properly referenced, with full citations written plainly in the References at the end of each page, so that if any of this helps your own paper you can cite the primary sources directly (and this post too — see How to cite this post).

If you prefer to watch an excellent framing of the same shift, from large language models toward joint-embedding world models, the talk “From LLM to JEPA”¹ is a great companion to this post.

The triumph (and the ceiling) of language models

A modern large language model (LLM) is, at heart, an autoregressive next-token predictor²,³. Given a sequence of tokens \(x_{1}, \dots, x_{t-1}\), it models the probability of the next one,

\[p_\theta(x_t \mid x_{1:t-1}),\]

and the probability of an entire sequence factorizes by the chain rule:

\[p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}).\]

Training minimizes the negative log-likelihood (equivalently, cross-entropy) of the next token,

\[\mathcal{L}_{\text{LM}}(\theta) = - \, \mathbb{E}_{x \sim \mathcal{D}} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{1:t-1}).\]
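To make the objective concrete, here is a minimal sketch in plain Python. It assumes a toy bigram model (the context is truncated to the single previous token, which a real LLM does not do), and the corpus, the `<s>` start marker, and the add-alpha smoothing are all made up for illustration. It only shows how the chain-rule product becomes a sum of log-probabilities.

```python
import math
from collections import Counter, defaultdict

def train_bigram(corpus, vocab, alpha=1.0):
    """Estimate p(x_t | x_{t-1}) by counting, with add-alpha smoothing.

    A toy stand-in for p_theta: the context is only the previous token.
    """
    counts = defaultdict(Counter)
    for seq in corpus:
        prev = "<s>"  # hypothetical start-of-sequence marker
        for tok in seq:
            counts[prev][tok] += 1
            prev = tok

    def prob(tok, prev):
        total = sum(counts[prev].values()) + alpha * len(vocab)
        return (counts[prev][tok] + alpha) / total

    return prob

def sequence_nll(prob, seq):
    """Negative log-likelihood: -sum_t log p(x_t | x_{t-1})."""
    nll, prev = 0.0, "<s>"
    for tok in seq:
        nll -= math.log(prob(tok, prev))
        prev = tok
    return nll

corpus = [["the", "glass", "fell"], ["the", "glass", "broke"], ["the", "cat", "fell"]]
vocab = sorted({tok for seq in corpus for tok in seq})
prob = train_bigram(corpus, vocab)

seen = sequence_nll(prob, ["the", "glass", "fell"])    # word order found in the corpus
unseen = sequence_nll(prob, ["fell", "the", "glass"])  # same words, unfamiliar order
print(f"NLL seen={seen:.3f}  unseen={unseen:.3f}")
```

The familiar word order gets the lower loss. Nothing in the computation refers to glasses or falling, only to which tokens tend to follow which.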

This single objective, scaled up, produces astonishing competence. But notice what is being modeled: the distribution of human-generated text. An LLM is a model of what a person is likely to write next — not a model of what the world will do next. Those coincide only to the extent that text faithfully describes reality, which is often loosely, sometimes not at all.

For an agent that must perceive, decide, and act, three gaps open up:

No grounded dynamics. Text rarely encodes the precise physical consequences of actions. “I pushed the glass” does not contain the trajectory, the friction, or the shatter.
No native notion of action. The LM conditions on past tokens, not on an agent’s actions \(a_t\) and their effect on a state.
Planning is implicit and ungrounded. “Reasoning” emerges as more token generation, with no internal simulator to roll out and compare candidate futures.

This is the ceiling. To cross it we need a model whose variables are states and actions, not just words.

So what is a world model?

Informally, a world model is an internal, predictive model of how an environment evolves: a learned simulator you can query (“if the state is this and I do that, what happens?”). The idea is old: self-supervised recurrent predictors of the environment⁴ and model-based planning in reinforcement learning⁵,⁶ are its direct ancestors.

Formally, most agentic settings are partially observed: the agent never sees the true state \(s_t\), only observations \(o_t\) (pixels, sensors). A world model introduces a compact latent state \(z_t\) and learns three pieces:

\[\underbrace{z_t \sim q_\phi(z_t \mid z_{t-1}, a_{t-1}, o_t)}_{\text{(1) encoder / inference}}, \qquad \underbrace{z_{t} \sim p_\theta(z_t \mid z_{t-1}, a_{t-1})}_{\text{(2) latent transition (the "dynamics")}},\]

\[\underbrace{\hat{o}_t \sim p_\theta(o_t \mid z_t), \quad \hat{r}_t \sim p_\theta(r_t \mid z_t)}_{\text{(3) decoders: reconstruct observation \& reward}}.\]
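As a rough illustration of how pieces (1) to (3) fit together, the sketch below uses a scalar latent with hand-picked linear coefficients. The class name `ToyWorldModel`, every coefficient, and the goal value 1.0 are assumptions made for the example. In a real world model these functions are neural networks learned from data.

```python
import random
from dataclasses import dataclass

@dataclass
class ToyWorldModel:
    """Scalar latent world model; all coefficients are made-up illustration values."""
    a: float = 0.9       # how much of z_{t-1} persists
    b: float = 0.5       # effect of the previous action a_{t-1}
    k: float = 0.6       # how strongly the encoder trusts the observation o_t
    noise: float = 0.05  # standard deviation of the transition noise

    def prior_mean(self, z_prev, a_prev):
        return self.a * z_prev + self.b * a_prev

    def transition(self, z_prev, a_prev, rng):
        """(2) Sample z_t ~ p(z_t | z_{t-1}, a_{t-1}); never sees o_t."""
        return rng.gauss(self.prior_mean(z_prev, a_prev), self.noise)

    def encode(self, z_prev, a_prev, o_t):
        """(1) Mean of q(z_t | z_{t-1}, a_{t-1}, o_t): the prediction, corrected by o_t."""
        pred = self.prior_mean(z_prev, a_prev)
        return pred + self.k * (o_t - pred)

    def decode(self, z_t):
        """(3) Means of p(o_t | z_t) and p(r_t | z_t)."""
        o_hat = z_t              # identity observation model
        r_hat = -abs(z_t - 1.0)  # reward is highest when z_t reaches the goal 1.0
        return o_hat, r_hat

rng = random.Random(0)
wm = ToyWorldModel()
z_post = wm.encode(z_prev=0.0, a_prev=1.0, o_t=0.7)       # uses the observation
z_prior = wm.transition(z_prev=0.0, a_prev=1.0, rng=rng)  # does not
o_hat, r_hat = wm.decode(z_post)
print(z_post, z_prior, o_hat, r_hat)
```

The encoder and the transition estimate the same variable \(z_t\). The only difference is whether the current observation is available, which is why the transition alone is enough to roll the model forward when no observation exists yet.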

The contrast with an LLM is now sharp and worth stating side by side:

| | Language model | World model |
|---|---|---|
| Predicts | next token \(x_t\) | next state/observation \(z_{t+1}, o_{t+1}\) |
| Conditioned on | past tokens \(x_{<t}\) | past state and action \(z_t, a_t\) |
| Models the statistics of | human text | environment dynamics |
| “Reasoning” = | more token generation | rolling out futures in latent space |
| Native to | dialogue, code, retrieval | perception, planning, control |

The key new ingredient is the action \(a_t\): a world model is conditional on what the agent does. That is exactly the variable a language model lacks, and exactly the variable an agent needs.

The anatomy of a world model

The cleanest starting blueprint is Ha & Schmidhuber’s V–M–C decomposition⁷:

V — Vision. An encoder (classically a variational autoencoder⁸) compresses each high-dimensional observation \(o_t\) into a small latent code \(z_t\).
M — Memory. A recurrent (or, today, Transformer) dynamics model predicts the next latent given the current latent and action, \(p_\theta(z_{t+1}\mid z_t, a_t, h_t)\), carrying history in its hidden state \(h_t\).
C — Controller. A small policy \(\pi(a_t \mid z_t, h_t)\) chooses actions from the compressed state. Because V and M did the heavy lifting, C can be tiny.

Figure: the perceive → encode → predict → act loop. Once the dynamics model (M) is trained, the agent can roll the loop forward without touching the real environment, "dreaming" trajectories to plan or to train the controller.

That last point is the magic: a trained dynamics model lets you generate experience in imagination, decoupling learning from expensive real-world interaction.
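The loop can be sketched in a few lines. Everything here is an assumption made for illustration: `V`, `M`, and `C` are hand-written scalar functions, not the VAE, recurrent network, and evolved controller of the original paper. The point is only the control flow, in which one real observation is encoded and every later step comes from the model.

```python
def V(o_t):
    """Vision: compress an observation into a latent code (here a fixed rescaling)."""
    return 0.1 * o_t

def M(z_t, a_t, h_t):
    """Memory: predict the next latent and update the hidden state h_t."""
    h_next = 0.5 * h_t + 0.5 * z_t  # running summary of past latents
    z_next = 0.9 * z_t + 0.2 * a_t  # toy action-conditioned dynamics
    return z_next, h_next

def C(z_t, h_t, gain=2.0):
    """Controller: push the latent toward a target of 1.0, clipped to [-1, 1].

    h_t is accepted to mirror pi(a_t | z_t, h_t) but is unused in this toy policy.
    """
    return max(-1.0, min(1.0, gain * (1.0 - z_t)))

def dream(o_0, steps):
    """Encode one real observation, then roll forward using M only."""
    z, h = V(o_0), 0.0
    trajectory = [z]
    for _ in range(steps):
        a = C(z, h)
        z, h = M(z, a, h)  # no call to a real environment here
        trajectory.append(z)
    return trajectory

traj = dream(o_0=2.0, steps=20)
print([round(z, 3) for z in traj[:5]], "...", round(traj[-1], 3))
```

With these particular coefficients the dreamed latent settles near 0.8 rather than the target 1.0, because the decay in `M` balances the controller's push. A better controller could be found by searching over `gain` entirely inside the dream.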

Why world models matter

1. Sample efficiency: learning in a dream. If \(p_\theta(z_{t+1}\mid z_t,a_t)\) is accurate, the controller can be optimized on imagined rollouts instead of real ones. This is the idea behind the Dreamer line of work, which learns behaviors “by latent imagination” and, as of DreamerV3, is reported to work across more than 150 tasks with a single configuration⁹,¹⁰,¹¹. Concretely, the controller maximizes imagined return

\[\max_{\pi}\; \mathbb{E}_{p_\theta,\,\pi}\!\left[\sum_{\tau=t}^{t+H} \gamma^{\,\tau-t}\, \hat{r}_\tau \right],\]

with the whole rollout synthesized by the model over a horizon \(H\).

2. Planning & reasoning that is grounded. With a simulator inside its head, an agent can search over action sequences, compare outcomes, and choose — model-predictive control, rather than reflex.

3. Generalization & grounding. Forcing a network to predict consequences pressures it to discover the causal structure of its environment (objects, physics, agency) — representations that transfer.

4. A candidate path to autonomous machine intelligence. This is the thesis behind LeCun’s proposal for a modular, predictive, world-model-centric architecture as a route past the limits of pure text prediction¹².
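Points 1 and 2 above (scoring imagined rollouts and planning by search) can be sketched together. The `dynamics` and `reward` functions below are hand-written stand-ins for a learned model, and the planner is plain random shooting, chosen because it is one of the simplest forms of model-predictive control. It is not the method used by Dreamer, which trains an actor and critic on imagined trajectories.

```python
import random

def dynamics(z, a):
    """Toy stand-in for the learned latent transition p(z_{t+1} | z_t, a_t)."""
    return 0.9 * z + 0.2 * a

def reward(z):
    """Toy stand-in for the reward head: highest when the latent is at 1.0."""
    return -abs(z - 1.0)

def imagined_return(z_t, actions, gamma=0.99):
    """Discounted sum of predicted rewards from tau = t to t + H, with H = len(actions)."""
    total, z = reward(z_t), z_t  # the tau = t term
    for k, a in enumerate(actions, start=1):
        z = dynamics(z, a)       # imagined next latent
        total += (gamma ** k) * reward(z)
    return total

def plan(z_t, horizon=5, candidates=200, seed=0):
    """Random shooting: sample action sequences, keep the best-scoring one.

    Returns the first action of the best sequence and its imagined return;
    an MPC agent would execute that action and then re-plan.
    """
    rng = random.Random(seed)
    best_seq, best_ret = None, float("-inf")
    for _ in range(candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        ret = imagined_return(z_t, seq)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq[0], best_ret

first_action, best_ret = plan(z_t=0.0)
print(first_action, best_ret)
```

Every candidate is evaluated inside the model, so the cost of planning is compute rather than real-world interaction. The quality of the plan is bounded by the accuracy of `dynamics` and `reward`.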

Predicting pixels vs. predicting representations

There is a subtle but crucial design choice hiding in decoder (3) above. Do we predict the raw future observation (every pixel), or only a representation of it?

Generative world models reconstruct observations: great for visualization and interpretability, but they spend capacity modeling unpredictable, irrelevant detail (every leaf, every texture).
Joint-Embedding Predictive Architectures (JEPA) instead predict in a learned latent space: an encoder maps the context \(x\) to an embedding \(s_x\) and the target \(y\) to an embedding \(s_y\), and a predictor \(P\), optionally conditioned on a latent variable \(v\), is trained to minimize the energy

\[E(x, y, v) = \big\| \, s_y - P(s_x, v) \, \big\|^2 ,\]

so the model is rewarded for capturing predictable structure and free to discard unpredictable noise¹²,¹³. This is the “From LLM to JEPA” shift in one equation: stop predicting every token or pixel, start predicting the abstract state that actually matters for acting.
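The difference between the two choices can be seen in a toy setting. The sketch below assumes an observation is a list whose first entry is a predictable position and whose remaining entries are random "texture". The `encoder` and `predictor` are written by hand so that the arithmetic is easy to follow, whereas in an actual JEPA both are learned.

```python
import random

def encoder(obs):
    """Toy encoder: keep the predictable first coordinate, drop the noisy texture."""
    return [obs[0]]

def predictor(s_x, v=0.0):
    """Toy predictor P(s_x, v): the position moves by +1, plus a latent offset v."""
    return [s + 1.0 + v for s in s_x]

def sq_dist(u, w):
    return sum((a - b) ** 2 for a, b in zip(u, w))

def jepa_energy(x, y, v=0.0):
    """|| s_y - P(s_x, v) ||^2, computed in embedding space."""
    return sq_dist(encoder(y), predictor(encoder(x), v))

def pixel_error(x, y):
    """Squared error of an observation-space guess: shift the position, copy the texture."""
    guess = [x[0] + 1.0] + x[1:]
    return sq_dist(y, guess)

rng = random.Random(0)

def texture(n=8):
    """Unpredictable detail: fresh Gaussian noise in every frame."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

x = [3.0] + texture()  # context frame: position 3
y = [4.0] + texture()  # target frame: position 4, new texture

print("latent energy:", jepa_energy(x, y))
print("pixel error:  ", pixel_error(x, y))
```

Both predictions get the position exactly right. The observation-space error is still large because it is charged for texture that no model could have predicted, while the latent energy is zero.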

A quick tour of the landscape

A non-exhaustive map of where the field is, to orient the rest of the series:

Foundational latent world models for control: Ha & Schmidhuber⁷; the Dreamer family⁹,¹⁰,¹¹.
Joint-embedding / non-generative prediction: the JEPA program¹²,¹³.
Generative interactive environments: Genie learns action-controllable worlds from unlabeled video¹⁴.
Video generators as world simulators: large video models reported to show emergent simulation of physics and object persistence¹⁵.
Embodied, agent-centric world models: recent position work argues the world model is the missing core of embodied agents¹⁶, and multiplayer/multi-agent world models must keep shared views consistent across agents¹⁷.
Evaluation is being rethought: closed-loop benchmarks show that visual realism does not imply task success¹⁸, and high-level procedural planning remains largely unsolved for today’s models¹⁹.

The contemporary literature here is vast and moving fast; throughout the series I’ll treat recent papers as the best raw material for generating new research ideas. For a continuously-updated, annotated guide, see the companion World Models — Reading Map and the structured Deep Dives.

Where this series is going

This was Part 0 — the why and the vocabulary. Upcoming parts (added one by one) will go deep, with derivations and code, into: latent-variable models and the VAE/ELBO; recurrent vs. Transformer dynamics; the full Dreamer objective; JEPA and energy-based learning; generative interactive worlds (Genie / Sora); evaluation; and open problems and proposals. Each will live in its own post under the World Models section of the blog.

Comments are open at the bottom of every post — feedback, corrections, and pointers to papers I should cover are very welcome.

How to cite this post

If this series is useful for your work, please cite the primary sources listed plainly in the References below. To reference this post itself:

@misc{haque2026worldmodels0,
  author = {Md Rezwanul Haque},
  title  = {World Models --- Part 0: From Language Models to World Models},
  year   = {2026},
  howpublished = {\url{https://rezwan.xyz/blog/2026/world-models-introduction/}},
  note   = {Blog post, CPAMI Lab, University of Waterloo}
}

References

1. AMI Labs. [JEPA, EBM, World Models] AMI Labs and the Architecture of Actionable World Models: From LLM to JEPA. YouTube, 2026. youtube.com/watch?v=UaHwJeCMzso
2. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention Is All You Need. NeurIPS, 2017.
3. T. B. Brown et al. Language Models are Few-Shot Learners. NeurIPS, 2020.
4. J. Schmidhuber. Making the World Differentiable: On Using Self-Supervised Fully Recurrent Neural Networks for Dynamic Reinforcement Learning and Planning. Tech. Report FKI-126-90, TU Munich, 1990.
5. R. S. Sutton. Dyna, an Integrated Architecture for Learning, Planning, and Reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.
6. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction (2nd ed.). MIT Press, 2018.
7. D. Ha and J. Schmidhuber. World Models / Recurrent World Models Facilitate Policy Evolution. NeurIPS, 2018. arXiv:1803.10122
8. D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. ICLR, 2014. arXiv:1312.6114
9. D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to Control: Learning Behaviors by Latent Imagination. ICLR, 2020. arXiv:1912.01603
10. D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. Mastering Atari with Discrete World Models (DreamerV2). ICLR, 2021. arXiv:2010.02193
11. D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering Diverse Domains through World Models (DreamerV3). 2023. arXiv:2301.04104
12. Y. LeCun. A Path Towards Autonomous Machine Intelligence. OpenReview (v0.9.2), 2022. openreview.net
13. M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA). CVPR, 2023. arXiv:2301.08243
14. J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, et al. Genie: Generative Interactive Environments. ICML, 2024. arXiv:2402.15391
15. OpenAI. Video Generation Models as World Simulators. 2024. openai.com
16. P. Fung et al. Embodied AI Agents: Modeling the World. Meta, 2025. arXiv:2506.22355
17. G. Savva, O. Michel, …, S. Xie. Solaris: Building a Multiplayer Video World Model in Minecraft. NYU, 2026. arXiv:2602.22208
18. J. Zhang et al. World-in-World: World Models in a Closed-Loop World. ICLR 2026 (Oral). arXiv:2510.18135
19. D. Chen, W. Chung, Y. Bang, Z. Ji, and P. Fung. WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning. Meta, 2025. arXiv:2506.04363
