World Models — Reading Map: Recent Literature (2024–2026)
This is a living reading map for the World Models series — a curated, thematically organized guide to the recent literature on world models, active perception, and agentic embodied AI. I treat these papers as the best raw material for idea generation: read across a cluster, find the gap, build there.
It pairs with the conceptual primer in Part 0: From Language Models to World Models, and for ten of these papers there is a structured breakdown — gist, pipeline, benchmark, dataset, results, and improvement/idea angles — in the Deep Dives companion. Annotations are paraphrased in my own words; many entries are fast-moving preprints, so open each link and confirm title/authors/venue before citing in a formal document.
Start here
- Embodied AI Agents: Modeling the World — the position paper that makes the world model the core of embodied agents.
- World-in-World — closed-loop evidence that visual realism ≠ task success.
- WorldPrediction — a benchmark showing high-level procedural planning is largely unsolved.
- V* — "where to look" as a learnable skill inside a multimodal model.
1 · Foundations & framing
The classics that define the vocabulary, plus the position papers that set today’s agenda.
2 · World models for embodied agents
Surveys that organize the field, and the two method families that dominate: video-generation vs. latent-prediction world models.
3 · Active perception — “moving to see better”
Agents that choose where to look or move in order to perceive better — increasingly trained with RL rather than hand-designed next-best-view rules.
4 · Agentic & multi-agent embodied AI
Agents that reason, plan, and coordinate — often LLM/VLM-driven, increasingly as multi-agent systems.
5 · Advanced MI — foundation models, spatial cognition & VLAs
The generalist layer: models that unify perception, language, and action, and increasingly reason about space and time.
6 · Evaluation & benchmarks
The most important recent shift: judging world models by closed-loop task success and state tracking rather than by pixel-level fidelity.
How to use this map for idea generation
A simple recipe I follow: (1) pick a cluster above; (2) read the surveys to get the taxonomy; (3) read 2–3 methods and their evaluation; (4) look for a mismatch — e.g. metrics that don’t predict task success (World-in-World), or a capability humans have that models lack (state tracking, spatial recall); (5) propose the smallest experiment that closes that gap. The recurring theme across the 2025–2026 literature — visual realism is not utility — is itself a fertile source of problems.
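Step (4) of the recipe can be made concrete with a quick check: does a visual-quality score rank models the same way closed-loop task success does? The sketch below is one possible way to test that, using a Spearman rank correlation in plain Python. Everything in it is an assumption for illustration: the function names `ranks` and `spearman` are hypothetical helpers, and the two score lists are invented placeholder numbers, not results from World-in-World or any other paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder numbers, invented for illustration only (NOT from any paper).
# One entry per hypothetical world model.
visual_quality = [0.91, 0.85, 0.78, 0.70, 0.62]  # higher = more realistic
task_success = [0.40, 0.55, 0.35, 0.60, 0.50]    # closed-loop success rate

rho = spearman(visual_quality, task_success)
print(f"Spearman rho = {rho:.2f}")
```

A correlation near zero or below it, as with these made-up numbers, is the kind of mismatch worth turning into an experiment; a value near 1 would suggest the visual metric is an adequate proxy for that task. With real data, `scipy.stats.spearmanr` computes the same statistic and also reports a p-value.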
Verification note. Links point to real, recent papers; the Saining Xie / NYU and Meta entries are taken from official pages, and the four “start here” IDs were independently confirmed. Several other entries are preprints that may later change venue or version — confirm before formal citation.