World Models — Reading Map: Recent Literature (2024–2026)

This is a living reading map for the World Models series — a curated, thematically organized guide to the recent literature on world models, active perception, and agentic embodied AI. I treat these papers as the best raw material for idea generation: read across a cluster, find the gap, build there.

It pairs with the conceptual primer in Part 0: From Language Models to World Models, and for ten of these papers there is a structured breakdown — gist, pipeline, benchmark, dataset, results, and improvement/idea angles — in the Deep Dives companion. Annotations are paraphrased in my own words; many entries are fast-moving preprints, so open each link and confirm title/authors/venue before citing in a formal document.

⭐ Start here (four high-leverage reads):

Embodied AI Agents: Modeling the World — the position paper that makes the world model the core of embodied agents.
World-in-World — closed-loop evidence that visual realism ≠ task success.
WorldPrediction — a benchmark showing high-level procedural planning is largely unsolved.
V* — "where to look" as a learnable skill inside a multimodal model.

1 · Foundations & framing

The classics that define the vocabulary, plus the position papers that set today’s agenda.

World Models. Ha & Schmidhuber · NeurIPS 2018. The canonical V–M–C blueprint: compress observations, learn latent dynamics, train a tiny controller, and learn inside the model's "dream".

A Path Towards Autonomous Machine Intelligence. LeCun · 2022. The JEPA manifesto: a modular, predictive, world-model-centric architecture as a route past pure text prediction.

Embodied AI Agents: Modeling the World. Fung et al. · Meta · 2025. Argues perception + reasoning-for-action + memory should be unified inside a world model, and adds a "mental world model" (Theory-of-Mind) layer.
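
For readers new to the V–M–C vocabulary in the Ha & Schmidhuber entry above, here is a minimal toy sketch of the three roles and of a "dream" rollout. It is not the paper's implementation: the original uses a VAE for V, an MDN-RNN for M, and a small linear controller trained with CMA-ES, whereas this sketch uses untrained random linear maps, drops the recurrent hidden state, and every name in it (`ToyWorldModel`, `dream`, the dimensions) is a hypothetical chosen for illustration.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Random weight matrix; a stand-in for parameters that would be learned."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyWorldModel:
    """Untrained linear stand-ins for the V, M and C components (assumption)."""

    def __init__(self, obs_dim=16, latent_dim=4, action_dim=2, seed=0):
        rng = random.Random(seed)
        # V: compresses an observation into a small latent vector.
        self.encoder = make_matrix(latent_dim, obs_dim, rng)
        # M: predicts the next latent from the current latent and action.
        self.dynamics = make_matrix(latent_dim, latent_dim + action_dim, rng)
        # C: a tiny controller mapping the latent to an action.
        self.controller = make_matrix(action_dim, latent_dim, rng)

    def encode(self, observation):
        return matvec(self.encoder, observation)

    def predict_next(self, latent, action):
        return [math.tanh(v) for v in matvec(self.dynamics, latent + action)]

    def act(self, latent):
        return [math.tanh(v) for v in matvec(self.controller, latent)]

    def dream(self, first_observation, steps):
        """Roll forward using only M and C; no real environment is queried."""
        latent = self.encode(first_observation)
        trajectory = [latent]
        for _ in range(steps):
            action = self.act(latent)
            latent = self.predict_next(latent, action)
            trajectory.append(latent)
        return trajectory


if __name__ == "__main__":
    model = ToyWorldModel()
    rollout = model.dream([0.5] * 16, steps=5)
    print(len(rollout), "latent states imagined")
```

The point of the sketch is the data flow: only the first step touches an observation, and everything after it is imagined by the dynamics model, which is what makes training a controller "inside the dream" possible.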

2 · World models for embodied agents

Surveys that organize the field, and the two method families that dominate: video-generation vs. latent-prediction world models.

World Model for Robot Learning: A Comprehensive Survey. Survey · 2026. Stresses action-conditioned world models: visually plausible but action-inconsistent futures are of limited value for control.

A Comprehensive Survey on World Models for Embodied AI. Survey · 2025. A taxonomy across functionality, temporal modeling, and spatial representation.

Solaris: Building a Multiplayer Video World Model in Minecraft. Savva, …, Xie · NYU · 2026. A multiplayer video world model predicting consistent shared views; directly relevant to multi-agent / shared visual understanding.

Humanoid World Models. 2025. Open foundation world models for humanoid robotics.

LongScape: Long-Horizon Embodied World Models with Context-Aware MoE. 2025. Tackles long-horizon generation with a context-aware mixture of experts.

Rethinking Video Generation Model for the Embodied World. 2026. Reconsiders what video generators need in order to be useful as world models for embodiment.

3 · Active perception — “moving to see better”

Agents that choose where to look or move in order to perceive better — increasingly trained with RL rather than hand-designed next-best-view rules.

Reinforced Embodied Active Defense (Rein-EAD). TPAMI 2025 · Tsinghua. An RL "take a second look" policy with uncertainty-aware dense rewards; the closest existing mechanism to "move to see better".

V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs. Wu & Xie · CVPR 2024. The SEAL framework decides where to look in an image before answering; the model-side analog of active perception.

Vision in Action: Learning Active Perception from Human Demonstrations. 2025. Learns active viewpoint behaviors from human demos.

Real-World Reinforcement Learning of Active Perception Behaviors. 2025. Trains active-sensing behaviors directly in the real world.

Act, Sense, Act: Non-Markovian Active Perception from Egocentric Human Data. 2026. Learns non-Markovian active-perception strategies at scale from egocentric video.

SaPaVe: Active Perception and Manipulation in VLA Models. 2026. Adds active perception + manipulation into vision-language-action models.

4 · Agentic & multi-agent embodied AI

Agents that reason, plan, and coordinate — often LLM/VLM-driven, increasingly as multi-agent systems.

RL4VLM: Fine-Tuning VLMs as Decision-Making Agents via RL. Zhai et al. (incl. Xie) · NeurIPS 2024. The "VLM as an RL agent" recipe; the cleanest template for attaching a reward to a perception-capable agent.

V-IRL: Grounding Virtual Intelligence in Real Life. Yang et al. (incl. Xie) · ECCV 2024. Grounds virtual agents in real-world data; a bridge between simulation and reality.

Multi-agent Embodied AI: Advances and Future Directions. Survey · 2025. Frames the perception–action loop across multiple cooperating agents.

Towards Embodied Agentic AI: Review & Classification of LLM/VLM-Driven Robot Autonomy. Survey · 2025. Architectures where the model acts as coordinator, planner, perception actor, or generalist interface.

5 · Advanced MI — foundation models, spatial cognition & VLAs

The generalist layer: models that unify perception, language, and action, and increasingly reason about space and time.

Cambrian-S: Towards Spatial Supersensing in Video. Yang, …, Fei-Fei, Xie · ICLR 2026. Perceiving and reasoning about space over time: "moving to see better" at the model level.

Thinking in Space: How Multimodal LLMs See, Remember and Recall Spaces. Yang et al. (incl. Xie) · CVPR 2025. Studies spatial memory: how models build and recall a mental map.

Eyes Wide Shut? Visual Shortcomings of Multimodal LLMs. Tong et al. (incl. Xie) · CVPR 2024. Documents concrete perception failures; motivation that current models still see poorly.

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. Tong et al. (incl. Xie) · NeurIPS 2024. A strong vision-centric MLLM baseline.

A Survey on Vision-Language-Action Models for Embodied AI. Survey · continuously updated. The reference VLA survey.

6 · Evaluation & benchmarks

The most important recent shift: judging world models by closed-loop task success and state-tracking, not pixels.

World-in-World: World Models in a Closed-Loop World. Zhang et al. · ICLR 2026 (Oral). The first closed-loop platform; finds visual quality does not track task success, and controllability + action-observation post-training matter more.

WorldPrediction: High-level World Modeling & Long-horizon Procedural Planning. Chen et al. · Meta · 2025. Frontier models reach only ~57% (WM) / ~38% (planning) vs. near-perfect humans; high-level planning is largely unsolved.

WorldModelBench: Judging Video Generation Models as World Models. 2025. A benchmark for evaluating video generators specifically as world models.

VSTAT: Benchmarking Visual State Tracking. NYU (incl. Xie) · 2026. Best model ~44% vs. humans ~90%: models describe frames but can't track state over time.

Action100M: A Large-scale Video Action Dataset. Meta FAIR · 2026. ~100M hierarchically labeled segments via an automated pipeline; a data substrate for perception/world-model work.

MANIQA: Multi-dimension Attention Network for No-Reference IQA. Yang et al. · CVPRW 2022. A template for scoring human-perceived quality from a single image (SRCC/PLCC vs. MOS); useful for perception-aligned reward design.
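
The "SRCC/PLCC vs. MOS" shorthand in the MANIQA entry refers to how well a model's predicted quality scores agree with human mean opinion scores (MOS): PLCC is the Pearson linear correlation, and SRCC is the Spearman rank correlation, which is Pearson applied to ranks. As a rough illustration, here is a dependency-free sketch of both. The function names and the four sample scores are made up for the example and are not taken from any paper or dataset.

```python
import math


def pearson(xs, ys):
    """PLCC: linear correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """SRCC: Pearson correlation computed on the ranks of the scores."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Hypothetical scores for four images (illustrative values only).
    predicted = [0.2, 0.4, 0.5, 0.9]
    human_mos = [1.5, 2.0, 3.5, 4.5]
    print("SRCC:", spearman(predicted, human_mos))
    print("PLCC:", pearson(predicted, human_mos))
```

In this toy data the model orders the images exactly as humans do, so SRCC is 1, while PLCC falls below 1 because the relationship is monotone but not perfectly linear. The same comparison applies to any candidate reward signal: check whether it ranks outcomes the way the target judgment does before optimizing against it.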

How to use this map for idea generation

A simple recipe I follow: (1) pick a cluster above; (2) read the surveys to get the taxonomy; (3) read 2–3 methods and their evaluation; (4) look for a mismatch — e.g. metrics that don’t predict task success (World-in-World), or a capability humans have that models lack (state tracking, spatial recall); (5) propose the smallest experiment that closes that gap. The recurring theme across the 2025–2026 literature — visual realism is not utility — is itself a fertile source of problems.

Verification note. Links point to real, recent papers; the Saining Xie / NYU and Meta entries are taken from official pages, and the four “start here” IDs were independently confirmed. Several other entries are preprints that may later change venue or version — confirm before formal citation.
