End-to-end autonomous driving policies
By now, most autonomous driving labs agree that building a fully autonomous driving policy on top of hard-coded rules and engineered features is doomed to fail. The only realistic way to build an autonomous driving policy that scales to arbitrarily complex and diverse environments is to use the methods that scale arbitrarily with computation and data: search and learning. We want a driving policy that is trained end-to-end and learns to drive from experience, like we do.
The need for simulation: off-policy vs. on-policy learning
A key challenge in end-to-end learning is training a policy that performs well even though driving violates the i.i.d. assumption made by most supervised learning algorithms, such as Behavior Cloning. In the real world, the policy's predictions influence its future observations. Small errors accumulate over time, compounding until the system is driven into states it never encountered during imitation learning training.
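To make the compounding concrete, here is a toy numerical sketch (not from the paper): a behavior-cloned lane-keeping policy with a small systematic steering error looks fine when each frame is scored independently, but drifts steadily once its own outputs feed back into the state.

```python
import numpy as np

# Toy illustration (assumed numbers, not from the paper): a cloned policy with a
# small bias epsilon in its steering predictions.
np.random.seed(0)
epsilon = 0.01   # small systematic per-step error (arbitrary units)
steps = 200

# Open-loop: each prediction is scored independently against the human trajectory,
# so the average error stays around epsilon.
open_loop_error = np.abs(epsilon + 0.01 * np.random.randn(steps)).mean()

# Closed-loop: the same per-step error feeds back into the state, so the lateral
# offset accumulates over the rollout.
offset = 0.0
for _ in range(steps):
    offset += epsilon + 0.01 * np.random.randn()

print(f"mean open-loop error:    {open_loop_error:.3f}")
print(f"final closed-loop drift: {offset:.3f}")
```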
In a previous blog post, we showed how a pure imitation learning policy does not recover from its mistakes, slowly drifting away from the desired trajectory. To overcome this, the driving policy needs to be trained on-policy, so that it learns from its own interactions with the environment and can recover from its own mistakes. Since running on-policy learning in the real world is costly and impractical, simulation-based training is essential.
Reprojective Driving Simulators
Depth Reprojection
Given a dense depth map, a 6 DOF pose, and an image, we can render a new image by reprojecting the 3D points in the depth map to a new desired pose. This process is called Reprojective Simulation. In practice, we use a history of images and depth maps to reproject the image to a desired pose and inpaint the missing regions.
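As a rough illustration of the warping step, here is a minimal NumPy sketch of depth reprojection. The function name, the pinhole intrinsics, and the lack of occlusion handling are simplifying assumptions, not openpilot's actual camera pipeline.

```python
import numpy as np

def reproject(image, depth, K, T_new_from_old):
    """image: (H, W, 3), depth: (H, W), K: (3, 3) pinhole intrinsics,
    T_new_from_old: (4, 4) rigid transform from the old to the new camera pose."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)

    # Back-project pixels to 3D points in the old camera frame.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])

    # Transform the points into the new camera frame and project back to pixels.
    pts_new = (T_new_from_old @ pts_h)[:3]
    z = pts_new[2]
    in_front = z > 1e-6
    proj = K @ pts_new
    uv_new = np.full((2, pts.shape[1]), -1, dtype=int)
    uv_new[:, in_front] = (proj[:2, in_front] / z[in_front]).round().astype(int)

    # Scatter source colors into the new view; unfilled pixels are left for inpainting.
    out = np.zeros_like(image)
    valid = in_front & (uv_new[0] >= 0) & (uv_new[0] < W) & (uv_new[1] >= 0) & (uv_new[1] < H)
    out[uv_new[1, valid], uv_new[0, valid]] = image.reshape(-1, 3)[valid]
    return out
```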
We shipped a model trained end-to-end with reprojective simulation to our users for lateral planning in openpilot 0.8.15, and for longitudinal planning in openpilot 0.9.0.
Limitations of Reprojective Simulators
We talk extensively about the limitations of classical reprojective simulation in Learning a Driving Simulator | COMMA_CON 2023, and in Section 3 of the paper.
World Models
World Models are data-driven simulators. They are generative models predicting the next world state given a history of past states and actions.
World Models can take many forms. The key idea is to represent the state as a lower-dimensional latent representation using a "compressor model," and to model the dynamics of that latent space using a "dynamics model."
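As a schematic of this compressor/dynamics split, here is a minimal PyTorch sketch. The module names, sizes, and the GRU dynamics are illustrative assumptions; the actual system uses the diffusion-based components described below.

```python
import torch
import torch.nn as nn

class Compressor(nn.Module):
    """Maps a frame to a compact latent vector (illustrative architecture)."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(64, latent_dim))

    def forward(self, frame):            # (B, 3, H, W) -> (B, latent_dim)
        return self.enc(frame)

class DynamicsModel(nn.Module):
    """Predicts the next latent from past latents and actions."""
    def __init__(self, latent_dim=256, action_dim=2):
        super().__init__()
        self.rnn = nn.GRU(latent_dim + action_dim, latent_dim, batch_first=True)
        self.head = nn.Linear(latent_dim, latent_dim)

    def forward(self, latents, actions):  # (B, T, latent_dim), (B, T, action_dim)
        h, _ = self.rnn(torch.cat([latents, actions], dim=-1))
        return self.head(h[:, -1])        # predicted next latent
```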
The current system is based on the Stable Diffusion image VAE and a video Diffusion Transformer.
In order to be used as a simulator for training driving policies, the World Model also needs to provide an Action Ground Truth, i.e. the ideal curvature and acceleration given the current state. To do so, we add a "Plan Head" to the dynamics model, which predicts the trajectory to take.
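A plan head of this kind could look like the sketch below. The module name, the horizon length, and the per-step output parameterization (curvature and acceleration) are assumptions made for illustration.

```python
import torch.nn as nn

class PlanHead(nn.Module):
    """Regresses a short-horizon trajectory from the dynamics model's latent state."""
    def __init__(self, latent_dim=256, horizon=33, plan_dim=2):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, horizon * plan_dim))
        self.horizon, self.plan_dim = horizon, plan_dim

    def forward(self, latent):            # (B, latent_dim) -> (B, horizon, plan_dim)
        # Each step of the plan holds e.g. (curvature, acceleration).
        return self.mlp(latent).view(-1, self.horizon, self.plan_dim)
```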
The "Plan Head" is trained using the human path. But only giving the past states to the world model is not enough to make it "recover," it essentially suffers from the off-policy training problems described above.
To overcome this, we "anchor" the world model by also providing its future state at some fixed time step ahead. Knowing where the car is going to be allows the world model to recover from its mistakes and to predict images and plans that converge to that future state.
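The sketch below shows one way such anchoring could be wired in; the fusion layer and the exact conditioning are assumptions, not the mechanism used in the paper. The dynamics model receives, in addition to the past latents and actions, the latent of a ground-truth frame a fixed number of steps in the future.

```python
import torch
import torch.nn as nn

class AnchoredDynamicsModel(nn.Module):
    """Dynamics model conditioned on a future anchor latent (illustrative)."""
    def __init__(self, latent_dim=256, action_dim=2):
        super().__init__()
        self.rnn = nn.GRU(latent_dim + action_dim, latent_dim, batch_first=True)
        self.fuse = nn.Linear(2 * latent_dim, latent_dim)  # fuse rollout state with anchor

    def forward(self, past_latents, actions, future_anchor):
        # past_latents: (B, T, D), actions: (B, T, A), future_anchor: (B, D)
        h, _ = self.rnn(torch.cat([past_latents, actions], dim=-1))
        # Conditioning on the anchor lets drifted rollouts converge back toward it.
        return self.fuse(torch.cat([h[:, -1], future_anchor], dim=-1))
```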
More implementation details are given in Section 4 of the paper.
(Figure: the importance of future anchoring)
Controlling the World Model
Similar to the reprojective simulator, we can control the world model by providing a desired 6 DOF pose.
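A controlled rollout could look like the sketch below, where `world_model.step` is a hypothetical interface and each pose is an (x, y, z, roll, pitch, yaw) offset applied at that simulated step. For instance, a constant leftward offset corresponds to the "deviate left" example below.

```python
import numpy as np

def rollout(world_model, init_frames, poses_6dof):
    """init_frames: history of frames; poses_6dof: (T, 6) desired relative poses
    (x, y, z, roll, pitch, yaw), one per simulated step."""
    frames = list(init_frames)
    for pose in poses_6dof:
        next_frame = world_model.step(frames, pose)  # hypothetical world model API
        frames.append(next_frame)
    return frames

# e.g. 20 steps of a constant 0.5 m leftward offset -> "deviate left"
deviate_left = np.tile(np.array([0.0, 0.5, 0.0, 0.0, 0.0, 0.0]), (20, 1))
```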
(Examples: deviate left, go straight, deviate right)
Putting it all together
Both driving simulators are used to train a driving model using On-Policy Learning.
In practice, we use distributed and asynchronous rollout data collection and model updates, similar to IMPALA and Gorila. More details about how the policy is trained and the evaluation suite are given in Section 5 of the paper.
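In spirit, the actor-learner split looks like the toy skeleton below: actor threads roll out the current policy inside the simulator and push trajectories onto a queue, while the learner consumes them asynchronously. The queue-based, single-machine structure and the `simulator.rollout` / `policy.update` calls are assumptions for illustration, not the actual training stack.

```python
import queue
import threading

rollout_queue = queue.Queue(maxsize=64)

def actor(actor_id, policy, simulator, n_rollouts=100):
    """Collect rollouts with the current policy and enqueue them."""
    for _ in range(n_rollouts):
        trajectory = simulator.rollout(policy)   # hypothetical simulator API
        rollout_queue.put((actor_id, trajectory))

def learner(policy, n_updates=1000):
    """Consume rollouts asynchronously and update the policy."""
    for _ in range(n_updates):
        _, trajectory = rollout_queue.get()      # data may be slightly stale
        policy.update(trajectory)                # hypothetical policy API

# Usage sketch: start several actor threads alongside one learner, e.g.
# threads = [threading.Thread(target=actor, args=(i, policy, simulator)) for i in range(8)]
```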
Citation
For attribution in academic contexts, please cite this work as
"Learning to Drive from a World Model", Autonomy team, comma.ai, 2025.BibTeX citation
@misc{yousfi2025learningdriveworldmodel,
  title={Learning to Drive from a World Model},
  author={Mitchell Goff and Greg Hogan and George Hotz and Armand du Parc Locmaria and Kacper Raczy and Harald Schäfer and Adeeb Shihadeh and Weixing Zhang and Yassine Yousfi},
  year={2025},
  eprint={2504.19077},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.19077},
}