End-to-end autonomous driving policies
By now, most autonomous driving labs agree that building a fully autonomous driving policy on top of hard-coded rules and engineered features is doomed to fail. The only realistic way to build an autonomous driving policy that scales to arbitrarily complex and diverse environments is to use the methods that scale arbitrarily with computation and data: search and learning. We want a driving policy that is trained end-to-end and learns to drive from experience, like we do.
The need for simulation: off-policy vs. on-policy learning
A key challenge in end-to-end learning is training a policy that performs well even though driving violates the i.i.d. assumption made by most supervised learning algorithms, such as Behavior Cloning. In the real world, the policy's predictions influence its future observations. Small errors accumulate over time, compounding until the system is driven into states it never encountered during imitation learning training.
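To make the compounding concrete, here is a toy numerical sketch (not from the paper): a behavior-cloned lane-keeping policy with a small systematic steering error looks fine when each frame is scored independently, but drifts steadily once its own outputs feed back into the state.

```python
import numpy as np

# Toy illustration (assumed numbers, not from the paper): a cloned policy with a
# small bias epsilon in its steering predictions.
np.random.seed(0)
epsilon = 0.01   # small systematic per-step error (arbitrary units)
steps = 200

# Open-loop: each prediction is scored independently against the human trajectory,
# so the average error stays around epsilon.
open_loop_error = np.abs(epsilon + 0.01 * np.random.randn(steps)).mean()

# Closed-loop: the same per-step error feeds back into the state, so the lateral
# offset accumulates over the rollout.
offset = 0.0
for _ in range(steps):
    offset += epsilon + 0.01 * np.random.randn()

print(f"mean open-loop error:    {open_loop_error:.3f}")
print(f"final closed-loop drift: {offset:.3f}")
```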
In a previous blog post, we showed how a pure imitation learning policy does not recover from its mistakes, slowly drifting away from the desired trajectory. To overcome this, the driving policy needs to be trained on-policy, so that it learns from its own interactions with the environment and can recover from its own mistakes. Since running on-policy learning in the real world is costly and impractical, simulation-based training is essential.
Reprojective Driving Simulators
Depth Reprojection
Given a dense depth map, a 6 DOF pose, and an image, we can render a new image by reprojecting the 3D points in the depth map to a new desired pose. This process is called Reprojective Simulation. In practice, we use a history of images and depth maps to reproject the image to a desired pose and inpaint the missing regions.
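As a rough illustration of the warping step, here is a minimal NumPy sketch of depth reprojection. The function name, the pinhole intrinsics, and the lack of occlusion handling are simplifying assumptions, not openpilot's actual camera pipeline.

```python
import numpy as np

def reproject(image, depth, K, T_new_from_old):
    """image: (H, W, 3), depth: (H, W), K: (3, 3) pinhole intrinsics,
    T_new_from_old: (4, 4) rigid transform from the old to the new camera pose."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)

    # Back-project pixels to 3D points in the old camera frame.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])

    # Transform the points into the new camera frame and project back to pixels.
    pts_new = (T_new_from_old @ pts_h)[:3]
    z = pts_new[2]
    in_front = z > 1e-6
    proj = K @ pts_new
    uv_new = np.full((2, pts.shape[1]), -1, dtype=int)
    uv_new[:, in_front] = (proj[:2, in_front] / z[in_front]).round().astype(int)

    # Scatter source colors into the new view; unfilled pixels are left for inpainting.
    out = np.zeros_like(image)
    valid = in_front & (uv_new[0] >= 0) & (uv_new[0] < W) & (uv_new[1] >= 0) & (uv_new[1] < H)
    out[uv_new[1, valid], uv_new[0, valid]] = image.reshape(-1, 3)[valid]
    return out
```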
We shipped a model trained end-to-end with reprojective simulation to our users for lateral planning in openpilot 0.8.15, and for longitudinal planning in openpilot 0.9.0.
Limitations of Reprojective Simulators
We talk extensively about the limitations of classical reprojective simulation in Learning a Driving Simulator | COMMA_CON 2023, and in Section 3 of the paper.
World Models
World Models are data-driven simulators. They are generative models predicting the next world state given a history of past states and actions.
World Models can take many forms. The key idea is to represent the state as a lower-dimensional latent representation using a "compressor model," and to model the dynamics of that latent space using a "dynamics model."
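As a schematic of this compressor/dynamics split, here is a minimal PyTorch sketch. The module names, sizes, and the GRU dynamics are illustrative assumptions; the actual system uses the diffusion-based components described below.

```python
import torch
import torch.nn as nn

class Compressor(nn.Module):
    """Maps a frame to a compact latent vector (illustrative architecture)."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(64, latent_dim))

    def forward(self, frame):            # (B, 3, H, W) -> (B, latent_dim)
        return self.enc(frame)

class DynamicsModel(nn.Module):
    """Predicts the next latent from past latents and actions."""
    def __init__(self, latent_dim=256, action_dim=2):
        super().__init__()
        self.rnn = nn.GRU(latent_dim + action_dim, latent_dim, batch_first=True)
        self.head = nn.Linear(latent_dim, latent_dim)

    def forward(self, latents, actions):  # (B, T, latent_dim), (B, T, action_dim)
        h, _ = self.rnn(torch.cat([latents, actions], dim=-1))
        return self.head(h[:, -1])        # predicted next latent
```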
The current system is based on the Stable Diffusion image VAE and a video Diffusion Transformer.
In order to be used as a simulator for training driving policies, the World Model also needs to provide an Action Ground Truth, i.e. the ideal curvature and acceleration given the current state. To do so, we add a "Plan Head" to the dynamics model, which predicts the trajectory to take.
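A plan head of this kind could look like the sketch below. The module name, the horizon length, and the per-step output parameterization (curvature and acceleration) are assumptions made for illustration.

```python
import torch.nn as nn

class PlanHead(nn.Module):
    """Regresses a short-horizon trajectory from the dynamics model's latent state."""
    def __init__(self, latent_dim=256, horizon=33, plan_dim=2):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, horizon * plan_dim))
        self.horizon, self.plan_dim = horizon, plan_dim

    def forward(self, latent):            # (B, latent_dim) -> (B, horizon, plan_dim)
        # Each step of the plan holds e.g. (curvature, acceleration).
        return self.mlp(latent).view(-1, self.horizon, self.plan_dim)
```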
The "Plan Head" is trained using the human path. But only giving the past states to the world model is not enough to make it "recover," it essentially suffers from the off-policy training problems described above.
To overcome this, we "anchor" the world model by also providing its future state at some fixed time step ahead. Knowing where the car is going to be allows the world model to recover from its mistakes and to predict images and plans that converge to that future state.
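The sketch below shows one way such anchoring could be wired in; the fusion layer and the exact conditioning are assumptions, not the mechanism used in the paper. The dynamics model receives, in addition to the past latents and actions, the latent of a ground-truth frame a fixed number of steps in the future.

```python
import torch
import torch.nn as nn

class AnchoredDynamicsModel(nn.Module):
    """Dynamics model conditioned on a future anchor latent (illustrative)."""
    def __init__(self, latent_dim=256, action_dim=2):
        super().__init__()
        self.rnn = nn.GRU(latent_dim + action_dim, latent_dim, batch_first=True)
        self.fuse = nn.Linear(2 * latent_dim, latent_dim)  # fuse rollout state with anchor

    def forward(self, past_latents, actions, future_anchor):
        # past_latents: (B, T, D), actions: (B, T, A), future_anchor: (B, D)
        h, _ = self.rnn(torch.cat([past_latents, actions], dim=-1))
        # Conditioning on the anchor lets drifted rollouts converge back toward it.
        return self.fuse(torch.cat([h[:, -1], future_anchor], dim=-1))
```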
More implementation details are given in Section 4 of the paper.
(Figure: the importance of future anchoring)
Controlling the World Model
Similar to the reprojective simulator, we can control the world model by providing a desired 6 DOF pose.
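A controlled rollout could look like the sketch below, where `world_model.step` is a hypothetical interface and each pose is an (x, y, z, roll, pitch, yaw) offset applied at that simulated step. For instance, a constant leftward offset corresponds to the "deviate left" example below.

```python
import numpy as np

def rollout(world_model, init_frames, poses_6dof):
    """init_frames: history of frames; poses_6dof: (T, 6) desired relative poses
    (x, y, z, roll, pitch, yaw), one per simulated step."""
    frames = list(init_frames)
    for pose in poses_6dof:
        next_frame = world_model.step(frames, pose)  # hypothetical world model API
        frames.append(next_frame)
    return frames

# e.g. 20 steps of a constant 0.5 m leftward offset -> "deviate left"
deviate_left = np.tile(np.array([0.0, 0.5, 0.0, 0.0, 0.0, 0.0]), (20, 1))
```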
(Examples: deviate left, go straight, deviate right)
Putting it all together
Both driving simulators are used to train a driving model using On-Policy Learning.
In practice, we use distributed and asynchronous rollout data collection and model updates, similar to IMPALA and Gorila. More details about how the policy is trained and the evaluation suite are given in Section 5 of the paper.
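In spirit, the actor-learner split looks like the toy skeleton below: actor threads roll out the current policy inside the simulator and push trajectories onto a queue, while the learner consumes them asynchronously. The queue-based, single-machine structure and the `simulator.rollout` / `policy.update` calls are assumptions for illustration, not the actual training stack.

```python
import queue
import threading

rollout_queue = queue.Queue(maxsize=64)

def actor(actor_id, policy, simulator, n_rollouts=100):
    """Collect rollouts with the current policy and enqueue them."""
    for _ in range(n_rollouts):
        trajectory = simulator.rollout(policy)   # hypothetical simulator API
        rollout_queue.put((actor_id, trajectory))

def learner(policy, n_updates=1000):
    """Consume rollouts asynchronously and update the policy."""
    for _ in range(n_updates):
        _, trajectory = rollout_queue.get()      # data may be slightly stale
        policy.update(trajectory)                # hypothetical policy API

# Usage sketch: start several actor threads alongside one learner, e.g.
# threads = [threading.Thread(target=actor, args=(i, policy, simulator)) for i in range(8)]
```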
Citation
For attribution in academic contexts, please cite this work as
"Learning to Drive from a World Model", Autonomy team, comma.ai, 2025.BibTeX citation
@misc{yousfi2025learningdriveworldmodel,
  title={Learning to Drive from a World Model},
  author={Mitchell Goff and Greg Hogan and George Hotz and Armand du Parc Locmaria and Kacper Raczy and Harald Schäfer and Adeeb Shihadeh and Weixing Zhang and Yassine Yousfi},
  year={2025},
  eprint={2504.19077},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.19077},
}