Does RL scale?
Over the past few years, we've seen that next-token prediction scales, denoising diffusion scales, contrastive learning scales, and so on, all the way to the point where we can train models with billions of parameters with a scalable objective that can eat up as much data as we can throw at it. Then, what about reinforcement learning (RL)? Does RL also scale like all the other objectives?
Apparently, it does. In 2016, RL achieved superhuman-level performance in Go, and soon after in chess. Now, RL is solving complex reasoning tasks in math and coding with large language models (LLMs). This is great. However, there is one important caveat: most of the current real-world successes of RL have been achieved with on-policy RL algorithms (e.g., REINFORCE, PPO, GRPO, etc.), which always require fresh, newly sampled rollouts from the current policy and cannot reuse previous data (note: while PPO-like methods can technically reuse data to a limited degree, I'll classify them as on-policy RL, as in OpenAI's documentation). This is not a problem in settings like board games and LLMs, where we can cheaply generate as many rollouts as we want. However, it is a significant limitation in most real-world problems. For example, in robotics, generating the number of samples used to post-train a language model with RL would take many months of real-world interaction, not to mention that a human must be present 24/7 next to the robot to reset it during the entire training time!
On-policy RL can only use fresh data collected by the current policy \(\pi\). Off-policy RL can use any data \(\mathcal{D}\).
This is where off-policy RL comes to the rescue. In principle, off-policy RL algorithms can use any data, regardless of when and how it was collected. Hence, they generally achieve much better sample efficiency by reusing data many times. For example, off-policy RL can train a dog robot to walk in 20 minutes from scratch in the real world. Q-learning is the most widely used off-policy RL algorithm. It minimizes the following temporal difference (TD) loss: $$\begin{aligned} \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \bigg[ \Big( Q_\theta(s, a) - \big(r + \gamma \max_{a'} Q_{\bar \theta}(s', a') \big) \Big)^2 \bigg], \end{aligned}$$ where \(\bar \theta\) denotes the parameters of the target network. Most practical (model-free) off-policy RL algorithms are based on variants of the TD loss above. So, to apply RL to many real-world problems, the question becomes: does Q-learning (TD learning) scale? If the answer is yes, it would have an impact at least comparable to the successes of AlphaGo and LLMs, enabling RL to solve far more diverse and complex real-world tasks very efficiently, in robotics, computer-using agents, and so on.
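To make the objective concrete, here is a minimal PyTorch-style sketch of this TD loss, assuming discrete actions and a replay buffer of \((s, a, r, s', \texttt{done})\) tuples (the names and shapes are illustrative, not tied to any particular codebase):

```python
import torch
import torch.nn.functional as F

def td_loss(q_net, target_q_net, batch, gamma=0.99):
    """One Q-learning (1-step TD) loss on a batch of transitions from D."""
    s, a, r, s_next, done = batch  # a: [B] long, r/done: [B] float

    # Q_theta(s, a): pick out the Q-value of the action actually taken.
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: r + gamma * max_a' Q_theta_bar(s', a').
    # The target network (theta_bar) is held fixed, so no gradients flow through it.
    with torch.no_grad():
        next_q = target_q_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * next_q

    # Squared TD error, averaged over the batch.
    return F.mse_loss(q, target)
```

In practice, \(\bar \theta\) is usually a Polyak-averaged or periodically synced copy of \(\theta\).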
Q-learning is not yet scalable
Unfortunately, my current belief is that the answer is not yet. I believe current Q-learning algorithms are not readily scalable, at least to long-horizon problems that require more than (say) 100 semantic decision steps.
Let me clarify. My definition of scalability here is the ability to solve more challenging, longer-horizon problems with more data (of sufficient coverage), compute, and time. This notion is different from the ability to solve merely a larger number of (but not necessarily harder) tasks with a single model, which many excellent prior scaling studies have shown to be possible. You can think of the former as the "depth" axis and the latter as the "width" axis. The depth axis is more important and harder to push, because it requires developing more advanced decision-making capabilities.
I claim that Q-learning, in its current form, is not highly scalable along the depth axis. In other words, I believe we still need algorithmic breakthroughs to scale up Q-learning (and off-policy RL) to complex, long-horizon problems. Below, I'll explain two main reasons why I think so: one is anecdotal, and the other is based on our recent scaling study.
Both AlphaGo and DeepSeek are based on on-policy RL and do not use TD learning.
Anecdotal evidence first. As mentioned earlier, most real-world successes of RL are based on on-policy RL algorithms. AlphaGo, AlphaZero, and MuZero are based on model-based RL and Monte Carlo tree search, and do not use TD learning on board games (see p. 15 of the MuZero paper). OpenAI Five achieves superhuman performance in Dota 2 with PPO (see footnote 6 of the OpenAI Five paper). RL for LLMs is currently dominated by variants of on-policy policy gradient methods, such as PPO and GRPO. Let me ask: do we know of any real-world successes of off-policy RL (1-step TD learning, in particular) on a scale similar to AlphaGo or LLMs? If you do, please let me know and I'll happily update this post.
Of course, I'm not making this claim based only on anecdotal evidence. As mentioned above, I'll show concrete experiments to empirically support this point later in this post. Also, please don't get me wrong: I'm still highly optimistic about off-policy RL and Q-learning (as an RL researcher who mainly works on off-policy RL!). I just think that we are not there yet, and the purpose of this post is to call for research on RL algorithms, rather than to discourage it!
What's the problem?
Then, what fundamentally makes Q-learning not readily scalable to complex, long-horizon problems, unlike other objectives? Here is my answer: $$\begin{aligned} \definecolor{myblue}{RGB}{89, 139, 231} \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \bigg[ \Big( Q_\theta(s, a) - \underbrace{\big(r + \gamma \max_{a'} Q_{\bar \theta}(s', a') \big)}_{{\color{myblue}\texttt{Biased }} (\textit{i.e., }\neq Q^*(s, a))} \Big)^2 \bigg] \end{aligned}$$ Q-learning struggles to scale because the prediction targets are biased, and these biases accumulate over the horizon. This bias accumulation is a fundamental limitation unique to Q-learning (TD learning). For example, other scalable objectives (e.g., next-token prediction, denoising diffusion, contrastive learning, etc.) have no bias in their prediction targets, or at least their biases do not accumulate over the horizon (e.g., BYOL, DINO, etc.).
Biases accumulate over the horizon.
As the problem becomes more complex and the horizon gets longer, the biases in bootstrapped targets accumulate more and more severely, to the point where we cannot easily mitigate them with more data and larger models. I believe this is the main reason why we almost never use discount factors larger than \(0.999\) in practice, and why it is challenging to scale up Q-learning. Note that policy gradient methods suffer much less from this issue: GAE and similar on-policy value estimation techniques can handle longer horizons relatively easily (though at the expense of higher variance), without strict 1-step recursions.
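A standard back-of-the-envelope bound makes this concrete (this is the classic Bellman-residual argument, not a result from our paper): if the learned Q-function matches its own bootstrapped targets only up to an error of \(\varepsilon\) at every state-action pair, then $$\begin{aligned} \| Q_\theta - Q^* \|_\infty \le \frac{\varepsilon}{1 - \gamma} \approx \varepsilon H, \end{aligned}$$ where \(H \approx 1/(1-\gamma)\) is the effective horizon. A fixed per-backup bias, which more data alone does not remove, thus gets amplified by a factor on the order of the horizon; with \(\gamma = 0.999\), that factor is already around \(1000\).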
Empirical scaling study
In our recent paper, we empirically verified the above claim via diverse, controlled scaling studies.
We wanted to see whether current off-policy RL methods can solve highly challenging tasks by just scaling up data and compute. To do this, we first prepared highly complex, previously unsolved tasks in OGBench. Here are some videos:
[Video: humanoidmaze]
These tasks are really difficult. To solve them, the agent must learn complex goal-reaching behaviors from unstructured, random (play-style) demonstrations. At test time, the agent must perform precise manipulation, combinatorial puzzle-solving, or long-horizon navigation, over 1,000 environment steps.
We then collected near-infinite data in these environments, to the degree that overfitting is virtually impossible. We also removed as many confounding factors as possible. For example, we focused on offline RL to abstract away exploration. We ensured that the datasets had sufficient coverage, and that all the tasks were solvable from the given datasets. We also directly provided the agent with ground-truth state observations to reduce the burden of representation learning.
Hence, a "scalable" RL algorithm must really be able to solve these tasks, given sufficient data and compute. If Q-learning does not scale even in this controlled setting with near-infinite data, there is little hope that it will scale in more realistic settings, where we have limited data, noisy observations, and so on.
Standard offline RL methods struggle to scale on complex tasks, even with \(1000\times\) more data.
So, how well did the existing algorithms do? The results were a bit disappointing. None of the standard, widely used offline RL algorithms (flow BC, IQL, CRL, and SAC+BC) were able to solve all of these tasks, even with 1B-sized datasets, which are \(1000 \times\) larger than typical datasets used in offline RL. More importantly, their performance often plateaued far below the optimal level. In other words, they didn't scale well on these complex, long-horizon tasks.
You might ask: Are you really sure these tasks are solvable? Did you try larger models? Did you train them for longer? Did you try different hyperparameters? And so on. In the paper, we tried our best to address as many questions as possible with a number of ablations and controlled experiments, showing that none of these fixes worked... except for one:
Horizon reduction makes RL scalable
Recall my earlier claim that the horizon (and the bias accumulation that comes with it) is the main obstacle to scaling up off-policy RL. To verify this, we tried diverse horizon reduction techniques (e.g., n-step returns, hierarchical RL, etc.) that reduce the number of biased TD backups.
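To make the mechanism concrete, here is a minimal sketch of an n-step TD target (the shapes and names are my own illustration, not the implementation from the paper). The key point is that each target now contains only a single biased bootstrap, so a signal that is \(H\) steps away propagates through roughly \(H/n\) biased backups instead of \(H\):

```python
import torch

def n_step_target(rewards, bootstrap_q, gamma=0.99):
    """n-step TD target: n discounted rewards plus a SINGLE bootstrapped value.

    rewards:      [B, n] tensor holding r_t, ..., r_{t+n-1}
    bootstrap_q:  [B] tensor holding max_a' Q_theta_bar(s_{t+n}, a')
    (Termination masking and off-policy corrections are omitted for brevity.)
    """
    n = rewards.shape[1]
    target = bootstrap_q                          # the only biased term in the target
    for k in reversed(range(n)):
        target = rewards[:, k] + gamma * target   # fold in the discounted rewards
    return target                                 # sum_k gamma^k r_{t+k} + gamma^n * bootstrap_q
```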
Horizon reduction was the only technique we found that substantially improved scaling.
The results were promising! Even simple tricks like n-step returns significantly improved scalability and asymptotic performance (so it is not just a "trick" that merely makes training faster!). Full-fledged hierarchical methods worked even better. More importantly, horizon reduction was the only technique that worked across the board in our experiments. This suggests that simply scaling up data and compute is not enough to address the curse of horizon. In other words, we need better algorithms that directly address this fundamental horizon problem.
Call for research: find a scalable off-policy RL objective
We saw that horizon reduction unlocks the scalability of Q-learning. So are we done? Can we now just scale up Q-learning? I'd say this is only the beginning. While it is great to know the cause and have some solutions, most of the current horizon reduction techniques (n-step returns, hierarchical RL, etc.) only mitigate the issue by a constant factor, and do not fundamentally solve the problem. I think we're currently missing an off-policy RL algorithm that scales to arbitrarily complex, long-horizon problems (or perhaps we may already have a solution, but just haven't stress-tested it enough yet!). I believe finding such a scalable off-policy RL algorithm is the most important missing piece in machine learning today. This will enable solving much more diverse real-world problems, including robotics, language models, agents, and basically any data-driven decision-making tasks.
I'll conclude this post with my thoughts about potential solutions to scalable off-policy RL.
- Can we find a simple, scalable way to extend beyond two-level hierarchies to deal with horizons of arbitrary lengths? Such a solution should be able to naturally form a recursive hierarchical structure, while being simple enough to be scalable. One great example of this (though in a different field) is chain-of-thought in LLMs.
- Another completely different approach (which I intentionally didn't mention so far for simplicity) is model-based RL. We know that model learning is scalable, because it's just supervised learning. We also know that on-policy RL is scalable. So why don't we combine the two: first learn a model, then run on-policy RL inside the model (see the rough sketch after this list)? Would model-based RL indeed scale better than TD-based Q-learning?
- Or is there a way to avoid TD learning entirely? Among the methods I know of, one such example is quasimetric RL, which is essentially based on the linear programming (LP) formulation of RL. Perhaps these sorts of "exotic" RL methods, or Monte Carlo-based methods like contrastive RL, might eventually scale better than TD-based approaches?
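For the model-based direction above, here is a very rough sketch of what "learn a model, then run on-policy RL inside it" could look like. Everything here (the function names, the learned reward, and the plain REINFORCE-style update) is a hypothetical illustration under simplifying assumptions, not a recipe from the paper:

```python
import torch

def imagined_policy_gradient(dynamics, reward_fn, policy, s0, horizon=50, gamma=0.99):
    """Roll the policy out inside a learned dynamics model and compute a
    REINFORCE-style loss on the imagined trajectory -- no TD bootstrapping.

    Assumptions (illustrative only):
      - dynamics(s, a) -> s_next is a learned model trained with supervised learning
      - reward_fn(s, a) -> r is a learned (or known) reward function
      - policy(s) returns a torch.distributions object whose log_prob gives
        one value per batch element (e.g., Categorical over discrete actions)
    """
    s = s0
    log_probs, rewards = [], []
    for _ in range(horizon):
        dist = policy(s)
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        rewards.append(reward_fn(s, a))
        s = dynamics(s, a)                         # imagined next state

    # Discounted Monte Carlo return of the imagined rollout, computed backward.
    ret = torch.zeros_like(rewards[-1])
    loss = torch.zeros_like(rewards[-1])
    for t in reversed(range(horizon)):
        ret = rewards[t] + gamma * ret
        loss = loss - log_probs[t] * ret.detach()  # REINFORCE term (no baseline, for brevity)
    return loss.mean()
```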
Our setup above can be a great starting point for testing these ideas. We have already designed a set of highly challenging robotic tasks, made the datasets, and verified that they are solvable. One can even make the tasks arbitrarily difficult (e.g., by adding more cubes) and further stress-test the scalability of algorithms in a controlled way. We also put effort into making the code as clean as possible. Check out our code!
Feel free to let me know via email/Twitter/X, or reach out to me at conferences, if you have any questions, comments, or feedback. I hope I can write another post about off-policy RL, with a more positive title, in the near future!
Acknowledgments
I would like to thank Kevin Frans, Hongsuk Choi, Ben Eysenbach, Aviral Kumar, and Sergey Levine for their helpful feedback on this post. This post is partly based on our recent work, Horizon Reduction Makes RL Scalable. The views in this post are my own, and do not necessarily reflect those of my coauthors.