Design for Learning


This article is based on a position presentation I gave at a workshop preceding the launch of AI Malaysia, in a session focused on reinforcement learning and artificial intelligence in real-world systems. It reflects my perspective as someone who pivoted from mechatronics engineering to fundamental reinforcement learning research, shaped both by my personal blue-sky vision for robotics and by the overarching research directions I have observed in the field.

Overview

At the intersection of reinforcement learning and robotics, I often see two common motivations. One is using reinforcement learning to improve robotic control in well-characterized, applied settings. The other is pursuing generally capable, intelligent autonomous robots. The latter, crucially, often implies an ability to learn about novel, unforeseen situations and adapt behavior accordingly. In both cases, the dominant approach is sim2real: extensively train a reinforcement learning agent in simulation, and then transfer learned artifacts (e.g., a policy) to the physical robot. But simulation inaccuracies—in modeling the robot or the situations it will encounter—can seriously limit this transfer. Practitioners try to mitigate this "sim2real gap" with domain randomization or privileged information only available in simulation, but when focused on the goal of generally capable, intelligent autonomous robots, I argue that direct learning on hardware is inevitable. I further argue that we should shift some complexity into the hardware design itself, so that learning algorithms and robot designs co-adapt.

Acknowledgements

Thanks to William McNew, Joseph Modayil, and Sorina Lupu for discussions and feedback around this article's content.

1. Separate paths

Learning algorithms and robotics have largely evolved separately, each with its own priorities and constraints. In reinforcement learning particularly, algorithms were primarily developed and evaluated in simulated environments, where one can ignore real-time constraints, run heavy computation between decisions, and safely explore without risking catastrophic system failure. Robotics was driven by the practicalities of making physical systems that reliably perform tasks, reflected in how robots are often classified by whether they use wheels, appendages, or propellers, and in what number. This has led to robots designed for repeatable performance under well-behaved controllers, not for surviving arbitrary exploratory behavior.

2. The big world hypothesis and its ramifications for sim2real

Because these paths diverged, the combination of learning algorithms and robotics is often done via imitation learning or sim2real transfer—the latter arguably being the predominant choice today. sim2real requires building a simulator of the robot and its interactions (manipulable objects, terrain, etc.), running a learning algorithm—often leveraging parallel simulation instances and privileged information not available on the physical robot—and hoping the result transfers.
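
To make this concrete, here is a minimal sketch of domain randomization in this setting. The simulator constructor, agent interface, and parameter ranges below are illustrative assumptions rather than any particular library's API: each training episode samples a fresh set of physics parameters so that the learned policy is not overfit to one (inaccurate) model of the robot.

```python
import random

def sample_sim_params():
    # Hypothetical physics parameters to randomize per training episode;
    # the ranges are illustrative, not tuned for any particular robot.
    return {
        "ground_friction": random.uniform(0.4, 1.2),
        "link_mass_scale": random.uniform(0.8, 1.2),     # +/-20% around nominal mass
        "motor_torque_scale": random.uniform(0.7, 1.0),
        "sensor_latency_steps": random.randint(0, 3),
    }

def train_with_domain_randomization(make_sim, agent, num_episodes):
    # make_sim(params) and agent are hypothetical stand-ins for whatever
    # simulator and learning algorithm are in use.
    for _ in range(num_episodes):
        env = make_sim(sample_sim_params())  # new dynamics every episode
        obs = env.reset()
        done = False
        while not done:
            action = agent.act(obs)
            obs, reward, done = env.step(action)
            agent.update(obs, reward, done)
```

The hope is that a policy robust across the sampled range also covers the physical robot's true parameters, though this still presumes the randomized family contains something close to reality.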

In practice, people may further tightly constrain operating conditions, because any unexpected, out-of-distribution observation can lead to arbitrarily unpredictable and potentially dangerous behavior.

Any need to prevent such unexpected scenarios is akin to creating an artificial, small world. However, a generally capable, intelligent autonomous robot has to handle the messy, unfiltered real world. And the world is big...

... Like, really big.

Especially if it contains other intelligent systems of comparable complexity, it's unrealistic to expect a fixed policy trained in simulation to cover every possibility. Likewise, it's impractical for simulation designers to anticipate everything. In a big world scenario (Javed & Sutton, 2024), out-of-distribution events are inevitable and the environment is effectively non-stationary. That pushes us toward methods that track (Sutton et al., 2007) rather than converge: robots that continually experiment and learn from their stream of interaction experience. And importantly, they need to adapt in real time—otherwise, separating data collection from learning (e.g., into explicit, distinct phases) risks repeating poor decisions in the face of clear, immediate feedback, which can lead to catastrophic system failures.
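
To illustrate the distinction between tracking and converging, here is a minimal sketch, with an artificial drifting scalar standing in for a non-stationary quantity the agent must estimate: a sample-average estimator, whose step size decays to zero, effectively stops adapting, while a constant step-size estimator keeps tracking.

```python
import random

random.seed(0)

true_value = 0.0
avg_estimate, track_estimate = 0.0, 0.0
alpha = 0.1  # constant step size; an illustrative choice

for t in range(1, 20001):
    true_value += 0.001                       # the world slowly drifts (non-stationarity)
    sample = true_value + random.gauss(0.0, 0.5)

    avg_estimate += (sample - avg_estimate) / t           # 1/t step size: converges, stops adapting
    track_estimate += alpha * (sample - track_estimate)   # constant step size: keeps tracking

print(f"true value:          {true_value:.2f}")
print(f"sample average:      {avg_estimate:.2f}   (lags far behind the drift)")
print(f"constant step size:  {track_estimate:.2f}   (stays close to the current value)")
```

This is a caricature of a one-dimensional estimation problem, but the same tension between convergence and continued adaptation shows up in value estimates and policies.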

In a similar vein to work on mitigating the sim2real gap, if we are pursuing generally capable, intelligent autonomous robots, I argue that it's critical we address this "hardware-learning" gap.

3. The wear-and-tear paradox

A common reaction to the pitch of learning directly on hardware is that a robot will wear out before it learns anything.

During the exploration stage of learning, the robot often tries noisy actuation patterns that cause jerky motions and severe wear-and-tear of the motors.

... learning on hardware over many hours creates wear on robots that may require human intervention or repair.

It’s a fair concern and absolutely relevant to many robots we have today under common exploration strategies. But it’s also somewhat paradoxical—this is exactly where real-time adaptation shines. If gears develop more backlash or a motor fails, a real-time learning algorithm should treat that as the new reality and continue pursuing its objective in light of the wear.

The objection often hides a subtler tendency: blaming the software when learning directly on hardware seems infeasible. We already have hand-engineered controllers and frozen sim2real policies—software solutions that work well in narrow settings—so it’s easy to assume we just need to keep pushing software forward. In reality, the challenges of on-hardware learning are shared between software and hardware design. Take self-driving cars, often cited as an example where reinforcement learning is considered impractical. If cars had never been invented—and we’d always intended them to be self-driving, learning machines from the start—we’d probably have gone through many more bumper-car iterations along the way.

4. Ways forward

If learning is to happen directly on hardware, we should think of algorithms and robots as co-adapting parts of a larger system. Algorithms shape behavior, and robot design shapes the extent of learning possible.

As a quick thought experiment, consider how many of our robot designs take inspiration from familiar forms—arms, quadrupeds, humanoids—mirroring functional components found in organic learning systems. But if an organism had evolved an actuator equivalent to an electric motor (e.g., in terms of weight, torque, shape, and power requirements), it’s unclear what forms its descendants might have eventually taken. I’m not suggesting we adhere to biological plausibility, but it's worth recognizing how actuator characteristics constrain the Venn diagram of "viable" learning robots.

Below are—in my opinion—some promising directions on both fronts.

4.1 Software opportunities

While this article aims to emphasize the lesser-discussed ways in which hardware can better accommodate learning algorithms, I'd first like to briefly highlight promising avenues on the software side:

  • Continual learning formalisms. Many learning algorithms and their analyses assume that the environment and data distribution are stationary. But in a big world, where an agent—or even a practitioner—can’t account for everything, an agent’s stream of experience is inevitably non-stationary. Progress on this front includes building stronger theoretical formalisms to better define the setting (e.g., Abel et al., 2023; Elelimy et al., 2025), as well as characterizing and addressing challenges like loss of plasticity (Dohare et al., 2024).
  • Long-horizon solution methods. Many real-world tasks lack clear endpoints and can span arbitrarily long time horizons—an hour-long task at 10 Hz may require temporal credit assignment over 36,000 time steps. Addressing this may involve temporal abstraction (Sutton et al., 1998; Precup, 2000), where higher-level actions can reduce planning complexity, or average-reward formulations (Schwartz, 1993; Mahadevan, 1996), which might be a more natural objective for long timescales (a sketch of this objective appears after this list).
  • Better treatment of continuous-time objectives. Physical systems operate in continuous time—where the environment does not pause for an agent's decision—and learning algorithms need to account for this. Naively applying and tuning a discrete-time algorithm in a continuous-time environment that has been discretized by the practitioner can lead to instability and changes in the underlying objective (e.g., Munos et al., 2006; Tallec et al., 2019; De Asis & Sutton, 2024); a small numeric illustration follows this list. More work is needed on how best to handle time discretization and—more importantly—how to algorithmically decide when to discretize. These challenges are also closely related to issues of latency and asynchrony. In physical systems, sensing, computation, and actuation often operate with varying delays, introducing timing mismatches that warrant careful consideration (Mahmood et al., 2018).
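
To sketch the average-reward objective mentioned above: rather than discounting, the agent optimizes the long-run reward rate, and values are defined relative to that rate, which keeps them bounded over arbitrarily long horizons. One common way to write this (my notation, following standard treatments) is:

```latex
% Average-reward objective: the long-run reward rate under policy \pi
r(\pi) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\!\left[ R_t \mid \pi \right]

% Differential value function: rewards are measured relative to r(\pi)
v_\pi(s) = \mathbb{E}\!\left[ \sum_{k=1}^{\infty} \left( R_{t+k} - r(\pi) \right) \,\middle|\, S_t = s \right]

% One-step differential TD error, with \bar{R}_t an online estimate of r(\pi)
\delta_t = R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}) - \hat{v}(S_t)
```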
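
On the time-discretization point, here is a small numeric illustration (the parameter choices are arbitrary): acting at a higher control frequency while keeping the same per-decision discount silently shrinks the horizon measured in wall-clock time, whereas scaling the discount with the time step preserves it.

```python
# Effective horizon heuristic: roughly dt / (1 - gamma) seconds of future reward
# meaningfully influences each decision.

gamma_per_second = 0.99 ** 10       # arbitrary reference: 0.99 per step at 10 Hz

for dt in [0.1, 0.01, 0.001]:       # 10 Hz, 100 Hz, 1 kHz control
    fixed_gamma = 0.99                      # naive: same discount at every frequency
    scaled_gamma = gamma_per_second ** dt   # discount adjusted to the time step

    fixed_horizon = dt / (1.0 - fixed_gamma)
    scaled_horizon = dt / (1.0 - scaled_gamma)

    print(f"dt={dt:5.3f}s  fixed gamma: horizon ~{fixed_horizon:6.2f}s   "
          f"scaled gamma={scaled_gamma:.5f}: horizon ~{scaled_horizon:6.2f}s")
```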

4.2 Hardware opportunities

Compared to software, hardware is often costlier to iterate on. Nevertheless, smart mechanical design can offload learning complexity, provide helpful inductive biases for learning, and enable real-time adaptation as a learning algorithm's surroundings and embodiment change. I'll outline some directions where hardware innovation can fundamentally expand what's possible for direct learning on robots, complementing what's achievable through software alone.

Cheap and easy maintenance and repair

We should design robots so their parts can be quickly swapped or rebuilt. For instance, open-source designs made from 3D-printed parts and readily available off-the-shelf components make it far less concerning when a learning algorithm damages a robot, because broken parts can be easily repaired or replaced.

Admittedly, the immediate benefit of this desideratum only becomes apparent after a failure; the deeper value lies in reducing long-term experimentation and iteration costs.

Rapid reset and recovery

In robotics, any situation that requires external intervention to continue—when humans can't ordinarily or reasonably be expected to be available for assistance—is effectively a terminal state. A crucial distinction is that such terminal states exist beyond the software problem specification (e.g., rewards, time limits, or practitioner-defined task boundaries). In contrast with simulation, such situations may additionally involve hardware failure and costly repair.

For an autonomous, continually learning system, we should avoid terminal states as much as reasonably possible. A learning algorithm's embodiment (i.e., the physical specification of a robot) is part of the environment, and its design can heavily influence the transition structure of the state space. If we can design a robot such that its transition dynamics mostly resemble a continuing environment rather than an episodic one, it will be better suited to a learning algorithm:

For example, consider a two-dimensional world with a two-wheeled, box-shaped mobile robot. If it were to flip over, the result would be catastrophic as it might no longer be able to meaningfully function in its environment. In this sense, the flipped state would effectively be a hardware-defined terminal state. If, instead, the robot were triangular with a wheel at each corner, it would be far more robust to such flipping-related failures.

More generally, termination is like a unidirectional transition to a catastrophic situation. This criterion thus encompasses designing robots such that actions can be undone, e.g., through immediately reversible actions, or by ensuring that action sequences exist which can bring the robot back to an earlier situation. This aligns with the idea of "reset-free learning," identified by Zhu et al. (2020) as an important ingredient of real-world robotic reinforcement learning, though their focus was on effective terminal states arising from the (software) problem definition rather than those inherent in the hardware specification.
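
As a toy illustration of this reversibility criterion, one can model a robot design as a directed graph over coarse configurations (the state names and transitions below are made up for the two designs sketched above) and check whether every reachable configuration admits a path back to the nominal one; configurations that do not are hardware-defined terminal states:

```python
from collections import deque

def reachable(graph, start):
    # Breadth-first search over the design's transition graph.
    seen, frontier = {start}, deque([start])
    while frontier:
        s = frontier.popleft()
        for nxt in graph.get(s, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def hardware_terminal_states(graph, nominal):
    # A state is effectively terminal if, once entered, no action sequence
    # can bring the robot back to its nominal operating configuration.
    return {s for s in reachable(graph, nominal) if nominal not in reachable(graph, s)}

# Hypothetical coarse dynamics for the two designs described above.
box_robot = {
    "upright": ["upright", "flipped"],   # can tip over...
    "flipped": ["flipped"],              # ...and cannot right itself
}
triangular_robot = {
    "upright": ["upright", "tipped"],
    "tipped": ["upright"],               # any resting pose can drive back to nominal
}

print(hardware_terminal_states(box_robot, "upright"))         # {'flipped'}
print(hardware_terminal_states(triangular_robot, "upright"))  # set()
```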

Robustness to exploratory and suboptimal policies

Learning inevitably involves trying suboptimal—and sometimes risky—actions, so robots should be designed to tolerate the stresses of exploration. Soft robotics offers a promising direction here: robots built to withstand hazardous environments are—by extension—more likely to tolerate their own behavior. Even with rigid-bodied robots, incorporating compliance (e.g., series elastic actuators) can help absorb shocks and reduce damage from exploratory actions.

Further, the square-cube law suggests that scale can play an important role in meeting this criterion. Force scales with cross-sectional area while mass scales with volume, so at larger scales, the power required to meaningfully actuate a system often simultaneously enables it to exceed the yield stress of its materials (see the quick calculation after this paragraph). We observe the implications of this in nature: rigid-bodied creatures (akin to most robots) tend to be smaller, while larger organisms are often softer or more flexible. This isn't to say that rigid robots are doomed to remain small if they are to support continual learning. Consider animals: in addition to biological predispositions from evolutionary processes, animals grow over time, with the opportunity to generalize knowledge learned at smaller sizes into larger, more capable forms.
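
A rough back-of-the-envelope version of that argument, with L a characteristic length scale:

```latex
% Strength (the force a limb or linkage can bear) scales with cross-sectional area,
% while weight scales with volume:
F_{\max} \propto L^{2}, \qquad m g \propto L^{3}

% So the stress required just to support or accelerate the body grows linearly with size:
\sigma \sim \frac{m g}{A} \propto \frac{L^{3}}{L^{2}} = L
```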

5. Concluding remarks

In a big, complex, and unpredictable world, it is impractical for frozen policies or carefully curated simulators to cover every situation. If we want generally capable, intelligent autonomous robots, direct learning on hardware is not just useful—it's inevitable. Toward this goal, we shouldn’t rely primarily on simulators for developing and evaluating learning algorithms, nor should robotics development be largely driven by what’s technically achievable under narrow sets of allowable behaviors. We should consider designing robots with learning in mind from the start, and we should iterate in a way where robots and learning algorithms are aware of each other and co-adapt. This isn’t to say we should abandon simulation or the progress we've made on either front. Rather, it’s to recognize that clever mechanical design—guided by key desiderata for which we might already have solutions—can itself be an enabler of learning.
