* Equal Contribution
⁺ Corresponding Authors
¹ Tsinghua University ² Beijing Zhongguancun Academy
In this paper, we present VolleyBots, a novel robot sports testbed where multiple drones cooperate and compete in the sport of volleyball under physical dynamics. VolleyBots integrates three features within a unified platform: competitive and cooperative gameplay, turn-based interaction structure, and agile 3D maneuvering.
Fig. 2 Overview of the VolleyBots Testbed. VolleyBots comprises three key components: (1) Environment, built on Isaac Sim and PyTorch, which defines entities, observations, actions, and reward functions; (2) Tasks, including 3 single-agent tasks, 3 multi-agent cooperative tasks, and 3 multi-agent competitive tasks; (3) Algorithms, encompassing RL, MARL, and game-theoretic algorithms.
The overview of the VolleyBots testbed is shown in Fig. 2, while the main contributions of this work are summarized as follows:
We introduce VolleyBots, a novel robot sports environment centered on drone volleyball, featuring mixed competitive and cooperative game dynamics, turn-based interactions, and agile 3D maneuvering, while demanding both low-level motion control and high-level strategic play.
We release a curriculum of tasks, ranging from single-drone drills to multi-drone cooperative plays and competitive matchups, and baseline evaluations of representative MARL and game-theoretic algorithms, facilitating reproducible research and comparative assessments.
We design a hierarchical policy that achieves a 69.5% win rate against the strongest baseline in the 3 vs 3 task, offering a promising solution for tackling the complex interplay between low-level control and high-level strategy.
Inspired by the way humans progressively learn to play volleyball, we introduce a series of tasks that systematically assess both low-level motion control and high-level strategic play, as shown in Fig. 3.
Fig. 3 Proposed tasks in the VolleyBots testbed, inspired by the process of human learning in volleyball. Single-agent tasks evaluate low-level control, while multi-agent cooperative and competitive tasks integrate high-level decision-making with low-level control.
Single-Agent Tasks:
Back and Forth: The drone sprints between two designated points to complete as many round trips as possible within the time limit.
Hit the Ball: The ball is initialized directly above the drone, and the drone hits the ball once to make it land as far as possible.
Solo Bump: The ball is initialized directly above the drone, and the drone bumps the ball in place to a specific height as many times as possible within the time limit.
Multi-Agent Cooperation:
Bump and Pass: Two drones work together to bump and pass the ball to each other back and forth as many times as possible within the time limit.
Set and Spike (Easy): Two drones take on the role of a setter and an attacker. The setter passes the ball to the attacker, and the attacker then spikes the ball downward to the target region on the opposing side.
Set and Spike (Hard): Similar to the Set and Spike (Easy) task, two drones act as a setter and an attacker to set and spike the ball to the opposing side. The difference is that a rule-based defense board on the opposing side attempts to intercept the attacker's spike.
Multi-Agent Competition:
1 vs 1: One drone on each side competes against the other in a volleyball match and wins by landing the ball in the opponent's court. When the ball is on its side, the drone is allowed only one hit to return it to the opponent's court.
3 vs 3: Three drones on each side form a team to compete against the other team in a volleyball match. The drones on the same team cooperate to serve, pass, spike, and defend under the standard rule of three hits per side (sketched after this task list).
6 vs 6: Six drones per side form teams on a full-size court under the standard three-hits-per-side rule of real-world volleyball.
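The three-hits-per-side constraint in the 3 vs 3 and 6 vs 6 tasks can be tracked with a simple per-rally counter that resets whenever the ball crosses the net. The sketch below is only an illustration of such a rule check; the state layout and function names are hypothetical and do not reflect the testbed's actual rule engine.

```python
# Illustrative sketch of the three-hits-per-side rule used in the 3 vs 3 and
# 6 vs 6 tasks. NOT the VolleyBots implementation; names and state layout are
# hypothetical.
from dataclasses import dataclass


@dataclass
class RallyState:
    possession_side: int       # 0 or 1, side currently playing the ball
    hits_this_side: int = 0
    last_hitter: int = -1      # drone index of the previous touch, -1 if none


def register_hit(state: RallyState, side: int, drone_id: int) -> str:
    """Update the rally state after a touch and return 'ok' or a fault type."""
    if side != state.possession_side:
        # Ball crossed the net: possession changes and the hit counter resets.
        state.possession_side = side
        state.hits_this_side = 0
        state.last_hitter = -1
    if drone_id == state.last_hitter:
        return "double_hit"        # the same drone may not touch twice in a row
    state.hits_this_side += 1
    state.last_hitter = drone_id
    if state.hits_this_side > 3:
        return "four_hits"         # more than three touches on one side
    return "ok"
```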
Single-Agent Tasks:
We evaluate two RL algorithms, Deep Deterministic Policy Gradient (DDPG) and Proximal Policy Optimization (PPO), on the three single-agent tasks. We also compare their performance under different action spaces, namely Collective Thrust and Body Rates (CTBR) and Per-Rotor Thrust (PRT). Results averaged over 5 seeds are shown in Table 2.
Table 2 Benchmark results of single-agent tasks with different action spaces, including Collective Thrust and Body Rates (CTBR) and Per-Rotor Thrust (PRT). Back and Forth is evaluated by the number of round trips, Hit the Ball by the hitting distance, and Solo Bump by the number of bumps that reach a specified height.
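To clarify the difference between the two action spaces, the sketch below shows one conventional way a CTBR command (a collective thrust plus three body-rate targets) can be converted into four per-rotor thrusts through a proportional rate loop and a quadrotor mixing matrix. The gains, rotor geometry, and drag coefficient are illustrative assumptions, not the controller used in VolleyBots.

```python
# Illustrative sketch: converting a Collective Thrust and Body Rates (CTBR)
# action into Per-Rotor Thrust (PRT) commands for an X-configuration quadrotor.
# Gains, geometry, and drag coefficient are placeholder values.
import numpy as np

ARM = 0.08                                  # rotor offset along each body axis [m] (assumed)
KAPPA = 0.016                               # thrust-to-drag-torque coefficient (assumed)
RATE_GAIN = np.array([0.02, 0.02, 0.01])    # P-gains on body-rate error (assumed)

# Rotor layout: (x, y) position in the body frame and spin sign for yaw torque.
ROTORS = [(+ARM, -ARM, +1), (-ARM, +ARM, +1), (+ARM, +ARM, -1), (-ARM, -ARM, -1)]

# Allocation matrix A maps per-rotor thrusts f to [T, tau_x, tau_y, tau_z].
A = np.array([[1.0, 1.0, 1.0, 1.0],
              [y for (_, y, _) in ROTORS],
              [-x for (x, _, _) in ROTORS],
              [s * KAPPA for (_, _, s) in ROTORS]])


def ctbr_to_prt(thrust_cmd, rate_cmd, rate_meas):
    """Map a CTBR action to four rotor thrusts via a P rate loop and mixing."""
    torque = RATE_GAIN * (np.asarray(rate_cmd) - np.asarray(rate_meas))
    wrench = np.concatenate(([thrust_cmd], torque))
    f = np.linalg.solve(A, wrench)           # invert the allocation matrix
    return np.clip(f, 0.0, None)             # rotors cannot produce negative thrust
```

In this view, CTBR exposes a lower-dimensional, physically structured action space, whereas a PRT policy must learn the effect of each rotor implicitly.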
Multi-Agent Cooperation:
We evaluate four MARL algorithms, Multi-Agent DDPG (MADDPG), Multi-Agent PPO (MAPPO), Heterogeneous-Agent PPO (HAPPO), and Multi-Agent Transformer (MAT), on the three multi-agent cooperative tasks. We also compare their performance with and without reward shaping. Results averaged over 5 seeds are shown in Table 3.
Table 3 Benchmark results of multi-agent cooperative tasks under different reward settings, i.e., with and without reward shaping. Bump and Pass is evaluated by the number of bumps; Set and Spike (Easy) and Set and Spike (Hard) are evaluated by the success rate.
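As an illustration of the reward settings compared in Table 3, the sketch below contrasts a sparse success-only reward with a shaped reward for the Set and Spike task. The specific terms and weights are assumptions for exposition, not the reward functions used in the testbed.

```python
# Illustrative sketch of a sparse vs. shaped reward for the Set and Spike task.
# The terms and weights are assumptions, not the rewards used in VolleyBots.
import numpy as np


def sparse_reward(ball_landed_in_target: bool) -> float:
    """Success-only signal: 1 when the spiked ball lands in the target region."""
    return 1.0 if ball_landed_in_target else 0.0


def shaped_reward(ball_pos, target_pos, setter_touched, attacker_touched,
                  ball_landed_in_target) -> float:
    """Dense signal that also rewards intermediate progress of the rally."""
    r = 0.0
    r += 0.1 * float(setter_touched)                    # setter contacts the ball
    r += 0.2 * float(attacker_touched)                  # attacker contacts the ball
    r -= 0.01 * np.linalg.norm(np.asarray(ball_pos) -
                               np.asarray(target_pos))  # pull the ball toward the target
    r += 1.0 * float(ball_landed_in_target)             # terminal success bonus
    return r
```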
Multi-Agent Competition:
We evaluate four game-theoretic algorithms, Self-Play (SP), Fictitious Self-Play (FSP), and Policy-Space Response Oracles (PSRO) with a uniform meta-solver (PSRO_Uniform) and with a Nash meta-solver (PSRO_Nash), on the multi-agent competitive tasks. We measure their performance using approximate exploitability, the average win rate against the other learned policies, and Elo rating. The results are shown in Table 4.
Table 4 Benchmark results of the multi-agent competitive tasks, 1 vs 1 and 3 vs 3, under different evaluation metrics.
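For reference, Elo ratings of this kind can be computed from pairwise match outcomes with the standard update rule. The sketch below uses conventional constants (K = 32, initial rating 1000), which are assumptions rather than the paper's exact settings.

```python
# Minimal Elo-rating sketch from pairwise match results. The K-factor and
# initial rating are conventional choices, not necessarily those in the paper.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings: dict, a: str, b: str, score_a: float, k: float = 32.0):
    """Update both ratings after one match; score_a is 1 (win), 0.5 (draw), or 0."""
    e_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (score_a - e_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - e_a))


# Usage: rate the learned policies from round-robin matches.
ratings = {name: 1000.0 for name in ["SP", "FSP", "PSRO_Uniform", "PSRO_Nash"]}
match_results = []  # fill with (policy_a, policy_b, score_a) tuples from evaluation
for a, b, score_a in match_results:
    update_elo(ratings, a, b, score_a)
```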
Single-Agent Tasks (Policies are trained by PPO)
Multi-Agent Cooperative Tasks (Policies are trained by MAPPO)
Multi-Agent Competitive Tasks (1 vs 1 policy is trained by FSP, 3 vs 3 policy is trained by SP, 6 vs 6 policy is trained by PSRO_Nash)
In the 3 vs 3 task, policies learned from scratch exhibit only minimal progress, such as learning to serve the ball, and fail to produce further strategic behaviors. We therefore investigate hierarchical policies as a promising alternative.
We first employ the PPO algorithm to train a set of low-level skill policies, including Hover, Serve, Pass, Set, and Attack. We then design a rule-based high-level strategic policy that assigns low-level skills to each drone. We evaluate the average win rate over 1,000 episodes in which the hierarchical policy competes against the SP policy; the hierarchical policy achieves a significantly higher win rate of 86%.
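A minimal sketch of this hierarchical structure is given below, assuming a rule-based selector that maps the current rally phase to one of the pretrained skills for each drone. The phase logic and interfaces are simplified assumptions, not the paper's exact design.

```python
# Minimal sketch of a hierarchical policy: a rule-based high-level selector
# assigns one pretrained low-level skill (Hover, Serve, Pass, Set, Attack) to
# each drone at every step. Phase logic and interfaces are simplified
# assumptions, not the exact design used in the paper.

def select_skills(phase: str, closest_to_ball: int, num_drones: int = 3) -> list:
    """Return the skill assigned to each drone for the current rally phase."""
    skills = ["Hover"] * num_drones           # default: hold position
    if phase == "serve":
        skills[closest_to_ball] = "Serve"
    elif phase == "first_touch":
        skills[closest_to_ball] = "Pass"      # receive and pass toward the setter
    elif phase == "second_touch":
        skills[closest_to_ball] = "Set"       # set the ball for the attacker
    elif phase == "third_touch":
        skills[closest_to_ball] = "Attack"    # spike into the opponent's court
    return skills


def hierarchical_policy(obs, skill_policies, phase, closest_to_ball):
    """Dispatch each drone's observation to its assigned low-level skill policy."""
    assignment = select_skills(phase, closest_to_ball, num_drones=len(obs))
    return [skill_policies[name](o) for name, o in zip(assignment, obs)]
```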