torchcomms is a new experimental communications API for PyTorch. It provides
both a high-level collectives API and several out-of-the-box backends.
Alternatively, you can build torchcomms from source. If you want to build the NCCLX backend, we recommend building it under a virtual conda environment.
Run the following commands to build and install torchcomms:
# Create a conda environment
conda create -n torchcomms python=3.10
conda activate torchcomms
# Clone the repository
git clone git@github.com:meta-pytorch/torchcomms.git
cd torchcomms
Build the backend (choose one based on your hardware):
No build needed - uses the library provided by PyTorch
If you want to install the third-party dependencies directly from conda, run the following command:
USE_SYSTEM_LIBS=1 ./build_ncclx.sh
If you want to build and install the third-party dependencies from source, run the following command:
In the example above, we perform the following steps:
new_comm() creates a communicator with the specified backend
Each process gets its unique rank and total world size
Each rank creates a tensor with rank-specific values
All tensors are summed across all ranks
Clean up communication resources
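The effect of the summation step can be checked by hand. The sketch below is a pure-Python simulation, not torchcomms code: `simulate_all_reduce_sum` is a hypothetical helper, and it assumes each rank fills its buffer with `rank + 1`, as the asynchronous example in this document does. It needs no GPU and no backend.

```python
# Pure-Python simulation of an AllReduce with SUM; no torchcomms or GPU needed.
# simulate_all_reduce_sum is a hypothetical helper, not part of torchcomms.

def simulate_all_reduce_sum(world_size, numel=4):
    """Return the buffer each rank would hold after a SUM AllReduce."""
    # Each rank fills its buffer with a rank-specific value (rank + 1).
    per_rank = [[float(rank + 1)] * numel for rank in range(world_size)]
    # Sum element-wise across ranks.
    reduced = [sum(column) for column in zip(*per_rank)]
    # AllReduce leaves every rank with an identical copy of the result.
    return [list(reduced) for _ in range(world_size)]


world_size = 4
results = simulate_all_reduce_sum(world_size)
# Closed form for 1 + 2 + ... + world_size.
expected = world_size * (world_size + 1) / 2
print(results[0], expected)
```

With four ranks, every element on every rank becomes 1 + 2 + 3 + 4 = 10, which matches the closed form `world_size * (world_size + 1) / 2`.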
torchcomms also supports asynchronous operations for better performance.
Here is the same example as above, but with asynchronous AllReduce:
import torch
from torchcomms import new_comm, ReduceOp

device = torch.device("cuda")
torchcomm = new_comm("nccl", device, name="main_comm")

rank = torchcomm.get_rank()
device_id = rank % torch.cuda.device_count()
target_device = torch.device(f"cuda:{device_id}")

# Create tensor
tensor = torch.full((1024,), float(rank + 1), dtype=torch.float32, device=target_device)

# Start async AllReduce
work = torchcomm.all_reduce(tensor, ReduceOp.SUM, async_op=True)

# Do other work while communication happens
print(f"Rank {rank}: Doing other work while AllReduce is in progress...")

# Wait for completion
work.wait()
print(f"Rank {rank}: AllReduce completed")

torchcomm.finalize()
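
The start, overlap, wait pattern can be tried without GPUs using a rough analogy built on Python's standard library. This is only an illustration: `simulated_all_reduce` and `overlap_demo` are hypothetical names, a thread stands in for the communication backend, and none of this reflects how torchcomms schedules work internally.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_all_reduce(values):
    """Stand-in for a collective: wait briefly, then return the sum."""
    time.sleep(0.05)  # pretend this is communication latency
    return sum(values)


def overlap_demo(world_size=4):
    # One value per simulated rank, mirroring float(rank + 1) above.
    per_rank_values = [float(rank + 1) for rank in range(world_size)]
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the "communication" without blocking (analogous to async_op=True).
        work = pool.submit(simulated_all_reduce, per_rank_values)
        # Independent work proceeds while the reduction is in flight.
        local_result = sum(i * i for i in range(1000))
        # Block until the reduction has finished (analogous to work.wait()).
        reduced = work.result()
    return local_result, reduced


local_result, reduced = overlap_demo()
print(local_result, reduced)
```

The point of the pattern is that the independent work must not read the buffer being reduced until the wait has returned.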
Source code is made available under a BSD 3 license; however, you may have other legal obligations that govern your use of other content linked in this repository, such as the license or terms of service for third-party data and models.
torchcomms backends include third-party source code that may be under other licenses.
Please check the directory and relevant files to verify the license.