Implementation of the Swing Algorithm in NCCL/RCCL
The allreduce operation is one of the most widely used communication primitives in distributed applications, enabling element-wise aggregation of vectors across multiple nodes, with the result distributed back to all nodes. Up to 40% of the time spent training large-scale machine learning models is attributed to this operation [1].
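For concreteness, the snippet below is a minimal sketch of how an application invokes allreduce through NCCL from a single process driving all locally visible GPUs. Buffer sizes, the fixed array bounds, and the omitted error checking are illustrative simplifications, not part of any proposed design.

    /* Minimal sketch: element-wise sum across all GPUs visible to one
     * process; every GPU ends up with the reduced vector. Error checks
     * and buffer initialization are omitted for brevity. */
    #include <cuda_runtime.h>
    #include <nccl.h>
    #include <stdio.h>

    int main(void) {
      int nDev = 0;
      cudaGetDeviceCount(&nDev);
      if (nDev > 8) nDev = 8;                /* stay within the fixed arrays */

      int devs[8];
      ncclComm_t comms[8];
      cudaStream_t streams[8];
      float *sendbuff[8], *recvbuff[8];
      const size_t count = 1 << 20;          /* 1M floats per GPU */

      for (int i = 0; i < nDev; ++i) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaMalloc((void **)&sendbuff[i], count * sizeof(float));
        cudaMalloc((void **)&recvbuff[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
      }
      ncclCommInitAll(comms, nDev, devs);    /* one communicator per GPU */

      /* Group the per-GPU calls so NCCL launches them as one collective. */
      ncclGroupStart();
      for (int i = 0; i < nDev; ++i)
        ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
      ncclGroupEnd();

      for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
      }
      printf("allreduce of %zu floats on %d GPUs done\n", count, nDev);
      return 0;
    }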
To improve the bandwidth of the allreduce operation, the Swing algorithm has been proposed [2]; it leverages optimized communication patterns to achieve up to a 3x speedup on blocking networks. While Swing has been evaluated in simulation, its integration into GPU-centric libraries such as NCCL [3] and RCCL [4], the most widely used communication libraries for GPU-based distributed applications, remains unexplored.
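To give a flavour of why Swing "short-cuts" the ring, the sketch below computes each rank's peer at each step according to our reading of the schedule in [2]: at step s, rank r exchanges data with rank r + rho(s) if r is even and r - rho(s) if r is odd (modulo p), where rho(s) is the alternating sum of powers of -2. The helper names and the power-of-two ring size are illustrative assumptions, not taken from the paper's artifacts.

    /* Sketch of the Swing peer schedule as described in [2]: at step s,
     * rank r exchanges with rank r + rho(s) (r even) or r - rho(s) (r odd),
     * modulo p, where rho(s) = sum_{k=0}^{s} (-2)^k. */
    #include <stdio.h>

    static long rho(int s) {
      long acc = 0, term = 1;                /* term walks through (-2)^k */
      for (int k = 0; k <= s; ++k) { acc += term; term *= -2; }
      return acc;                            /* 1, -1, 3, -5, 11, ... */
    }

    static int swing_peer(int rank, int step, int p) {
      long d = (rank % 2 == 0) ? rho(step) : -rho(step);
      long peer = ((long)rank + d) % p;
      return (int)((peer + p) % p);          /* wrap into [0, p) */
    }

    int main(void) {
      const int p = 8;                       /* log2(p) = 3 exchange steps */
      for (int s = 0; s < 3; ++s)
        for (int r = 0; r < p; ++r)
          printf("step %d: rank %d <-> rank %d\n", s, r, swing_peer(r, s, p));
      return 0;
    }

Because rho(s) is always odd, a rank and its peer have opposite parity and therefore compute the same pair, so every step is a clean pairwise exchange whose distance grows while alternating direction around the ring.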
In this thesis, the student will implement the Swing algorithm in NCCL/RCCL to enable highly efficient GPU-based collective operations. The implementation will be evaluated on multiple testbeds, including a multi-node GPU cluster and, if access is available, supercomputers such as LUMI or Leonardo. Experiments will compare the performance of the new implementation against the algorithms already shipped with the libraries. A solid understanding of GPU programming (CUDA) and parallel programming is required.
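For the comparison with existing algorithms, one plausible setup is the nccl-tests benchmark suite together with the NCCL_ALGO environment variable, which forces NCCL onto a specific algorithm (e.g., Ring or Tree); the message-size sweep and GPU count below are illustrative choices, not requirements of the project:

    NCCL_ALGO=Ring ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

Running the same sweep with the Swing implementation enabled would produce directly comparable latency and bus-bandwidth figures.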
[1] Weiyang Wang et al. “TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs”, NSDI 2023.
[2] Daniele De Sensi, Tommaso Bonato, David Saam, and Torsten Hoefler. “Swing: Short-cutting Rings for Higher Bandwidth Allreduce”, NSDI 2024.
[3] NVIDIA Collective Communications Library (NCCL), https://github.com/NVIDIA/nccl
[4] AMD ROCm Communication Collectives Library (RCCL), https://github.com/ROCm/rccl
Approximate composition: 10% State-of-the-art analysis, 20% Theory/Design, 70% Implementation/Experiments