Implementation of the Swing Algorithm in NCCL/RCCL
The allreduce operation is one of the most widely used communication primitives in distributed applications, enabling element-wise aggregation of vectors across multiple nodes, with the result distributed back to all nodes. Up to 40% of the time spent training large-scale machine learning models is attributed to this operation [1].
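For concreteness, the snippet below is a minimal sketch of how an application invokes allreduce through NCCL from a single process driving all locally visible GPUs. Buffer sizes, the fixed array bounds, and the omitted error checking are illustrative simplifications, not part of any proposed design.

    /* Minimal sketch: element-wise sum across all GPUs visible to one
     * process; every GPU ends up with the reduced vector. Error checks
     * and buffer initialization are omitted for brevity. */
    #include <cuda_runtime.h>
    #include <nccl.h>
    #include <stdio.h>

    int main(void) {
      int nDev = 0;
      cudaGetDeviceCount(&nDev);
      if (nDev > 8) nDev = 8;                /* stay within the fixed arrays */

      int devs[8];
      ncclComm_t comms[8];
      cudaStream_t streams[8];
      float *sendbuff[8], *recvbuff[8];
      const size_t count = 1 << 20;          /* 1M floats per GPU */

      for (int i = 0; i < nDev; ++i) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaMalloc((void **)&sendbuff[i], count * sizeof(float));
        cudaMalloc((void **)&recvbuff[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
      }
      ncclCommInitAll(comms, nDev, devs);    /* one communicator per GPU */

      /* Group the per-GPU calls so NCCL launches them as one collective. */
      ncclGroupStart();
      for (int i = 0; i < nDev; ++i)
        ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
      ncclGroupEnd();

      for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
      }
      printf("allreduce of %zu floats on %d GPUs done\n", count, nDev);
      return 0;
    }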
To improve the bandwidth of the allreduce operation, the Swing algorithm has been proposed [2]; it leverages optimized communication patterns to achieve up to a 3x speedup on blocking networks. While Swing has been evaluated in simulation, its integration into GPU-centric libraries such as NCCL [3] and RCCL [4], the most widely used communication libraries for GPU-based distributed applications, remains unexplored.
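To give a flavour of why Swing "short-cuts" the ring, the sketch below computes each rank's peer at each step according to our reading of the schedule in [2]: at step s, rank r exchanges data with rank r + rho(s) if r is even and r - rho(s) if r is odd (modulo p), where rho(s) is the alternating sum of powers of -2. The helper names and the power-of-two ring size are illustrative assumptions, not taken from the paper's artifacts.

    /* Sketch of the Swing peer schedule as described in [2]: at step s,
     * rank r exchanges with rank r + rho(s) (r even) or r - rho(s) (r odd),
     * modulo p, where rho(s) = sum_{k=0}^{s} (-2)^k. */
    #include <stdio.h>

    static long rho(int s) {
      long acc = 0, term = 1;                /* term walks through (-2)^k */
      for (int k = 0; k <= s; ++k) { acc += term; term *= -2; }
      return acc;                            /* 1, -1, 3, -5, 11, ... */
    }

    static int swing_peer(int rank, int step, int p) {
      long d = (rank % 2 == 0) ? rho(step) : -rho(step);
      long peer = ((long)rank + d) % p;
      return (int)((peer + p) % p);          /* wrap into [0, p) */
    }

    int main(void) {
      const int p = 8;                       /* log2(p) = 3 exchange steps */
      for (int s = 0; s < 3; ++s)
        for (int r = 0; r < p; ++r)
          printf("step %d: rank %d <-> rank %d\n", s, r, swing_peer(r, s, p));
      return 0;
    }

Because rho(s) is always odd, a rank and its peer have opposite parity and therefore compute the same pair, so every step is a clean pairwise exchange whose distance grows while alternating direction around the ring.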
In this thesis, the student will implement the Swing algorithm in NCCL/RCCL to enable highly efficient GPU-based collective operations. The implementation will be evaluated on multiple testbeds, including a multi-node GPU cluster and, if access is available, supercomputers such as LUMI or Leonardo. Experiments will compare the performance of the new implementation against the algorithms already shipped with the libraries. A solid understanding of GPU programming (CUDA) and parallel programming is required.
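For the comparison with existing algorithms, one plausible setup is the nccl-tests benchmark suite together with the NCCL_ALGO environment variable, which forces NCCL onto a specific algorithm (e.g., Ring or Tree); the message-size sweep and GPU count below are illustrative choices, not requirements of the project:

    NCCL_ALGO=Ring ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

Running the same sweep with the Swing implementation enabled would produce directly comparable latency and bus-bandwidth figures.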
[1] Weiyang Wang et al. “TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs”, NSDI 2023.
[2] Daniele De Sensi, Tommaso Bonato, David Saam, and Torsten Hoefler. “Swing: Short-cutting Rings for Higher Bandwidth Allreduce”, NSDI 2024.
[3] NVIDIA Collective Communications Library (NCCL), https://github.com/NVIDIA/nccl
[4] AMD ROCm Communication Collectives Library (RCCL), https://github.com/ROCm/rccl
Approximate composition: 10% State-of-the-art analysis, 20% Theory/Design, 70% Implementation/Experiments