The allreduce operation is one of the most commonly used communication routines in distributed applications. In this operation, vectors coming from different nodes are aggregated element-wise (e.g., by summing elements), and the result is distributed back to all the nodes. Estimates suggest that up to 40% of the training time of large-scale machine learning models is spent performing this operation. To improve its bandwidth and the performance of applications using it, this operation can be accelerated by offloading it to network switches. Instead of sending data back and forth between nodes, data is aggregated directly inside the network switches. It has been shown that this can improve allreduce performance by up to 2x.
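For illustration, the semantics of an allreduce with summation can be sketched in plain Python. This is only a functional model of what the operation computes, not an actual distributed implementation; the function name is illustrative.

```python
def allreduce_sum(vectors):
    """Functional model of a sum-allreduce: aggregate per-node vectors
    element-wise, then give every node a copy of the result."""
    length = len(vectors[0])
    total = [sum(v[i] for v in vectors) for i in range(length)]
    # Every participating node receives the same aggregated vector.
    return [list(total) for _ in vectors]

# Example: three nodes each contribute a two-element vector.
results = allreduce_sum([[1, 2], [3, 4], [5, 6]])
# Each node now holds [9, 12].
```

In a real system the aggregation is performed collectively (e.g., via ring or tree algorithms), and with in-network aggregation the summation step is carried out by the switches themselves.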
In this thesis, the student will design more efficient in-network allreduce solutions and implement/evaluate them in a simulator (to be chosen together; possibilities include NS-3, htsim, AstraSim). The thesis will be tailored to the student's expertise, skills, and preferences.
Approximate composition: 10% State of the art analysis, 30% Theory/Design, 60% Implementation/Experiments
- HammingMesh: A Network Topology for Large-Scale Deep Learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'22), Nov 2022
- Flare: Flexible in-Network Allreduce. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'21), Nov 2021