The allreduce operation is one of the most commonly used communication routines in distributed applications. In this operation, vectors coming from different nodes are aggregated element-wise (e.g., by summing elements), and the result is distributed back to all the nodes. Estimates suggest that up to 40% of the training time of large-scale machine learning models is spent performing this operation. To improve its bandwidth and the performance of applications using it, this operation can be accelerated by offloading it to network switches. Instead of sending data back and forth between nodes, data is aggregated directly inside the network switches. It has been shown that this can improve allreduce performance by up to 2x.
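For illustration, the semantics of an allreduce with summation can be sketched in plain Python. This is only a functional model of what the operation computes, not an actual distributed implementation; the function name is illustrative.

```python
def allreduce_sum(vectors):
    """Functional model of a sum-allreduce: aggregate per-node vectors
    element-wise, then give every node a copy of the result."""
    length = len(vectors[0])
    total = [sum(v[i] for v in vectors) for i in range(length)]
    # Every participating node receives the same aggregated vector.
    return [list(total) for _ in vectors]

# Example: three nodes each contribute a two-element vector.
results = allreduce_sum([[1, 2], [3, 4], [5, 6]])
# Each node now holds [9, 12].
```

In a real system the aggregation is performed collectively (e.g., via ring or tree algorithms), and with in-network aggregation the summation step is carried out by the switches themselves.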
In this thesis, the student will design more efficient in-network allreduce solutions and implement/evaluate them in a simulator (to be chosen together; possibilities include NS-3, htsim, AstraSim). The thesis will be tailored to the student's expertise, skills, and preferences.
Approximate composition: 10% State of the art analysis, 30% Theory/Design, 60% Implementation/Experiments
- HammingMesh: A Network Topology for Large-Scale Deep Learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'22), Nov 2022
- Flare: Flexible in-Network Allreduce. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'21), Nov 2021