🔬 Benchmarking of MLPerf on Intel Gaudi Systems

MLPerf [2] is a widely used benchmark suite designed to evaluate the performance of machine learning (ML) systems across a variety of workloads, covering both training and inference tasks. It has become a de facto standard in the ML community for assessing hardware and software performance. The Intel Gaudi architecture [1] is designed for high-performance deep learning workloads and represents Intel’s next-generation AI accelerator.

In this thesis, the student will benchmark MLPerf on Intel Gaudi systems, comparing the performance of different workloads across multiple configurations. This involves running MLPerf’s training and inference benchmarks on Intel Gaudi hardware, tuning the system for optimal performance, and analyzing the results to identify the strengths and weaknesses of Intel Gaudi for ML tasks. In addition, a comparison with NVIDIA H100/B100 systems is planned to evaluate how Intel Gaudi compares with other state-of-the-art AI accelerators.

The student will work on setting up the MLPerf benchmark suite, executing the benchmarks on the Gaudi systems, and interpreting the results. If time permits, additional optimizations and potential improvements to the benchmark execution could be explored, such as fine-tuning network configurations or optimizing software libraries.
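As a flavor of the result-interpretation step, the sketch below parses key metrics from an MLPerf LoadGen summary file. It assumes the `key : value` line format that LoadGen writes to `mlperf_log_summary.txt`; the exact keys and layout may differ between MLPerf versions, so treat this as a starting point rather than a finished tool.

```python
import re

def parse_mlperf_summary(text: str) -> dict:
    """Extract 'key : value' metric lines from an MLPerf LoadGen summary.

    Assumes the summary format of mlperf_log_summary.txt; adjust the
    pattern if your MLPerf version formats lines differently.
    """
    metrics = {}
    for line in text.splitlines():
        # Non-greedy match splits each line at the first colon.
        match = re.match(r"\s*(.+?)\s*:\s*(.+?)\s*$", line)
        if match:
            metrics[match.group(1)] = match.group(2)
    return metrics

# Synthetic example snippet (illustrative values, not real Gaudi results):
sample = """\
Scenario : Offline
Mode : PerformanceOnly
Samples per second: 123.45
Result is : VALID
"""
print(parse_mlperf_summary(sample)["Samples per second"])
```

Collecting metrics like this across runs and configurations makes it straightforward to tabulate and compare throughput between Gaudi and the reference systems.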

Skills required:

  • Familiarity with machine learning workflows and benchmarks
  • Experience with high-performance computing (HPC) systems
  • Basic knowledge of GPU/accelerator programming

[1] Intel Gaudi AI Accelerator
[2] MLPerf Benchmark

Approximate composition: 20% State of the art analysis, 20% Setup/Configuration, 60% Benchmarking/Experiments