Publications | Daniele De Sensi

Legend:

Conference/Workshop Journal arXiv/Poster/Other

2025

SC25
Bine Trees: Enhancing Collective Operations by Optimizing Communication Locality

Daniele De Sensi, Saverio Pasqualoni, Lorenzo Piarulli, and 5 more authors

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’25), Nov 2025

Abs Bib

As high-performance computing (HPC) systems grow, optimizing communication locality becomes essential for performance. HPC networks are often oversubscribed, consisting of fully connected groups that are sparsely connected. We introduce Binomial Negabinary (Bine) trees, a novel approach to enhance collective operations by reducing inter-group communication. They minimize the distance between communicating ranks, reducing traffic on global links and alleviating congestion. Unlike traditional hierarchical algorithms, Bine trees are topology-agnostic and do not assume a uniform partition of ranks, making them ideal for production supercomputers with irregular process allocations. We design algorithms for eight collectives, achieving up to 5x speedups and 33% less global traffic on four supercomputers with four different topologies. Our results emphasize their effectiveness in improving performance while reducing the load on global links.
@inproceedings{bine, author = {De Sensi, Daniele and Pasqualoni, Saverio and Piarulli, Lorenzo and Bonato, Tommaso and Ba, Seydou and Turisini, Matteo and Domke, Jens and Hoefler, Torsten}, title = {Bine Trees: Enhancing Collective Operations by Optimizing Communication Locality}, year = {2025}, month = nov, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'25)}, doi = {To Appear}, dimensions = {true}, }
SC25
Uno: A One-Stop Solution for Inter- and Intra-Data Center Congestion Control and Reliable Connectivity

Tommaso Bonato, Sepehr Abdous, Abdul Kabbani, and 11 more authors

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’25), Nov 2025

Abs Bib

Cloud computing and AI workloads are driving unprecedented demand for efficient communication within and across datacenters. However, the coexistence of intra- and inter-datacenter traffic within datacenters plus the disparity between the RTTs of intra- and inter-datacenter networks complicates congestion management and traffic routing. Particularly, faster congestion responses of intra-datacenter traffic causes rate unfairness when competing with slower inter-datacenter flows. Additionally, inter-datacenter messages suffer slow loss recovery and, thus, require reliability. Alas, existing solutions overlook these challenges and handle inter- and intra-datacenter congestion with separate control loops or at different granularities. We propose Uno, a unified system for both inter- and intra-DC environments that integrates a transport protocol for rapid congestion reaction and fair rate control with a load balancing scheme that combines erasure coding and adaptive routing. Our findings show that Uno significantly improves the completion times of both inter- and intra-DC flows compared to state-of-the-art methods such as Gemini.
@inproceedings{uno, author = {Bonato, Tommaso and Abdous, Sepehr and Kabbani, Abdul and Ghalayini, Ahmad and Gebara, Nadeen and Lam, Terry and Agarwal, Anup and Chen, Tiancheng and Yu, Zhuolong and Taranov, Konstantin and Elhaddad, Mahmoud and De Sensi, Daniele and Ghorbani, Soudeh and Hoefler, Torsten}, title = {Uno: A One-Stop Solution for Inter- and Intra-Data Center Congestion Control and Reliable Connectivity}, year = {2025}, month = nov, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'25)}, doi = {To Appear}, dimensions = {true}, }

DSD25

Pioneering the Future of Interconnects for Supercomputing and AI

Michele Martinelli, Roberto Ammendola, Andrea Biagioni, and 41 more authors

In Proceedings of the Euromicro Conference Series on Digital System Design (DSD’25), Sep 2025

Abs Bib

NET4EXA aims to develop a next-generation high-performance interconnect for HPC and AI systems, addressing the increasing demands of large-scale infrastructures, such as those required for training Large Language Models. Building upon the proven BXI (Bull eXascale Interconnect) European technology used in TOP15 supercomputers, NET4EXA will deliver the new BXI release, BXIv3, a complete hardware and software interconnect solution, including switch and network interface components. The project will integrate a fully functional pilot system at TRL 8, ready for deployment into upcoming exascale and post-exascale systems from 2025 onward. Leveraging prior research from European initiatives like RED-SEA, the previous achievements of consortium partners, and over 20 years of expertise from BULL, NET4EXA also lays the groundwork for the future generation of BXI, BXIv4, providing analysis and preliminary design. The project will use a hybrid development and co-design approach, combining commercial switch technology with custom IP and FPGA-based NICs. Performance of NET4EXA BXIv3 interconnection will be evaluated using a broad portfolio of benchmarks, scientific scalable applications and AI workloads.

@inproceedings{net4exa,
  author = {Martinelli, Michele and Ammendola, Roberto and Biagioni, Andrea and Chiarini, Carlotta and Frezza, Ottorino and Lo Cicero, Francesca and Lonardo, Alessandro and Paolucci, Pier Stanislao and Pastorelli, Elena and Perticaroli, Pierpaolo and Pontisso, Luca and Rossi, Cristian and Simula, Francesco and Vicini, Piero and Colin, David and Pichon, Grégoire and Louvet, Alexandre and Gliksberg, John and Chen, Claire and Turisini, Matteo and Monterubbiano, Andrea and Nominé, Jean-Philippe and Dutoit, Denis and Taboada, Hugo and Zaourar, Lilia and Benazouz, Mohamed and Bilas, Angelos and Chaix, Fabien and Katevenis, Manolis and Chrysos, Nikolaos and Mageiropoulos, Evangelos and Kozanitis, Christos and Moen, Thomas and Persvold, Steffen and Rustad, Einar and Fiore, Sandro and Pezzuto, Simone and Granelli, Fabrizio and Potestio, Raffaello and Tubiana, Luca and Velha, Philippe and Vella, Flavio and De Sensi, Daniele and Pontarelli, Salvatore},
  booktitle = {Proceedings of the Euromicro Conference Series on Digital System Design (DSD'25)},
  title = {{Pioneering the Future of Interconnects for Supercomputing and AI}},
  year = {2025},
  dimensions = {true},
  volume = {},
  issn = {},
  doi = {To Appear},
  url = {https://doi.ieeecomputersociety.org/10.1109/SEAA60479.2023.00068},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
  month = sep
}

2024

CACM
HammingMesh: A Network Topology for Large-Scale Deep Learning

Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, and 6 more authors

Communications of the ACM, Nov 2024

Online First

Abs Bib

Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the ongoing AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost with high job scheduling flexibility. Specifically, HammingMesh can support full bandwidth and isolation to deep learning training jobs with two dimensions of parallelism. Furthermore, it also supports high global bandwidth for generic traffic. Thus, HammingMesh will power future large-scale deep learning systems with extreme bandwidth requirements.
@article{10.1145/3623490, author = {Hoefler, Torsten and Bonato, Tommaso and De Sensi, Daniele and Di Girolamo, Salvatore and Li, Shigang and Heddes, Marco and Goel, Deepak and Castro, Miguel and Scott, Steve}, title = {HammingMesh: A Network Topology for Large-Scale Deep Learning}, year = {2024}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, issn = {0001-0782}, url = {https://doi.org/10.1145/3623490}, doi = {10.1145/3623490}, note = {Online First}, journal = {Communications of the ACM}, month = nov, numpages = {15}, dimensions = {true}, }
arXiv
The Landscape of GPU-Centric Communication

Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, and 4 more authors

In arXiv, Sep 2024

Abs arXiv Bib PDF

In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and cluster grows. Traditionally, the CPU managed multi-GPU communication, but advancements in GPU-centric communication now challenge this CPU dominance by reducing its involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches in multi-GPU communication and computation. This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library supports. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, their benefits, challenges, and performance insights. Then, it explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers insights on how to exploit multi-GPU systems at their best.
@inproceedings{gpucentriccomm, author = {Unat, Didem and Turimbetov, Ilyas and Kefah Taha Issa, Mohammed and Sagbili, Dogan and Vella, Flavio and De Sensi, Daniele and Ismayilov, Ismayil}, title = {The Landscape of GPU-Centric Communication}, booktitle = {arXiv}, year = {2024}, month = sep, dimensions = {true}, }
IXPUG24
Benchmarking Ethernet Interconnect for HPC/AI workloads

Lorenzo Pichetti, Daniele De Sensi, Karthee Sivalingam, and 6 more authors

In Proceedings of the Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’24 Workshops), Nov 2024

Abs Bib PDF

Interconnects have always played a cornerstone role in HPC. Since the inception of the Top500 ranking, interconnect statistics have been predominantly dominated by two compet- ing technologies: InfiniBand and Ethernet. However, even if Ethernet increased its popularity due to versatility and cost- effectiveness, InfiniBand used to provide higher bandwidth and continues to feature lower latency. Industry seeks for a further evolution of the Ethernet standards to enable fast and low- latency interconnect for emerging AI workloads by offering competitive, open-standard solutions. This paper analyzes the early results obtained from two systems relying on an HPC Ethernet interconnect, one relying on 100G and the other on 200G Ethernet. Preliminary findings indicate that the Ethernet-based networks exhibit competitive performance, closely aligning with InfiniBand, especially for large message exchanges.
@inproceedings{ethernethuawei, author = {Pichetti, Lorenzo and De Sensi, Daniele and Sivalingam, Karthee and Nassyr, Stepan and Cesarini, Daniele and Turisini, Matteo and Pleiter, Dirk and Artigiani, Aldo and Vella, Flavio}, title = {Benchmarking Ethernet Interconnect for HPC/AI workloads}, year = {2024}, month = nov, booktitle = {Proceedings of the Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'24 Workshops)}, doi = {10.1109/SCW63240.2024.00124}, dimensions = {true}, }
SC24
Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Daniele De Sensi, Lorenzo Pichetti, Flavio Vella, and 11 more authors

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’24), Nov 2024

Abs arXiv Bib PDF

Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers — Alps, Leonardo, and LUMI — each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing its limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization.
@inproceedings{gpugpuinterconnect, author = {De Sensi, Daniele and Pichetti, Lorenzo and Vella, Flavio and De Matteis, Tiziano and Ren, Zebin and Fusco, Luigi and Turisini, Matteo and Cesarini, Daniele and Lust, Kurt and Trivedi, Animesh and Roweth, Duncan and Spiga, Filippo and Di Girolamo, Salvatore and Hoefler, Torsten}, title = {Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects}, year = {2024}, month = nov, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'24)}, doi = {10.1109/SC41406.2024.00039}, dimensions = {true}, }
ATC24
OSMOSIS: Enabling Multi-Tenancy in Datacenter SmartNICs

Mikhail Khalilov, Marcin Chrapek, Siyuan Shen, and 7 more authors

In Proceedings of the 2024 USENIX Annual Technical Conference (ATC), 2024

Abs arXiv Bib

Multi-tenancy is essential for unleashing SmartNIC’s potential in datacenters. Our systematic analysis in this work shows that existing on-path SmartNICs have resource multiplexing limitations. For example, existing solutions lack multi-tenancy capabilities such as performance isolation and QoS provisioning for compute and IO resources. Compared to standard NIC data paths with a well-defined set of offloaded functions, unpredictable execution times of SmartNIC kernels make conventional approaches for multi-tenancy and QoS insufficient. We fill this gap with OSMOSIS, a SmartNICs resource manager co-design. OSMOSIS extends existing OS mechanisms to enable dynamic hardware resource multiplexing on top of the on-path packet processing data plane. We implement OSMOSIS within an open-source RISC-V-based 400Gbit/s SmartNIC. Our performance results demonstrate that OSMOSIS fully supports multi-tenancy and enables broader adoption of SmartNICs in datacenters with low overhead.
@inproceedings{osmosis, author = {Khalilov, Mikhail and Chrapek, Marcin and Shen, Siyuan and Benz, Thomas and Vezzu, Alessandro and Di Girolamo, Salvatore and Schneider, Timo and De Sensi, Daniele and Benini, Luca and Hoefler, Torsten}, title = {OSMOSIS: Enabling Multi-Tenancy in Datacenter SmartNICs}, booktitle = {Proceedings of the 2024 USENIX Annual Technical Conference (ATC)}, year = {2024}, address = {}, publisher = {}, month = {}, dimensions = {true}, }
HPDC24
Near-Optimal Wafer-Scale Reduce

Piotr Luczynski, Lukas Gianinazzi, Patrick Iff, and 3 more authors

In Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 2024

Abs arXiv Bib

Efficient Reduce and AllReduce communication collectives are a critical cornerstone of high-performance computing (HPC) applications. We present the first systematic investigation of Reduce and AllReduce on the Cerebras Wafer-Scale Engine (WSE). This architecture has been shown to achieve unprecedented performance both for machine learning workloads and other computational problems like FFT. We introduce a performance model to estimate the execution time of algorithms on the WSE and validate our predictions experimentally for a wide range of input sizes. In addition to existing implementations, we design and implement several new algorithms specifically tailored to the architecture. Moreover, we establish a lower bound for the runtime of a Reduce operation on the WSE. Based on our model, we automatically generate code that achieves near-optimal performance across the whole range of input sizes. Experiments demonstrate that our new Reduce and AllReduce algorithms outperform the current vendor solution by up to 3.27\texttimes. Additionally, our model predicts performance with less than 4% error. The proposed communication collectives increase the range of HPC applications that can benefit from the high throughput of the WSE. Our model-driven methodology demonstrates a disciplined approach that can lead the way to further algorithmic advancements on wafer-scale architectures.
@inproceedings{cerebras_reduce, author = {Luczynski, Piotr and Gianinazzi, Lukas and Iff, Patrick and Wilson, Leighton and De Sensi, Daniele and Hoefler, Torsten}, title = {Near-Optimal Wafer-Scale Reduce}, year = {2024}, isbn = {9798400704130}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3625549.3658693}, doi = {10.1145/3625549.3658693}, booktitle = {Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing}, pages = {334–347}, numpages = {14}, keywords = {communication collectives, message passing, reduction}, location = {Pisa, Italy}, series = {HPDC '24}, dimensions = {true}, }
NSDI24
A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network

Nils Blach, Maciej Besta, Daniele De Sensi, and 10 more authors

In 21th USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), Apr 2024

Abs arXiv Bib PDF

Novel low-diameter network topologies such as Slim Fly (SF) offer significant cost and power advantages over the established Fat Tree, Clos, or Dragonfly. To spearhead the adoption of low-diameter networks, we design, implement, deploy, and evaluate the first real-world SF installation. We focus on deployment, management, and operational aspects of our test cluster with 200 servers and carefully analyze performance. We demonstrate techniques for simple cabling and cabling validation as well as a novel high-performance routing architecture for InfiniBand-based low-diameter topologies. Our real-world benchmarks show SF’s strong performance for many modern workloads such as deep neural network training, graph analytics, or linear algebra kernels. SF outperforms non-blocking Fat Trees in scalability while offering comparable or better performance and lower cost for large network sizes. Our work can facilitate deploying SF while the associated (open-source)1 routing architecture is fully portable and applicable to accelerate any low-diameter interconnect.
@inproceedings{slimfly, author = {Blach, Nils and Besta, Maciej and De Sensi, Daniele and Domke, Jens and Harake, Hussein and Li, Shigang and Iff, Patric and Konieczny, Marek and Lakhotia, Kartik and Kubicek, Ales and Ferrari, Marcel and Petrini, Fabrizio and Hoefler, Torsten}, title = {A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network}, booktitle = {21th USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)}, year = {2024}, address = {Santa Clara, CA}, publisher = {USENIX Association}, month = apr, dimensions = {true}, }
NSDI24
Swing: Short-cutting Rings for Higher Bandwidth Allreduce

Daniele De Sensi, Tommaso Bonato, David Saam, and 1 more author

In 21th USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), Apr 2024

Abs arXiv Bib PDF Video

The allreduce collective operation accounts for a significant fraction of the runtime of workloads running on distributed systems. One factor determining its performance is the distance between communicating nodes, especially on networks like torus, where a higher distance implies multiple messages being forwarded on the same link, thus reducing the allreduce bandwidth. Torus networks are widely used on systems optimized for machine learning workloads (e.g., Google TPUs and Amazon Trainium devices), as well as on some of the Top500 supercomputers. To improve allreduce performance on torus networks we introduce Swing, a new algorithm that keeps a low distance between communicating nodes by swinging between torus directions. Our analysis and experimental evaluation show that Swing outperforms by up to 3x existing allreduce algorithms for vectors ranging from 32B to 128MiB, on different types of torus and torus-like topologies, regardless of their shape and size.
@inproceedings{swing, author = {De Sensi, Daniele and Bonato, Tommaso and Saam, David and Hoefler, Torsten}, title = {Swing: Short-cutting Rings for Higher Bandwidth Allreduce}, booktitle = {21th USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)}, year = {2024}, address = {Santa Clara, CA}, publisher = {USENIX Association}, month = apr, dimensions = {true}, }
FGCS
Canary: Congestion-aware in-network allreduce using dynamic trees

Daniele De Sensi, Edgar Costa Molero, Salvatore Di Girolamo, and 2 more authors

Future Generation Computer Systems, Apr 2024

Abs arXiv Bib PDF

The allreduce operation is an essential building block for many distributed applications, ranging from the training of deep learning models to scientific computing. In an allreduce operation, data from multiple hosts is aggregated together and then broadcasted to each host participating in the operation. Allreduce performance can be improved by a factor of two by aggregating the data directly in the network. Switches aggregate data coming from multiple ports before forwarding the partially aggregated result to the next hop. In all existing solutions, each switch needs to know the ports from which it will receive the data to aggregate. However, this forces packets to traverse a predefined set of switches, making these solutions prone to congestion. For this reason, we design Canary, the first congestion-aware in-network allreduce algorithm. Canary uses load balancing algorithms to forward packets on the least congested paths. Because switches do not know from which ports they will receive the data to aggregate, they use timeouts to aggregate the data in a best-effort way. We develop a P4 Canary prototype and evaluate it on a Tofino switch. We then validate Canary through simulations on large networks, showing performance improvements up to 40% compared to the state-of-the-art.
@article{canary, title = {Canary: Congestion-aware in-network allreduce using dynamic trees}, journal = {Future Generation Computer Systems}, volume = {152}, pages = {70-82}, year = {2024}, issn = {0167-739X}, doi = {https://doi.org/10.1016/j.future.2023.10.010}, url = {https://www.sciencedirect.com/science/article/pii/S0167739X23003850}, author = {{De Sensi}, Daniele and {Costa Molero}, Edgar and {Di Girolamo}, Salvatore and Vanbever, Laurent and Hoefler, Torsten}, keywords = {In-network compute, Allreduce, Load balancing}, dimensions = {true}, }

2022

SIGMETRICS23
Noise in the Clouds: Influence of Network Performance Variability on Application Scalability

Daniele De Sensi, Tiziano De Matteis, Konstantin Taranov, and 3 more authors

Proc. ACM Meas. Anal. Comput. Syst., Dec 2022

Abs arXiv Bib PDF Code Video

Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network transport and routing protocols, which may introduce \textitnetwork noise that can eventually limit the application scaling. This work analyzes network performance, scalability, and cost of running HPC workloads on cloud systems. First, we consider latency, bandwidth, and collective communication patterns in detailed small-scale measurements, and then we simulate network performance at a larger scale. We validate our approach on four popular cloud providers and three on-premise HPC systems, showing that network (and also OS) noise can significantly impact performance and cost both at small and large scale.
@article{cloudnoise, author = {De Sensi, Daniele and De Matteis, Tiziano and Taranov, Konstantin and Di Girolamo, Salvatore and Rahn, Tobias and Hoefler, Torsten}, title = {Noise in the Clouds: Influence of Network Performance Variability on Application Scalability}, year = {2022}, issue_date = {December 2022}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, volume = {6}, number = {3}, doi = {10.1145/3570609}, journal = {Proc. ACM Meas. Anal. Comput. Syst.}, month = dec, articleno = {49}, numpages = {27}, keywords = {cloud; HPC; network noise; scalability;}, eprint = {2210.15315}, dimensions = {true}, }
SC22
HammingMesh: A Network Topology for Large-Scale Deep Learning

Torsten Hoefler, Tommaso Bonato, Daniele De Sensi, and 7 more authors

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’22), Nov 2022

Abs arXiv Bib PDF Code Video Best Reproducibility Advancement Award

Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost with high job scheduling flexibility. Specifically, HammingMesh can support full bandwidth and isolation to deep learning training jobs with two dimensions of parallelism. Furthermore, it also supports high global bandwidth for generic traffic. Thus, HammingMesh will power future large-scale deep learning systems with extreme bandwidth requirements.
@inproceedings{hxmesh, author = {Hoefler, Torsten and Bonato, Tommaso and De Sensi, Daniele and Di Girolamo, Salvatore and Li, Shigang and Heddes, Marco and Belk, Jon and Goel, Deepak and Castro, Miguel and Scott, Steve}, title = {HammingMesh: A Network Topology for Large-Scale Deep Learning}, year = {2022}, month = nov, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'22)}, award = {Best Reproducibility Advancement Award}, doi = {10.1109/sc41404.2022.00016}, eprint = {2209.01346}, dimensions = {true}, }
SC22
Building Blocks for Network-Accelerated Distributed File Systems

Salvatore Di Girolamo, Daniele De Sensi, Konstantin Taranov, and 5 more authors

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’22), Nov 2022

Abs arXiv Bib PDF Code Video Best Paper Finalist

High-performance clusters and datacenters pose increasingly demanding requirements on storage systems. If these systems do not operate at scale, applications are doomed to become I/O bound and waste compute cycles. To accelerate the data path to remote storage nodes, remote direct memory access (RDMA) has been embraced by storage systems to let data flow from the network to storage targets, reducing overall latency and CPU utilization. Yet, this approach still involves CPUs on the data path to enforce storage policies such as authentication, replication, and erasure coding. We show how storage policies can be offloaded to fully programmable SmartNICs, without involving host CPUs. By using PsPIN, an open-hardware SmartNIC, we show latency improvements for writes (up to 2x), data replication (up to 2x), and erasure coding (up to 2x), when compared to respective CPU- and RDMA-based alternatives.
@inproceedings{spin-dfs, author = {Di Girolamo, Salvatore and De Sensi, Daniele and Taranov, Konstantin and Malesevic, Milos and Besta, Maciej and Schneider, Timo and Kistler, Severin and Hoefler, Torsten}, title = {Building Blocks for Network-Accelerated Distributed File Systems}, year = {2022}, month = nov, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'22)}, dimensions = {true}, award = {Best Paper Finalist} }
CCS22
NeVerMore: Exploiting RDMA Mistakes in NVMe-oF Storage Applications

Konstantin Taranov, Benjamin Rothenberger, Daniele De Sensi, and 2 more authors

In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (CCS ’22), Nov 2022

Abs arXiv Bib PDF Video Honorable Mention Award

This paper presents a security analysis of the InfiniBand architecture, a prevalent RDMA standard, and NVMe-over-Fabrics (NVMe-oF), a prominent protocol for industrial disaggregated storage that exploits RDMA protocols to achieve low-latency and high-bandwidth access to remote solid-state devices. Our work, NeVerMore, discovers new vulnerabilities in RDMA protocols that unveils several attack vectors on RDMA-enabled applications and the NVMe-oF protocol, showing that the current security mechanisms of the NVMe-oF protocol do not address the security vulnerabilities posed by the use of RDMA. In particular, we show how an unprivileged user can inject packets into any RDMA connection created on a local network controller, bypassing security mechanisms of the operating system and its kernel, and how the injection can be used to acquire unauthorized block access to NVMe-oF devices. Overall, we implement four attacks on RDMA protocols and seven attacks on the NVMe-oF protocol and verify them on the two most popular implementations of NVMe-oF: SPDK and the Linux kernel. To mitigate the discovered attacks we propose multiple mechanisms that can be implemented by RDMA and NVMe-oF providers.
@inproceedings{ccs2022, author = {Taranov, Konstantin and Rothenberger, Benjamin and De Sensi, Daniele and Perrig, Adrian and Hoefler, Torsten}, title = {NeVerMore: Exploiting RDMA Mistakes in NVMe-oF Storage Applications}, year = {2022}, month = nov, booktitle = {Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (CCS '22)}, award = {Honorable Mention Award}, dimensions = {true} }

2021

TPDS
Power LognRoll: Power-Efficient Localized Rollback for MPI Applications Using Message Logging Protocols

K. Dichev, D. De Sensi, D. S. Nikolopoulos, and 2 more authors

IEEE Transactions on Parallel & Distributed Systems, Aug 2021

Abs Bib

In fault tolerance for parallel and distributed systems, message logging protocols have played a prominent role in the last three decades. Such protocols enable local rollback to provide recovery from fail-stop errors. Global rollback techniques can be straightforward to implement but at times lead to slower recovery than local rollback. Local rollback is more complicated but can offer faster recovery times. In this work, we study the power and energy efficiency implications of global and local rollback. We propose a power-efficient version of local rollback to reduce power consumption for non-critical, blocked processes, using Dynamic Voltage and Frequency Scaling (DVFS) and clock modulation (CM). Our results for 3 different MPI codes on 2 parallel systems show that power-efficient local rollback reduces CPU energy waste up to 50% during the recovery phase, compared to existing global and local rollback techniques, without introducing significant overheads. Furthermore, we show that savings manifest for all blocked processes, which grow linearly with the process count. We estimate that for settings with high recovery overheads the total energy waste of parallel codes is reduced with the proposed local rollback.
@article{tpds2021, author = {Dichev, K. and De Sensi, D. and Nikolopoulos, D. S. and Cameron, K. W. and Spence, I.}, journal = {IEEE Transactions on Parallel & Distributed Systems}, title = {Power LognRoll: Power-Efficient Localized Rollback for MPI Applications Using Message Logging Protocols}, year = {2021}, volume = {}, number = {01}, issn = {1558-2183}, pages = {1-1}, keywords = {protocols;fault tolerant systems;fault tolerance;runtime;resilience;payloads;topology}, doi = {10.1109/TPDS.2021.3107745}, publisher = {IEEE Computer Society}, address = {Los Alamitos, CA, USA}, url = {https://www.computer.org/csdl/journal/td/5555/01/09524502/1wpqI1H5fqg}, month = aug, dimensions = {true} }
SC21
Flare: Flexible in-Network Allreduce

Daniele De Sensi, Salvatore Di Girolamo, Saleh Ashkboos, and 2 more authors

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Aug 2021

Abs arXiv Bib PDF Code Slides Video

The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, that aggregate the data received from the hosts, and send them back the aggregated result. However, existing solutions provide limited customization opportunities and might provide suboptimal performance when dealing with custom operators and data types, with sparse data, or when reproducibility of the aggregation is a concern. To deal with these problems, in this work we design a flexible programmable switch by using as a building block PsPIN, a RISC-V architecture implementing the sPIN programming model. We then design, model, and analyze different algorithms for executing the aggregation on this architecture, showing performance improvements compared to state-of-the-art approaches.
@inproceedings{flare, author = {De Sensi, Daniele and Di Girolamo, Salvatore and Ashkboos, Saleh and Li, Shigang and Hoefler, Torsten}, title = {Flare: Flexible in-Network Allreduce}, year = {2021}, isbn = {9781450384421}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3458817.3476178}, doi = {10.1145/3458817.3476178}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, articleno = {35}, numpages = {16}, keywords = {allreduce, programmable switch, in-network computing}, location = {St. Louis, Missouri}, series = {SC '21}, dimensions = {true}, }

2020

SC20
An In-Depth Analysis of the Slingshot Interconnect

Daniele De Sensi, Salvatore Di Girolamo, Kim H. McMahon, and 2 more authors

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Aug 2020

Abs arXiv Bib PDF Code Slides Video

The interconnect is one of the most critical components in large scale computing systems, and its impact on the performance of applications is going to increase with the system size. In this paper, we will describe Slingshot, an interconnection network for large scale computing systems. Slingshot is based on high-radix switches, which allow building exascale and hyper-scale datacenters networks with at most three switch-to-switch hops. Moreover, Slingshot provides efficient adaptive routing and congestion control algorithms, and highly tunable traffic classes. Slingshot uses an optimized Ethernet protocol, which allows it to be interoperable with standard Ethernet devices while providing high performance to HPC applications. We analyze the extent to which Slingshot provides these features, evaluating it on microbenchmarks and on several applications from the datacenter and AI worlds, as well as on HPC applications. We find that applications running on Slingshot are less affected by congestion compared to previous generation networks.
@inproceedings{sc2020, author = {De Sensi, Daniele and Di Girolamo, Salvatore and McMahon, Kim H. and Roweth, Duncan and Hoefler, Torsten}, title = {An In-Depth Analysis of the Slingshot Interconnect}, year = {2020}, isbn = {9781728199986}, url = {https://dl.acm.org/doi/abs/10.5555/3433701.3433747}, publisher = {IEEE Press}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, articleno = {35}, numpages = {14}, keywords = {dragonfly, exascale, interconnection network, congestion, datacenters}, location = {Atlanta, Georgia}, series = {SC '20}, doi = {10.1109/sc41405.2020.00039}, dimensions = {true}, }
Energies
Performance and Energy Trade-Offs for Parallel Applications on Heterogeneous Multi-Processing

Demetrios A. M. Coutinho, Daniele De Sensi, Arthur Francisco Lorenzon, and 4 more authors

Energies, Aug 2020

Abs Bib PDF

This work proposes a methodology to find performance and energy trade-offs for parallel applications running on single ISA Heterogeneous Multi-Processing (HMP) systems. These offer flexibility in the form of different core types and voltage and frequency pairings, defining a vast design space exploration. Therefore, for a given application, choosing a configuration to optimize the performance and energy consumption is not straightforward. Our method proposes novel analytic models for performance and power consumption whose parameters can be fitted using only a few sampled off-line measurements. These models are then used to estimate an application’s performance and energy consumption for the whole configuration space. In turn, these off-line predictions define the choice of estimated Pareto-optimal configurations of the model, which are used to inform the selection of the configuration that the application should be executed on. The methodology was validated on an ODROID-XU3 board for eight programs from the PARSEC Benchmark, Phoronix Test Suite and Rodinia applications. The generated Pareto-optimal configuration space represented an overall reduction of nearly 99% from the universe of all available configurations. Energy savings of up to 59.77%, 61.38% and 17.7% were observed when compared to the ’performance’, ’ondemand’ and ’powersave’ Linux governors, respectively, with better or similar performance.
@article{energies_dem, author = {Coutinho, Demetrios A. M. and De Sensi, Daniele and Lorenzon, Arthur Francisco and Georgiou, Kyriakos and Nunez-Yanez, Jose and Eder, Kerstin and Xavier-de-Souza, Samuel}, title = { Performance and Energy Trade-Offs for Parallel Applications on Heterogeneous Multi-Processing}, journal = {Energies}, year = {2020}, openaccess = {https://www.mdpi.com/1996-1073/13/9/2409}, doi = {10.3390/en13092409}, dimensions = {true}, }
Access
Truly Scalable K-Truss and Max-Truss Algorithms for Community Detection in Graphs

A. Conte, D. De Sensi, R. Grossi, and 2 more authors

IEEE Access, Aug 2020

Abs Bib

The notion of k-truss has been introduced a decade ago in social network analysis and security for community detection, as a form of cohesive subgraphs less stringent than a clique (set of pairwise linked nodes), and more selective than a k-core (induced subgraph with minimum degreek). A k-truss is an inclusion-maximal subgraph H in which each edge belongs to at least k−2 triangles inside H. The truss decomposition establishes, for each edge e, the maximum k for which e belongs to a k-truss. Analogously to the largest clique and to the maximum k-core, the strongest community for k-truss is the max-truss, which corresponds to the k-truss having the maximum k. Even though the computation of truss decomposition and of the max-truss takes polynomial time, on a large scale, it suffers from handling a potentially cubic number of wedges. In this paper, we provide a new algorithm FMT, which advances the state of the art on different sides: lower execution time, lower memory usage, and no need for expensive hardware. We compare FMT experimentally with the most recent state-of-the-art algorithms on a set of large real-world and synthetic networks with over a billion edges. The massive improvement allows FMT to compute the max-truss of networks of tens of billions of edges on a single standard server machine.
@article{9146824, author = {Conte, A. and De Sensi, D. and Grossi, R. and Marino, A. and Versari, L.}, journal = {IEEE Access}, title = {Truly Scalable K-Truss and Max-Truss Algorithms for Community Detection in Graphs}, year = {2020}, volume = {}, number = {}, pages = {1-14}, openaccess = {https://ieeexplore.ieee.org/document/9146824}, doi = {10.1109/ACCESS.2020.3011667}, issn = {2169-3536}, dimensions = {true} }
IJPP
Improving the performance of Actors on Multi-Cores with Parallel Patterns

Luca Rinaldi, Massimo Torquati, Daniele De Sensi, and 2 more authors

International Journal of Parallel Programming, Aug 2020

Abs Bib PDF

The Actor-based programming model is largely used in the context of distributed systems for its message-passing semantics and neat separation between the concurrency model and the underlying HW platform. However, in the context of a single multi-core node where the performance metric is the primary optimization objective, the ’pure’ Actor Model is generally not used because Actors cannot exploit the physical shared-memory, thus reducing the optimization options. In this work, we propose to enrich the Actor Model with some well-known Parallel Patterns to face the performance issues of using the ’pure’ Actor Model on a single multi-core platform. In the experimental study, conducted on two different multi-core systems by using the C++ Actor Framework (CAF), we considered a subset of the Parsec benchmarks and two Savina benchmarks. The analysis of results demonstrates that the Actor Model enriched with suitable Parallel Patterns implementations provides a robust abstraction layer capable of delivering performance results comparable with those of thread-based libraries (i.e. Pthreads and FastFlow) while offering a safer and versatile programming environment.
@article{ijpp_caf, author = {Rinaldi, Luca and Torquati, Massimo and De Sensi, Daniele and Mencagli, Gabriele and Danelutto, Marco}, title = {Improving the performance of Actors on Multi-Cores with Parallel Patterns}, journal = {International Journal of Parallel Programming}, year = {2020}, doi = {10.1007/s10766-020-00663-1}, url = {https://doi.org/10.1007/s10766-020-00663-1}, dimensions = {true}, }
PPAM
Application-Aware Power Capping Using Nornir

Daniele De Sensi, and Marco Danelutto

In Parallel Processing and Applied Mathematics, Aug 2020

Abs Bib PDF Slides

Power consumption of IT infrastructure is a major concern for data centre operators. Since data centres power supply is usually dimensioned for an average-case scenario, uncorrelated and simultaneous power spikes in multiple servers could lead to catastrophic effects such as power outages. To avoid such situations, power capping solutions are usually put in place by data centre operators, to control power consumption of individual server and to avoid the datacenter exceeding safe operational limits. However, most power capping solutions rely on Dynamic Voltage and Frequency Scaling (DVFS), which is not always able to guarantee the power cap specified by the user, especially for low power budget values. In this work, we propose a power-capping algorithm that uses a combination of DVFS and Thread Packing. We implement this algorithm in the Nornir framework and we validate it on some real applications by comparing it to the Intel RAPL power capping algorithm and another state of the art power capping algorithm.
@inproceedings{ppam2019, author = {De Sensi, Daniele and Danelutto, Marco}, editor = {Wyrzykowski, Roman and Deelman, Ewa and Dongarra, Jack and Karczewski, Konrad}, title = {Application-Aware Power Capping Using Nornir}, booktitle = {Parallel Processing and Applied Mathematics}, year = {2020}, publisher = {Springer International Publishing}, address = {Cham}, pages = {191--202}, isbn = {978-3-030-43222-5}, url = {https://link.springer.com/chapter/10.1007/978-3-030-43222-5_17}, doi = {10.1007/978-3-030-43222-5_17}, dimensions = {true}, }
Auto-DaSP19
Transparent Autonomicity for OpenMP Applications

Daniele De Sensi, and Marco Danelutto

In Euro-Par 2019: Parallel Processing Workshops, Aug 2020

Abs Bib PDF Slides

One of the key needs of an autonomic computing system is the ability to monitor the application performance with minimal intrusiveness and performance overhead. Several solutions have been proposed, differing in terms of effort required by the application programmers to add autonomic capabilities to their applications. In this work we extend the Nornir autonomic framework, allowing it to transparently monitor OpenMP applications thanks to the novel OpenMP Tools (OMPT) API. By using this interface, we are able to transparently transfer performance monitoring information from the application to the Nornir framework. This does not require any manual intervention by the programmer, which can seamlessly control an already existing application, enforcing any performance and/or power consumption requirement. We evaluate our approach on some real applications from the PARSEC and NAS benchmarks, showing that our solution introduces a negligible performance overhead, while being able to correctly control applications’ performance and power consumption.
@inproceedings{autodasp2019, author = {De Sensi, Daniele and Danelutto, Marco}, editor = {Schwardmann, Ulrich and Boehme, Christian and B. Heras, Dora and Cardellini, Valeria and Jeannot, Emmanuel and Salis, Antonio and Schifanella, Claudio and Manumachu, Ravi Reddy and Schwamborn, Dieter and Ricci, Laura and Sangyoon, Oh and Gruber, Thomas and Antonelli, Laura and Scott, Stephen L.}, title = {Transparent Autonomicity for OpenMP Applications}, booktitle = {Euro-Par 2019: Parallel Processing Workshops}, year = {2020}, publisher = {Springer International Publishing}, address = {Cham}, pages = {54--64}, isbn = {978-3-030-48340-1}, url = {https://link.springer.com/chapter/10.1007%2F978-3-030-48340-1_5}, doi = {10.1007/978-3-030-48340-1_5}, dimensions = {true}, }

2019

SC19
Mitigating Network Noise on Dragonfly Networks Through Application-aware Routing

Daniele De Sensi, Salvatore Di Girolamo, and Torsten Hoefler

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Aug 2019

Abs arXiv Bib PDF Code Slides Video

System noise can negatively impact the performance of HPC systems, and the interconnection network is one of the main factors contributing to this problem. To mitigate this effect, adaptive routing sends packets on non-minimal paths if they are less congested. However, while this may mitigate interference caused by congestion, it also generates more traffic since packets traverse additional hops, causing in turn congestion on other applications and on the application itself. In this paper, we first describe how to estimate network noise. By following these guidelines, we show how noise can be reduced by using routing algorithms which select minimal paths with a higher probability. We exploit this knowledge to design an algorithm which changes the probability of selecting minimal paths according to the application characteristics. We validate our solution on microbenchmarks and real-world applications on two systems relying on a Dragonfly interconnection network, showing noise reduction and performance improvement.
@inproceedings{sc2019, author = {De Sensi, Daniele and Di Girolamo, Salvatore and Hoefler, Torsten}, title = {Mitigating Network Noise on Dragonfly Networks Through Application-aware Routing}, booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis}, series = {SC '19}, year = {2019}, isbn = {978-1-4503-6229-0}, location = {Denver, Colorado}, pages = {16:1--16:32}, articleno = {16}, numpages = {32}, url = {http://doi.acm.org/10.1145/3295500.3356196}, doi = {10.1145/3295500.3356196}, acmid = {3356196}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {dragonfly, network noise, routing}, openaccess = {https://dl.acm.org/authorize?N690645}, dimensions = {true}, }
HPCS19
Autonomic Management Experiences in Structured Parallel Programming

Marco Danelutto, Daniele De Sensi, Gabriele Mencagli, and 1 more author

In High Performance Computing Simulation (HPCS), 2019 International Conference on, Jul 2019

Abs Bib PDF

Structured parallel programming models based on parallel design patterns are gaining more and more importance. Several state-of-the-art industrial frameworks build on the parallel design pattern concept, including Intel TBB and Microsoft PPL. In these frameworks, the explicit exposition of parallel structure of the application favours the identification of the inefficiencies, the exploitation of techniques increasing the efficiency of the implementation and ensures that most of the more critical aspects related to an efficient exploitation of the available parallelism are moved from application programmers to framework designers. The very same exposition of the graph representing the parallel activities enables framework designers to emplace efficient autonomic management of non functional concerns, such as performance tuning or power management. In this paper, we discuss how autonomic management features evolved in different structured parallel programming frameworks based on the algorithmic skeletons and parallel design patterns. We show that different levels of autonomic management are possible, ranging from simple provisioning of mechanisms suitable to support programmers in the implementation of ad hoc autonomic managers to the complete autonomic managers whose behaviour may be programmed using high level rules by the application programmers.
@inproceedings{hpcs2019, author = {Danelutto, Marco and De Sensi, Daniele and Mencagli, Gabriele and Torquati, Massimo}, booktitle = {High Performance Computing Simulation (HPCS), 2019 International Conference on}, title = {Autonomic Management Experiences in Structured Parallel Programming}, year = {2019}, month = jul, address = {Dublin, Ireland}, url = {https://ieeexplore.ieee.org/document/9188228}, doi = {10.1109/HPCS48598.2019.9188228}, pages = {336-343}, dimensions = {true}, }
JSC
Simplifying and implementing service level objectives for stream parallelism

Dalvan Griebler, Adriano Vogel, Daniele De Sensi, and 2 more authors

The Journal of Supercomputing, Jun 2019

Abs Bib PDF

An increasing attention has been given to provide service level objectives (SLOs) in stream processing applications due to the performance and energy requirements, and because of the need to impose limits in terms of resource usage while improving the system utilization. Since the current and next-generation computing systems are intrinsically offering parallel architectures, the software has to naturally exploit the architecture parallelism. Implement and meet SLOs on existing applications is not a trivial task for application programmers, since the software development process, besides the parallelism exploitation, requires the implementation of autonomic algorithms or strategies. This is a system-oriented programming approach and requires the management of multiple knobs and sensors (e.g., the number of threads to use, the clock frequency of the cores, etc.) so that the system can self-adapt at runtime. In this work, we introduce a new and simpler way to define SLO in the application’s source code, by abstracting from the programmer all the details relative to self-adaptive system implementation. The application programmer specifies which parts of the code to parallelize and the related SLOs that should be enforced. To reach this goal, source-to-source code transformation rules are implemented in our compiler, which automatically generates self-adaptive strategies to enforce, at runtime, the user-expressed objectives. The experiments highlighted promising results with simpler, effective, and efficient SLO implementations for real-world applications.
@article{Griebler2019, author = {Griebler, Dalvan and Vogel, Adriano and De Sensi, Daniele and Danelutto, Marco and Fernandes, Luiz G.}, title = {Simplifying and implementing service level objectives for stream parallelism}, journal = {The Journal of Supercomputing}, year = {2019}, month = jun, day = {05}, pages = {1-26}, issn = {1573-0484}, doi = {10.1007/s11227-019-02914-6}, url = {https://doi.org/10.1007/s11227-019-02914-6}, dimensions = {true}, }
Access
GASSER: An Auto-Tunable System for General Sliding-Window Streaming Operators on GPUs

Tiziano De Matteis, Gabriele Mencagli, Daniele De Sensi, and 2 more authors

IEEE Access, 2019

Abs Bib

Today’s stream processing systems handle high-volume data streams in an efficient manner. To achieve this goal, they are designed to scale out on large clusters of commodity machines. However, despite the efficient use of distributed architectures, they lack support to co-processors like graphical processing units (GPUs) ready to accelerate data-parallel tasks. The main reason for this lack of integration is that GPU processing and the streaming paradigm have different processing models, with GPUs needing a bulk of data present at once while the streaming paradigm advocates a tuple-at-a-time processing model. This paper contributes to fill this gap by proposing Gasser, a system for offloading the execution of sliding-window operators on GPUs. The system focuses on completely general functions by targeting the parallel processing of non-incremental queries that are not supported by the few existing GPU-based streaming prototypes. Furthermore, Gasser provides an auto-tuning approach able to automatically find the optimal value of the configuration parameters (i.e., batch length and the degree of parallelism) needed to optimize throughput and latency with the given query and data stream. The experimental part assesses the performance efficiency of Gasser by comparing its peak throughput and latency against Apache Flink, a popular and scalable streaming system. Furthermore, we evaluate the penalty induced by supporting completely general queries against the performance achieved by the state-of-the-art solution specifically optimized for incremental queries. Finally, we show the speed and accuracy of the auto-tuning approach adopted by Gasser, which is able to self-configure the system by finding the right configuration parameters without manual tuning by the users.
@article{8688411, author = {De Matteis, Tiziano and Mencagli, Gabriele and De Sensi, Daniele and Torquati, Massimo and Danelutto, Marco}, journal = {IEEE Access}, title = {GASSER: An Auto-Tunable System for General Sliding-Window Streaming Operators on GPUs}, year = {2019}, volume = {7}, number = {}, pages = {48753-48769}, keywords = {graphics processing units;optimisation;parallel processing;query processing;popular streaming system;scalable streaming system;completely general queries;incremental queries;GASSER;auto-tunable system;general sliding-window streaming operators;stream processing systems;high-volume data streams;commodity machines;distributed architectures;coprocessors;performance efficiency;data stream;parallelism;configuration parameters;optimal value;graphical processing units;data-parallel tasks;GPU processing;different processing models;streaming paradigm advocates;tuple-at-a-time processing model;sliding-window operators;completely general functions;parallel processing;nonincremental queries;existing GPU-based;auto-tuning approach;Microsoft Windows;Graphics processing units;Parallel processing;Throughput;Windows;Task analysis;Prototypes;Data stream processing;sliding-window queries;GPU processing;autotuning;self-configuring systems}, doi = {10.1109/ACCESS.2019.2910312}, issn = {2169-3536}, month = {}, openaccess = {https://ieeexplore.ieee.org/document/8688411}, dimensions = {true} }
Repara
Service Level Objectives via C++11 Attributes

Dalvan Griebler, Daniele De Sensi, Adriano Vogel, and 2 more authors

In Euro-Par 2018: Parallel Processing Workshops, 2019

Abs Bib PDF Slides

In recent years, increasing attention has been given to the possibility of guaranteeing Service Level Objectives (SLOs) to users about their applications, either regarding performance or power consumption. SLO can be implemented for parallel applications since they can provide many control knobs (e.g., the number of threads to use, the clock frequency of the cores, etc.) to tune the performance and power consumption of the application. Different from most of the existing approaches, we target sequential stream processing applications by proposing a solution based on C++ annotations. The user specifies which parts of the code to parallelize and what type of requirements should be enforced on that part of the code. Our solution first automatically parallelizes the annotated code and then applies self-adaptation approaches at run-time to enforce the user-expressed objectives. We ran experiments on different real-world applications, showing its simplicity and effectiveness.
@inproceedings{spar:nornir, author = {Griebler, Dalvan and De Sensi, Daniele and Vogel, Adriano and Danelutto, Marco and Fernandes, Luiz Gustavo}, title = {Service Level Objectives via C++11 Attributes}, booktitle = {Euro-Par 2018: Parallel Processing Workshops}, year = {2019}, publisher = {Springer International Publishing}, address = {Cham}, pages = {745--756}, isbn = {978-3-030-10549-5}, url = {https://link.springer.com/chapter/10.1007%2F978-3-030-10549-5_58}, dimensions = {true}, }
Auto-DaSP18
Autonomic and Latency-Aware Degree of Parallelism Management in SPar

Adriano Vogel, Dalvan Griebler, Daniele De Sensi, and 2 more authors

In Euro-Par 2018: Parallel Processing Workshops, 2019

Abs Bib PDF

Stream processing applications became a representative workload in current computing systems. A significant part of these applications demands parallelism to increase performance. However, programmers are often facing a trade-off between coding productivity and performance when introducing parallelism. SPar was created for balancing this trade-off to the application programmers by using the C++11 attributes’ annotation mechanism. In SPar and other programming frameworks for stream processing applications, the manual definition of the number of replicas to be used for the stream operators is a challenge. In addition to that, low latency is required by several stream processing applications. We noted that explicit latency requirements are poorly considered on the state-of-the-art parallel programming frameworks. Since there is a direct relationship between the number of replicas and the latency of the application, in this work we propose an autonomic and adaptive strategy to choose the proper number of replicas in SPar to address latency constraints. We experimentally evaluated our implemented strategy and demonstrated its effectiveness on a real-world application, demonstrating that our adaptive strategy can provide higher abstraction levels while automatically managing the latency.
@inproceedings{spar:latency, author = {Vogel, Adriano and Griebler, Dalvan and De Sensi, Daniele and Danelutto, Marco and Fernandes, Luiz Gustavo}, title = {Autonomic and Latency-Aware Degree of Parallelism Management in SPar}, booktitle = {Euro-Par 2018: Parallel Processing Workshops}, year = {2019}, publisher = {Springer International Publishing}, address = {Cham}, pages = {28--39}, isbn = {978-3-030-10549-5}, url = {https://link.springer.com/chapter/10.1007%2F978-3-030-10549-5_3}, dimensions = {true}, }

2018

HPEC
Discovering k-Trusses in Large-Scale Networks

Alessio Conte, Daniele De Sensi, Roberto Grossi, and 2 more authors

In 2018 IEEE High Performance extreme Computing Conference (HPEC), Sep 2018

Abs Bib PDF IEEE HPEC Graph Challenge Finalist

A k-truss is a subgraph where every edge belongs to at least k-2 triangles in the subgraph. The truss decomposition assigns each edge the maximum k for which the edge belongs to a k-truss, and the trussness of a graph is the maximum among its edges. Discovery algorithms for k-trusses and truss decomposition provide useful insight for graph analytics (such as community detection). Even though they take polynomial time, on massive networks they suffer from handling a potentially cubic number of wedges: algorithms either need a long time to recompute triangles several times, have high memory usage, or rely on the large number of cores on graphic units. In this paper we describe EXTRUS, a highly optimized algorithm for truss decomposition which outperforms existing algorithms. We then introduce a faster algorithm, HYBTRUS, which finds the trussness of a graph using less time and space than EXTRUSS. Our algorithms take the best of existing approaches having good performance, low memory usage, and no need for sophisticated hardware systems. We compare our algorithms with the state-of-the-art on a set of real-world and synthetic networks. EXTRUSS processes graphs with over a billion edges, which seems difficult for the competitors, and our HYBTRUSS is the first algorithm able to find the trussness of a graph with over 25 billion edges.
@inproceedings{8547735, author = {Conte, Alessio and De Sensi, Daniele and Grossi, Roberto and Marino, Andrea and Versari, Luca}, booktitle = {2018 IEEE High Performance extreme Computing Conference (HPEC)}, title = {Discovering k-Trusses in Large-Scale Networks}, year = {2018}, volume = {}, number = {}, pages = {1-6}, keywords = {graph theory;network theory (graphs);trussness;large-scale networks;subgraph;truss decomposition assigns;discovery algorithms;graph analytics;polynomial time;massive networks;high memory usage;EXTRUSS;k-trusses;HYBTRUSS;Approximation algorithms;Data structures;Image edge detection;Hardware;Indexes;Switches;Informatics;k-trusses;truss decomposition;graph algorithms;HPEC 2018 graph challenge}, doi = {10.1109/HPEC.2018.8547735}, issn = {2377-6943}, award = {IEEE HPEC Graph Challenge Finalist}, month = sep, url = {https://ieeexplore.ieee.org/document/8547735}, dimensions = {true}, }
FGCS
Simplifying self-adaptive and power-aware computing with Nornir

Daniele De Sensi, Tiziano De Matteis, and Marco Danelutto

Future Generation Computer Systems, Sep 2018

Abs Bib PDF

Self-adaptation is an emerging requirement in parallel computing. It enables the dynamic selection of resources toallocate to the application in order to meet performance and power consumption requirements. This is particularly relevant in Fog Applications, where data is generated by a number of devices at a varying rate, according to users’ activity. By dynamically selecting the appropriate number of resources it is possible, for example, to use at each time step the minimum amount of resources needed to process the incoming data. Implementing such kind of algorithms may be a complex task, due to low-level interactions with the underlying hardware and to non-intrusive and low-overhead monitoring of the applications. For these reasons, in this paper we propose Nornir, a C++-based framework, which can be used to enforce performance and power consumption constraints on parallel applications running on shared memory multicores. The framework can be easily customized by algorithm designers to implement new self-adaptive policies. By instrumenting the applications in the {PARSEC} benchmark, we provide to strategy designers a wide set of applications already interfaced to Nornir. In addition to this, to prove its flexibility, we implemented and compared several state-of-the-art existing policies, showing that Nornir can also be used to easily analyze different algorithms and to provide useful insights on them.
@article{nornir:fgcs18, title = {Simplifying self-adaptive and power-aware computing with Nornir}, journal = {Future Generation Computer Systems}, volume = {}, number = {}, pages = { - }, year = {2018}, note = {}, issn = {0167-739X}, doi = {https://doi.org/10.1016/j.future.2018.05.012}, url = {https://www.sciencedirect.com/science/article/pii/S0167739X17326699}, author = {De Sensi, Daniele and De Matteis, Tiziano and Danelutto, Marco}, keywords = {Self-adaptive, Power-aware, Quality of service, Data stream processing, Fog computing, Parallel computing}, dimensions = {true} }
KDD18
D2K: Scalable Community Detection in Massive Networks via Small-Diameter k-Plexes

Alessio Conte, Tiziano De Matteis, Daniele De Sensi, and 3 more authors

In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Sep 2018

Abs Bib PDF Video

This paper studies kplexes, a well known pseudo-clique model for network communities. In a kplex, each node can miss at most k-1 links. Our goal is to detect large communities in today’s real-world graphs which can have hundreds of millions of edges. While many have tried, this task has been elusive so far due to its computationally challenging nature: kplexes and other pseudo-cliques are harder to find and more numerous than cliques, a well known hard problem. We present D2K, which is the first algorithm able to find large kplexes of very large graphs in just a few minutes. The good performance of our algorithm follows from a combination of graph-theoretical concepts, careful algorithm engineering and a high-performance implementation. In particular, we exploit the low degeneracy of real-world graphs, and the fact that large enough kplexes have diameter 2. We validate a sequential and a parallel/distributed implementation of D2K on real graphs with up to half a billion edges.
@inproceedings{kdd:18, author = {Conte, Alessio and De Matteis, Tiziano and De Sensi, Daniele and Grossi, Roberto and Marino, Andrea and Versari, Luca}, title = {D2K: Scalable Community Detection in Massive Networks via Small-Diameter k-Plexes}, booktitle = {Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery \&\#38; Data Mining}, series = {KDD '18}, year = {2018}, isbn = {978-1-4503-5552-0}, location = {London, United Kingdom}, pages = {1272--1281}, numpages = {10}, url = {http://doi.acm.org/10.1145/3219819.3220093}, doi = {10.1145/3219819.3220093}, acmid = {3220093}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {community discovery, graph enumeration, k-plexes, parallel programming}, openaccess = {https://dl.acm.org/authorize?N666390}, dimensions = {true} }
CCPE
Power-aware pipelining with automatic concurrency control

Massimo Torquati, Daniele De Sensi, Gabriele Mencagli, and 2 more authors

Concurrency and Computation: Practice and Experience, Sep 2018

e4652 cpe.4652

Abs Bib PDF

Continuous streaming computations are usually composed of different modules, exchanging data through shared message queues. The selection of the algorithm used to access such queues (i.e. the "concurrency control") is a critical aspect both for performance and power consumption. In this paper we describe the design of automatic concurrency control algorithm for implementing power-efficient communications on shared-memory multicores. The algorithm automatically switches between "nonblocking" and "blocking" concurrency protocols, getting the best from the two worlds, i.e. obtaining the same throughput offered by the "nonblocking" implementation and the same power efficiency of the "blocking" concurrency protocol. We demonstrate the effectiveness of our approach using two micro-benchmarks and two real streaming applications.
@article{ccpe2018, author = {Torquati, Massimo and De Sensi, Daniele and Mencagli, Gabriele and Aldinucci, Marco and Danelutto, Marco}, title = {Power-aware pipelining with automatic concurrency control}, journal = {Concurrency and Computation: Practice and Experience}, volume = {0}, number = {0}, year = {2018}, pages = {e4652}, keywords = {blocking, concurrency control, data pipelining, data streams, multicores, power saving}, doi = {10.1002/cpe.4652}, url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.4652}, note = {e4652 cpe.4652}, dimensions = {true}, }
PDP18
Reducing Message Latency and CPU Utilization in the CAF Actor Framework

Massimo Torquati, Tullio Menga, Tiziano De Matteis, and 2 more authors

In 2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), Mar 2018

Abs Bib PDF

In this work, we consider the C++ Actor Framework (CAF), a recent proposal that revamped the interest in building concurrent and distributed applicaions using the actor programming model in C++. CAF has been optimized for high-throughput computing, whereas message latency between actors is greatly influenced by the message data rate: at low and moderate rates the latency is higher than at high data rates. To this end, we propose a modification of the polling strategies in the work-stealing CAF scheduler, which can reduce message latency at low and moderate data rates up to two orders of magnitude without compromising the overall throughput and message latency at maximum pressure. The technique proposed uses a lightweight event notification protocol that is general enough to be used used to optimize the runtime of other frameworks experiencing similar issues.
@inproceedings{cafpdp18, author = {Torquati, Massimo and Menga, Tullio and De Matteis, Tiziano and De Sensi, Daniele and Mencagli, Gabriele}, booktitle = {2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)}, title = {Reducing Message Latency and CPU Utilization in the CAF Actor Framework}, year = {2018}, volume = {}, number = {}, pages = {145-153}, keywords = {C++ languages;Computational modeling;Message systems;Power demand;Programming;Runtime;Throughput;Actor model;CAF;message latency;multi-cores;polling strategies;work-stealing}, doi = {10.1109/PDP2018.2018.00028}, issn = {}, month = mar, dimensions = {true}, }
Auto-DaSP17
Nornir: A Customizable Framework for Autonomic and Power-Aware Applications

Daniele De Sensi, Tiziano De Matteis, and Marco Danelutto

In Euro-Par 2017: Parallel Processing Workshops, Mar 2018

Abs Bib PDF Slides

A desirable characteristic of modern parallel applications is the ability to dynamically select the amount of resources to be used to meet requirements on performance or power consumption. In many cases, providing explicit guarantees on performance is of paramount importance. In streaming applications, this is related with the concept of elasticity, i.e. being able to allocate the proper amount of resources to match the current demand as closely as possible. Similarly, in other scenarios, it may be useful to limit the maximum power consumption of an application to do not exceed the power budget. In this paper we propose Nornir, a customizable C++ framework for autonomic and power-aware parallel applications on shared memory multicore machines. Nornir can be used by autonomic strategy designers to implement new algorithms and by application users to enforce requirements on applications.
@inproceedings{nornir:autodasp17, author = {De Sensi, Daniele and De Matteis, Tiziano and Danelutto, Marco}, editor = {Heras, Dora B. and Bouge, Luc}, title = {Nornir: A Customizable Framework for Autonomic and Power-Aware Applications}, booktitle = {Euro-Par 2017: Parallel Processing Workshops}, year = {2018}, publisher = {Springer International Publishing}, pages = {42--54}, isbn = {978-3-319-75178-8}, doi = {10.1007/978-3-319-75178-8_4}, url = {https://link.springer.com/chapter/10.1007/978-3-319-75178-8_42}, dimensions = {true}, }

2017

SC17

Nornir: A Power-Aware Runtime Support for Parallel Applications

Daniele De Sensi, Marco Danelutto, and Massimo Torquati

In Supercomputing Doctoral Showcase, Nov 2017

Bib Poster Slides

@inproceedings{scphd:17,
  address = {Denver, Colorado, US},
  author = {De Sensi, Daniele and Danelutto, Marco and Torquati, Massimo},
  booktitle = {Supercomputing Doctoral Showcase},
  title = {Nornir: A Power-Aware Runtime Support for Parallel Applications},
  year = {2017},
  month = nov,
  pages = {},
  dimensions = {true},
}

IJPP
The RePhrase Extended Pattern Set for Data Intensive Parallel Computing

Marco Danelutto, Tiziano De Matteis, Daniele De Sensi, and 4 more authors

International Journal of Parallel Programming, Nov 2017

Abs Bib

We discuss the extended parallel pattern set identified within the EU-funded project RePhrase as a candidate pattern set to support data intensive applications targeting heterogeneous architectures. The set has been designed to include three classes of pattern, namely (1) core patterns, modelling common, not necessarily data intensive parallelism exploitation patterns, usually to be used in composition; (2) high level patterns, modelling common, complex and complete parallelism exploitation patterns; and (3) building block patterns, modelling the single components of data intensive applications, suitable for use—in composition—to implement patterns not covered by the core and high level patterns. We discuss the expressive power of the RePhrase extended pattern set and results illustrating the performances that may be achieved with the FastFlow implementation of the high level patterns.
@article{rephrase:ijpp17, author = {Danelutto, Marco and De Matteis, Tiziano and De Sensi, Daniele and Mencagli, Gabriele and Torquati, Massimo and Aldinucci, Marco and Kilpatrick, Peter}, title = {The RePhrase Extended Pattern Set for Data Intensive Parallel Computing}, journal = {International Journal of Parallel Programming}, year = {2017}, month = nov, day = {28}, issn = {1573-7640}, doi = {10.1007/s10766-017-0540-z}, openaccess = {http://rdcu.be/zN6c}, url = {https://doi.org/10.1007/s10766-017-0540-z}, pages = {74–93}, dimensions = {true}, }
ParCo17
State-Aware Concurrency Throttling

Daniele De Sensi, Peter Kilpatrick, and Massimo Torquati

In Proceedings of International Parallel Computing Conference (ParCo), Nov 2017

Abs Bib PDF Slides

Reconfiguration of parallel applications has gained traction with the increasing emphasis on energy/performance trade-off. The ability to dynamically change the amount of resources used by an application allows reaction to changes in the environment, in the application behavior or in the user’s requirements. A popular technique consists in changing the number of threads used by the application (Dynamic Concurrency Throttling). Although this provides good control of application performance and power consumption, managing the technique can impose a significant burden on the application programmer, mainly due to state management and redistribution following the addition or removal of a thread. Nevertheless, some common state access patterns have been identified in some popular applications. By leveraging on this knowledge, we will describe how it is possible to simplify the state management procedures following a Concurrency Throttling operation.
@inproceedings{stateawarethrottling, address = {Bologna, Italy}, author = {De Sensi, Daniele and Kilpatrick, Peter and Torquati, Massimo}, booktitle = {Proceedings of International Parallel Computing Conference ({ParCo})}, keywords = {Power-Aware Computing, Concurrency Throttling, Data Stream Processing}, title = {State-Aware Concurrency Throttling}, pages = {201--210}, year = {2017}, doi = {10.3233/978-1-61499-843-3-201}, url = {http://ebooks.iospress.nl/volumearticle/48609}, dimensions = {true}, }
SoftwareX
Mammut: High-level management of system knobs and sensors

Daniele De Sensi, Massimo Torquati, and Marco Danelutto

SoftwareX, Jul 2017

Abs Bib

Managing low-level architectural features for controlling performance and power consumption is a growing demand in the parallel computing community. Such features include, but are not limited to: energy profiling, platform topology analysis, CPU cores disabling and frequency scaling. However, these low-level mechanisms are usually managed by specific tools, without any interaction between each other, thus hampering their usability. More important, most existing tools can only be used through a command line interface and they do not provide any API. Moreover, in most cases, they only allow monitoring and managing the same machine on which the tools are used. Mammut provides and integrates architectural management utilities through a high-level and easy-to-use object-oriented interface. By using Mammut, is possible to link together different collected information and to exploit them on both local and remote systems, to build architecture-aware applications.
@article{mammut:softwarex, title = {Mammut: High-level management of system knobs and sensors}, journal = {SoftwareX}, volume = {6}, number = {}, pages = {150 - 154}, year = {2017}, month = jul, note = {}, issn = {2352-7110}, doi = {http://dx.doi.org/10.1016/j.softx.2017.06.005}, url = {http://www.sciencedirect.com/science/article/pii/S2352711017300225}, openaccess = {http://www.sciencedirect.com/science/article/pii/S2352711017300225}, author = {De Sensi, Daniele and Torquati, Massimo and Danelutto, Marco}, keywords = {Remote management}, dimensions = {true}, }
TACO
Bringing Parallel Patterns Out of the Corner: The P^3ARSEC Benchmark Suite

Daniele De Sensi, Tiziano De Matteis, Massimo Torquati, and 2 more authors

ACM Trans. Archit. Code Optim., Oct 2017

Abs Bib PDF Poster

High-level parallel programming is an active research topic aimed at promoting parallel programming methodologies that provide the programmer with high-level abstractions to develop complex parallel software with reduced time to solution. Pattern-based parallel programming is based on a set of composable and customizable parallel patterns used as basic building blocks in parallel applications. In recent years, a considerable effort has been made in empowering this programming model with features able to overcome shortcomings of early approaches concerning flexibility and performance. In this article, we demonstrate that the approach is flexible and efficient enough by applying it on 12 out of 13 PARSEC applications. Our analysis, conducted on three different multicore architectures, demonstrates that pattern-based parallel programming has reached a good level of maturity, providing comparable results in terms of performance with respect to both other parallel programming methodologies based on pragma-based annotations (i.e., Openmp and OmpSs) and native implementations (i.e., Pthreads). Regarding the programming effort, we also demonstrate a considerable reduction in lines of code and code churn compared to Pthreads and comparable results with respect to other existing implementations.
@article{p3arsec:taco17, author = {De Sensi, Daniele and De Matteis, Tiziano and Torquati, Massimo and Mencagli, Gabriele and Danelutto, Marco}, title = {Bringing Parallel Patterns Out of the Corner: The P$^{3}$ARSEC Benchmark Suite}, journal = {ACM Trans. Archit. Code Optim.}, issue_date = {October 2017}, volume = {14}, number = {4}, month = oct, year = {2017}, issn = {1544-3566}, pages = {33:1--33:26}, articleno = {33}, numpages = {26}, url = {http://doi.acm.org/10.1145/3132710}, openaccess = {http://dl.acm.org/authorize?N49996}, doi = {10.1145/3132710}, acmid = {3132710}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {Parallel patterns, algorithmic skeletons, benchmarking, multicore programming, parsec}, dimensions = {true}, }
PPL
A Power-Aware, Self-Adaptive Macro Data Flow Framework

Marco Danelutto, Daniele De Sensi, and Massimo Torquati

Parallel Processing Letters, Mar 2017

Abs Bib PDF Slides

The dataflow programming model has been extensively used as an effective solution to implement efficient parallel programming frameworks. However, the amount of resources allocated to the runtime support is usually fixed once by the programmer or the runtime, and kept static during the entire execution. While there are cases where such a static choice may be appropriate, other scenarios may require to dynamically change the parallelism degree during the application execution. In this paper we propose an algorithm for multicore shared memory platforms, that dynamically selects the optimal number of cores to be used as well as their clock frequency according to either the workload pressure or to explicit user requirements. We implement the algorithm for both structured and unstructured parallel applications and we validate our proposal over three real applications, showing that it is able to save a significant amount of power, while not impairing the performance and not requiring additional effort from the application programmer.
@article{nornir:ppl17, author = {Danelutto, Marco and De Sensi, Daniele and Torquati, Massimo}, title = {A Power-Aware, Self-Adaptive Macro Data Flow Framework}, journal = {Parallel Processing Letters}, volume = {27}, number = {01}, pages = {1740004}, year = {2017}, month = mar, doi = {10.1142/S0129626417400047}, url = {http://www.worldscientific.com/doi/abs/10.1142/S0129626417400047}, dimensions = {true}, }
PDP17
Evaluating Concurrency Throttling and Thread Packing on SMT Multicores

Marco Danelutto, Tiziano De Matteis, Daniele De Sensi, and 1 more author

In Proceedings of the 25th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2017, Mar 2017

Abs Bib PDF Slides

Power-aware computing is gaining an increasing attention both in academic and industrial settings. The problem of guaranteeing a given QoS requirement (either in terms of performance or power consumption) can be faced by selecting and dynamically adapting the amount of physical and logical resources used by the application. In this study, we considered standard multicore platforms by taking as a reference approaches for power-aware computing two well-known dynamic reconfiguration techniques: Concurrency Throttling and Thread Packing. Furthermore, we also studied the impact of using simultaneous multithreading (e.g., Intel’s HyperThreading) in both techniques. In this work, leveraging on the applications of the PARSEC benchmark suite, we evaluate these techniques by considering performance-power trade-offs, resource efficiency, predictability and required programming effort. The results show that, according to the comparison criteria, these techniques complement each other.
@inproceedings{cttp:pdp17, author = {Danelutto, Marco and De Matteis, Tiziano and De Sensi, Daniele and Torquati, Massimo}, title = {Evaluating Concurrency Throttling and Thread Packing on SMT Multicores}, booktitle = {Proceedings of the 25th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, {PDP} 2017}, location = {St. Petersburg, Russia}, year = {2017}, doi = {10.1109/PDP.2017.39}, pages = {219-223}, url = {http://ieeexplore.ieee.org/document/7912648/}, dimensions = {true}, }
SAC17
P^3ARSEC: Towards Parallel Patterns Benchmarking

Marco Danelutto, Tiziano De Matteis, Daniele De Sensi, and 2 more authors

In Proceedings of the 32nd Annual ACM Symposium on Applied Computing, Mar 2017

Abs Bib PDF Slides

High-level parallel programming is a de-facto standard approach to develop parallel software with reduced time to development. High-level abstractions are provided by existing frameworks as pragma-based annotations in the source code, or through pre-built parallel patterns that recur frequently in parallel algorithms, and that can be easily instantiated by the programmer to add a structure to the development of parallel software. In this paper we focus on this second approach and we propose P3ARSEC, a benchmark suite for parallel pattern-based frameworks consisting of a representative subset of PARSEC applications. We analyse the programmability advantages and the potential performance penalty of using such high-level methodology with respect to hand-made parallelisations using low-level mechanisms. The results are obtained on the new Intel Knights Landing multicore, and show a significantly reduced code complexity with comparable performance.
@inproceedings{p3arsec:sac17, author = {Danelutto, Marco and De Matteis, Tiziano and De Sensi, Daniele and Mencagli, Gabriele and Torquati, Massimo}, title = {P$^{3}$ARSEC: Towards Parallel Patterns Benchmarking}, isbn = {978-1-4503-4486-9}, pages = {1582--1589}, numpages = {8}, doi = {10.1145/3019612.3019745}, acmid = {3019745}, booktitle = {Proceedings of the 32nd Annual ACM Symposium on Applied Computing}, series = {SAC '17}, year = {2017}, location = {Marrakesh, Morocco}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {Parallel Patterns, PARSEC Benchmarks, Intel KNL}, url = {http://doi.acm.org/10.1145/3019612.3019745}, openaccess = {http://dl.acm.org/authorize?N34889}, dimensions = {true}, }

2016

TACO
A Reconfiguration Algorithm for Power-Aware Parallel Applications

Daniele De Sensi, Massimo Torquati, and Marco Danelutto

ACM Transactions on Architecture and Code Optimization, Dec 2016

Abs Bib PDF Slides Video

In current computing systems, many applications require guarantees on their maximum power consumption to not exceed the available power budget. On the other hand, for some applications, it could be possible to decrease their performance, yet maintaining an acceptable level, in order to reduce their power consumption. To provide such guarantees, a possible solution consists in changing the number of cores assigned to the application, their clock frequency and the placement of application threads over the cores. However, power consumption and performance have different trends depending on the application considered and on its input. Finding a configuration of resources satisfying user requirements is in the general case a challenging task. In this paper we propose Nornir, an algorithm to automatically derive, without relying on historical data about previous executions, performance and power consumption models of an application in different configurations. By using these models, we are able to select a close to optimal configuration for the given user requirement, either performance or power consumption. The configuration of the application will be changed on-the-fly throughout the execution to adapt to workload fluctuations, external interferences and/or application’s phase changes. We validate the algorithm by simulating it over the applications of the PARSEC benchmark suite. Then, we implement our algorithm and we analyse its accuracy and overhead over some of these applications on a real execution environment. Eventually, we compare the quality of our proposal with that of the optimal algorithm and of some state of the art solutions.
@article{nornir:taco16, author = {De Sensi, Daniele and Torquati, Massimo and Danelutto, Marco}, title = {A Reconfiguration Algorithm for Power-Aware Parallel Applications}, journal = {ACM Transactions on Architecture and Code Optimization}, issue_date = {December 2016}, volume = {13}, number = {4}, month = dec, year = {2016}, issn = {1544-3566}, pages = {43:1--43:25}, articleno = {43}, numpages = {25}, url = {http://doi.acm.org/10.1145/3004054}, openaccess = {http://dl.acm.org/authorize?N34888}, doi = {10.1145/3004054}, acmid = {3004054}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {DVFS, Power-aware computing, dynamic concurrency throttling, multi-core, online learning, power capping, self-adaptive runtime}, dimensions = {true}, }
IJPP
Analysing Multiple QoS Attributes in Parallel Design Patterns-Based Applications

Antonio Brogi, Marco Danelutto, Daniele De Sensi, and 3 more authors

International Journal of Parallel Programming, Nov 2016

Abs Bib PDF

Parallel design patterns can be fruitfully combined to develop parallel software applications. Different combinations of patterns can feature different QoS while being functionally equivalent. To support application developers in selecting the best combinations of patterns to develop their applications, we hereby propose a probabilistic approach that permits analysing, at design time, multiple QoS attributes of parallel design patterns-based application. We also present a proof-of-concept implementation of our approach, together with some experimental results.
@article{pasa:ijpp16, author = {Brogi, Antonio and Danelutto, Marco and De Sensi, Daniele and Ibrahim, Ahmad and Soldani, Jacopo and Torquati, Massimo}, title = {Analysing Multiple QoS Attributes in Parallel Design Patterns-Based Applications}, journal = {International Journal of Parallel Programming}, year = {2016}, month = nov, pages = {1--20}, issn = {1573-7640}, doi = {10.1007/s10766-016-0476-8}, openaccess = {http://rdcu.be/yK2u}, url = {http://dx.doi.org/10.1007/s10766-016-0476-8}, dimensions = {true}, }
ACACES16
Power aware reconfigurations of parallel applications

Daniele De Sensi, Marco Danelutto, and Massimo Torquati

In Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES) – Poster abstracts, Jul 2016

Abs Bib PDF Poster

Current architectures provide many possibilities for the reduction of power consumption of applications, such as reducing the number of used cores or scaling down their frequency. However, the amount of resources allocated to an application is usually static and fixed by the programmer or by the runtime. While there are cases where such a static choice may be appropriate, other scenarios may require to dynamically change the amount of resources during the application execution. Choosing the right amount of resources to use in order to satisfy requirements on performance and/or power consumption is a complex task and testing all the possible configurations is an unfeasible solution since it would require too much time. We show some solutions to this problem that, by acting on the number of cores used by the application an on the frequency of these cores are able to provide guarantees on maximum power consumption or on a minimum performance level. We then outline the main results achieved by applying these techniques to some real applications.
@inproceedings{acaces:16, address = {Fiuggi, Italy}, author = {De Sensi, Daniele and Danelutto, Marco and Torquati, Massimo}, isbn = {978-88-905806-4-2}, keywords = {power aware; self adaptive; parallel applications;}, publisher = {HiPEAC}, pages = {141 -- 144}, title = {Power aware reconfigurations of parallel applications}, year = {2016}, month = jul, dimensions = {true}, }
PDP16
Predicting Performance and Power Consumption of Parallel Applications

Daniele De Sensi

In Proceedings of 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), Feb 2016

Abs Bib PDF Slides

Current architectures provide many control knobs for the reduction of power consumption of applications, like reducing the number of used cores or scaling down their frequency. However, choosing the right values for these knobs in order to satisfy requirements on performance and/or power consumption is a complex task and trying all the possible combinations of these values is an unfeasible solution since it would require too much time. For this reasons, there is the need for techniques that allow an accurate estimation of the performance and power consumption of an application when a specific configuration of the control knobs values is used. Usually, this is done by executing the application with different configurations and by using these information to predict its behaviour when the values of the knobs are changed. However, since this is a time consuming process, we would like to execute the application in the fewest number of configurations possible. In this work, we consider as control knobs the number of cores used by the application and the frequency of these cores. We show that on most Parsec benchmark programs, by executing the application in 1% of the total possible configurations and by applying a multiple linear regression model we are able to achieve an average accuracy of 96% in predicting its execution time and power consumption in all the other possible knobs combinations.
@inproceedings{models:pdp:16, author = {De Sensi, Daniele}, booktitle = {Proceedings of 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing ({PDP})}, title = {Predicting Performance and Power Consumption of Parallel Applications}, year = {2016}, pages = {200 -- 207}, url = {http://ieeexplore.ieee.org/document/7445331/}, keywords = {Power-aware computing; Regression analysis; PARSEC benchmark; Control knobs; DVFS; Predictive models; Concurrency throttling;}, doi = {10.1109/PDP.2016.41}, month = feb, dimensions = {true}, }

2015

PDP15
Energy driven adaptivity in stream parallel computations

Marco Danelutto, Daniele De Sensi, and Massimo Torquati

In Proceedings of 23th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), Feb 2015

Abs Bib PDF Slides

Determining the right amount of resources needed for a given computation is a critical problem. In many cases, computing systems are configured to use an amount of resources to manage high load peaks even though this cause energy waste when the resources are not fully utilised. To avoid this problem, adaptive approaches are used to dynamically increase/decrease computational resources depending on the real needs. A different approach based on Dynamic Voltage and Frequency Scaling (DVFS) is emerging as a possible alternative solution to reduce energy consumption of idle CPUs by lowering their frequencies. In this work, we propose to tackle the problem in stream parallel computations by using both the classic adaptivity concepts and the possibility provided by modern CPUs to dynamically change their frequency. We validate our approach showing a real network application that performs Deep Packet Inspection over network traffic. We are able to manage bandwidth changing over time, guaranteeing minimal packet loss during reconfiguration and minimal energy consumption.
@inproceedings{ff:energy:pdp:15, author = {Danelutto, Marco and De Sensi, Daniele and Torquati, Massimo}, title = {Energy driven adaptivity in stream parallel computations}, booktitle = {Proceedings of 23th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing ({PDP})}, year = {2015}, address = {Turku, Finland}, publisher = {IEEE}, pages = {103 -- 110}, doi = {10.1109/PDP.2015.92}, issn = {1066-6192}, url = {http://ieeexplore.ieee.org/document/7092707/}, date-added = {2015-02-28 10:59:38 +0000}, date-modified = {2015-02-28 11:01:23 +0000}, keywords = {fastflow, stream parallel, energy consumption, power-aware computing}, dimensions = {true}, }

2013

ParCo13
Deep Packet Inspection on Commodity Hardware using FastFlow

Marco Danelutto, Luca Deri, Daniele De Sensi, and 1 more author

In Proceedings of 15th International Parallel Computing Conference (ParCo), Feb 2013

Abs Bib PDF

The analysis of packet payload is mandatory for network security and traffic monitoring applications. The computational cost of this activity pushed the industry towards hardware-assisted deep packet inspection (DPI) that have the disadvantage of being more expensive and less flexible. This paper covers the design and implementation of a new DPI framework using FastFlow, a skeleton-based parallel programming library targeting efficient streaming on multi-core architectures. The experimental results demonstrate the efficiency of the DPI framework proposed, proving the feasibility to perform 10Gbit DPI analysis using modern commodity hardware.
@inproceedings{ff:DPI:14, author = {Danelutto, Marco and Deri, Luca and De Sensi, Daniele and Torquati, Massimo}, title = {Deep Packet Inspection on Commodity Hardware using FastFlow}, booktitle = {Proceedings of 15th International Parallel Computing Conference ({ParCo})}, year = {2013}, editor = {Bader, Michael and Bode, Arndt and Bungartz, Hans-Joachim and Gerndt, Michael and Joubert, Gerhard R. and Peters, Frans}, volume = {25}, series = {Advances in Parallel Computing}, pages = {92 -- 99}, address = {Munich, Germany}, publisher = {IOS Press}, doi = {10.3233/978-1-61499-381-0-92}, keywords = {fastflow, dpi, network monitoring}, url = {http://ebooks.iospress.nl/publication/35869}, dimensions = {true}, }

2011

ParCo11
Network Monitoring on Multicores with Algorithmic Skeletons

Marco Danelutto, Luca Deri, and Daniele De Sensi

In Proceedings of 14th Inernational Parallel Computing Conference (ParCo), Feb 2011

Abs Bib PDF Slides

Monitoring network traffic on 10 Gbit networks requires very efficient tools capable of exploiting modern multicore computing architectures. Specialized network cards can accelerate packet capture and thus reduce the processing overhead, but they can not achieve adequate packet analysis performance. For this reason most monitoring tools cannot cope with high network speeds. We describe the design and implementation of ffProbe, a network traffic monitoring application built on top of FastFlow, combined with several optimized parallel programming patterns. We compare ffProbe with two popular network monitoring probes. The results demonstrate that it can scale significantly better with number of cores and thus may be suitable for monitoring 10 Gbit networks using commodity servers.
@inproceedings{DBLP:conf/parco/DaneluttoDS11, author = {Danelutto, Marco and Deri, Luca and De Sensi, Daniele}, title = {Network Monitoring on Multicores with Algorithmic Skeletons}, booktitle = {Proceedings of 14th Inernational Parallel Computing Conference ({ParCo})}, pages = {519 -- 526}, year = {2011}, crossref = {DBLP:conf/parco/2011}, url = {http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.231.9143}, doi = {10.3233/978-1-61499-041-3-519}, timestamp = {Tue, 28 Apr 2015 15:53:30 +0200}, dimensions = {true}, }