Technical Paper Session C: Smart NICs
Session Chair: Arpan Jain (Microsoft)
Characterizing Lossy and Lossless Compression on Emerging BlueField DPU Architectures.
Authors: Yuke Li (UC Merced), Arjun Kashyap (UC Merced), Yanfei Guo (ANL) and Xiaoyi Lu (UC Merced).
Abstract: The Data Processing Unit (DPU) (i.e., a programmable SmartNIC with System-on-Chip or SoC cores) has emerged as a valuable supplementary resource to the host CPU. The DPU architecture has been attracting significant attention within High-Performance Computing (HPC) and data center clusters due to its advanced capabilities and accelerators, which include a hardware-based data compression engine. This positions the DPU as a prospective tool for accelerating and offloading compression workloads from the hosts, which can potentially speed up data-intensive applications. The convergence of Big Data, HPC, and Machine Learning (ML) systems has made large data volumes a major performance bottleneck in message communication and data storage. While compression can boost performance, recent studies reveal that compression techniques, both lossy and lossless, are compute-intensive and time-consuming, particularly with larger data sizes. Consequently, this paper characterizes the performance of three compression algorithms, one lossy (SZ3) and two lossless (DEFLATE and zlib), with seven real-world data sets on NVIDIA’s popular BlueField-2 DPUs to explore potential opportunities for offloading these workloads from the host. We find that, compared to the DPU’s SoC cores, the DPU’s hardware compression engine can achieve up to a 26.8x speedup. Furthermore, we discuss the challenges and opportunities associated with employing NVIDIA’s BlueField DPUs to accelerate lossy and lossless compression/decompression workloads. Our research yields five important takeaways that shed light on future research directions for lossy and lossless compression on DPUs.
Battle of the BlueFields: An In-Depth Comparison of the BlueField-2 and BlueField-3 SmartNICs.
Authors: Benjamin Michalowicz (OSU), Kaushik Kandadi Suresh (OSU), Hari Subramoni (OSU), Dhabaleswar Panda (OSU) and Stephen Poole (LANL).
Abstract: Over the past several years, Smart Network Interface Cards (SmartNICs) have rapidly grown in popularity. In particular, NVIDIA’s BlueField line of SmartNICs has proven effective in a wide variety of uses, including offloading communication in High-Performance Computing (HPC) applications and various stages of the Deep Learning (DL) pipeline, and is designed especially for datacenter and virtualization uses. The BlueField-3 DPU was released at the end of 2022 as a follow-up to its widely accepted BlueField-2 predecessor, and this work presents an in-depth performance comparison of the two, showing a) a comparison of both SmartNICs’ on-chip capabilities (memory bandwidth, compute speed, etc.), and b) their offload capabilities through a number of micro-benchmarks and applications. In single-DPU programs, we see up to 61% improvement in the latency of a memcpy operation and up to 82% bandwidth improvement in the STREAM benchmark [8] on the BlueField-3. With the use of a DPU-aware MPI library [1], we observe over 30% improvement at the benchmark level when comparing staging-based designs on both SmartNICs, and up to nearly double that in the context of an application. However, the GVMI (Guest Virtual Machine ID) based designs contained in said library improve by no more than 10% at the benchmark level and provide less than 2% benefit in applications because of their architecture-insensitive nature: while CPU clock speed may impact the completion time of instructions, the performance of the GVMI-based designs in a DPU-aware MPI library will largely be unaffected by swapping the BlueField-2 for a BlueField-3.