
Technical Paper Session B: In-Network Optimizations

August 23 @ 12:30 pm - 1:30 pm UTC-7


Session Chair: Sourav Chakraborty (Samsung)

In-Network Compression for Accelerating IoT Analytics at Scale.

Authors: Rafael Oliveira and Ada Gavrilovska (GA Tech).

Abstract: To enable the Internet of Things (IoT) to scale to the level of next-generation smart cities and grids, cost-effective infrastructure is needed for hosting IoT analytics workloads. Offload and acceleration via SmartNICs have been shown to benefit these workloads. However, even with offload, long-term analysis of IoT data must still operate on a massive number of device updates, often in the form of small messages, and the ingestion of these updates continues to present server bottlenecks. In this paper, we present domain-specific compression and batching engines that leverage the unique properties of IoT messages to reduce the load on analytics servers and improve their scalability. Using a prototype system based on InnovaFlex programmable SmartNICs and several representative IoT benchmarks, we demonstrate that the combination of these techniques achieves up to a 14.5x improvement in sustained throughput over a system without SmartNIC offload, and up to a 7x improvement over existing offload approaches.
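The paper's engines run on the SmartNIC itself, but the core idea — domain-specific delta compression plus batching of many small device updates into one payload — can be sketched host-side in plain Python (all function names and the wire format here are hypothetical, not the authors' implementation):

```python
import struct

def batch_and_compress(readings, baseline):
    """Delta-encode small IoT updates against a per-device baseline and
    pack them into a single batched payload (hypothetical 4-byte format:
    2-byte device id + 2-byte signed delta, instead of one message each)."""
    payload = bytearray()
    for device_id, value in readings:
        delta = value - baseline.get(device_id, 0)
        baseline[device_id] = value          # baseline tracks the last value seen
        payload += struct.pack("<Hh", device_id, delta)
    return bytes(payload)

def decompress(payload, baseline):
    """Reverse the encoding on the server side, reconstructing full values."""
    readings = []
    for off in range(0, len(payload), 4):
        device_id, delta = struct.unpack_from("<Hh", payload, off)
        value = baseline.get(device_id, 0) + delta
        baseline[device_id] = value
        readings.append((device_id, value))
    return readings
```

Because consecutive sensor readings change slowly, the deltas stay small, and one batched payload replaces many per-update network round trips — the property the paper's NIC-resident engines exploit at line rate.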

Designing In-network Computing Aware Reduction Collectives in MPI.

Authors: Bharath Ramesh, Goutham Kalikrishna Reddy Kuncham, Kaushik Kandadi Suresh, Rahul Vaidya, Nawras Alnaasan, Mustafa Abduljabbar, Aamir Shafi, Hari Subramoni and Dhabaleswar Panda (OSU).

Abstract: The Message-Passing Interface (MPI) provides convenient abstractions such as MPI_Allreduce for inter-process collective reduction operations. With the advent of deep learning and large-scale HPC systems, it is increasingly important to optimize the latency of the MPI_Allreduce operation for large messages. Due to the amount of compute and communication involved in MPI_Allreduce, it is beneficial to offload collective computation/communication to the network, allowing the CPU to work on other important operations and providing maximal overlap/scalability. NVIDIA's HDR InfiniBand switches provide in-network computing features for this purpose using the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), with two protocols targeted at different message ranges: 1) Local Latency Tree (LLT) for small messages, and 2) Streaming Aggregation Tree (SAT) for large messages. In this paper, we first analyze the overheads involved in using SHARP-based reductions with SAT in an MPI library using micro-benchmarks. Next, we propose designs for large-message MPI_Allreduce that fully utilize the capabilities provided by the SHARP runtime while overcoming various bottlenecks. The efficacy of our proposed designs is demonstrated using micro-benchmark results. We observe up to 89% improvements over MVAPICH2-X and HPC-X for large message reductions.
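SHARP performs the reduction inside switch ASICs as data flows up an aggregation tree, then distributes the result back down. The aggregation-tree pattern behind an allreduce can be illustrated with a toy host-side sketch in Python (this is not the SHARP or MPI API, just the combining pattern):

```python
def tree_allreduce(rank_buffers, op=lambda a, b: a + b):
    """Toy allreduce: combine per-rank vectors pairwise up a binary tree
    (as switches would aggregate en route), then hand the fully reduced
    result back to every rank, mirroring MPI_Allreduce semantics."""
    level = list(rank_buffers)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            # element-wise reduction of two children into their parent
            nxt.append([op(x, y) for x, y in zip(level[i], level[i + 1])])
        if len(level) % 2:               # odd node out is forwarded unreduced
            nxt.append(level[-1])
        level = nxt
    # "broadcast" the root's result to all participating ranks
    return [list(level[0]) for _ in rank_buffers]
```

In the in-network version, each tree level is a switch rather than a host, so the CPU only injects its buffer and receives the result, which is what enables the overlap the paper exploits.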