Technical Paper Session B: Communication Technologies

August 17 @ 12:45 pm - 1:45 pm UTC-7

12:45 – 13:15: Network Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries
Authors: Kaushik Kandadi Suresh, Kawthar Shafie Khorassani, Chen-Chun Chen, Bharath Ramesh, Mustafa Abduljabbar, Aamir Shafi, Hari Subramoni and Dhabaleswar Panda

The importance of GPUs in accelerating HPC applications is evident from the fact that a large number of supercomputing clusters are GPU-enabled. Many of these HPC applications use MPI as their programming model. These MPI applications often exchange data that is non-contiguous in GPU memory. MPI provides Derived Datatypes (DDTs) to represent such data. In the past, several researchers have proposed solutions to optimize these MPI DDT-based inter-node GPU exchanges. All of these solutions aim to reduce the overheads associated with the pack-unpack kernels that are used to facilitate the non-contiguous exchanges. Modern HCAs are capable of gathering/scattering data from/to non-contiguous GPU memory regions. In this work, we analyze the challenges in using the HCA's scatter/gather mechanism for GPU-based HPC workloads. We propose a low-overhead HCA-assisted scheme to improve the performance of GPU-based non-contiguous exchanges. We show that the proposed scheme provides up to a 2X improvement over existing pack-based schemes at the benchmark level. Furthermore, on the layouts used by the MILC, NASMG, and Specfem3D applications, we show that the proposed scheme outperforms state-of-the-art MPI libraries such as MVAPICH2-GDR and OpenMPI.
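As a rough illustration of the distinction the abstract draws, the following sketch models a strided (vector-like) layout in plain Python and contrasts a pack/unpack exchange with a scatter/gather list. It does not use MPI, a GPU, or an HCA, and it is not the authors' scheme; the helper names (`vector_layout`, `pack`, `unpack`, `gather_list`) are hypothetical and chosen only for this example.

```python
def vector_layout(count, blocklen, stride):
    """Describe `count` blocks of `blocklen` elements, `stride` elements apart.

    Returns a list of (offset, length) pairs, loosely modelled on what a
    vector-style derived datatype expresses.
    """
    return [(i * stride, blocklen) for i in range(count)]


def pack(buf, layout):
    """Pack-based path: copy every block into one contiguous staging buffer."""
    packed = []
    for offset, length in layout:
        packed.extend(buf[offset:offset + length])
    return packed


def unpack(packed, layout, dest):
    """Receiver side of the pack-based path: scatter the staging buffer back."""
    pos = 0
    for offset, length in layout:
        dest[offset:offset + length] = packed[pos:pos + length]
        pos += length
    return dest


def gather_list(layout, base=0):
    """Scatter/gather path: one (address, length) entry per block, no staging copy.

    `base` stands in for the start address of the buffer (an assumption of
    this sketch).
    """
    return [(base + offset, length) for offset, length in layout]


# One column of a 4x4 row-major matrix: 4 blocks of 1 element, stride 4.
matrix = list(range(16))
column = vector_layout(count=4, blocklen=1, stride=4)

packed = pack(matrix, column)                # extra copy into a contiguous buffer
restored = unpack(packed, column, [0] * 16)  # extra copy on the receiving side
entries = gather_list(column, base=100)      # descriptors only, data stays in place
```

The pack-based path costs one copy on each side, which is the overhead earlier work tries to reduce; the scatter/gather path instead hands the network a list of block descriptors.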

13:15 – 13:45: Updatable Packet Classification on FPGA with Bounded Worst-Case Performance
Authors: Yao Xin, Wenjun Li, Gaogang Xie, Yang Xu and Yi Wang

The FPGA has been recognized as an attractive accelerator for line-speed packet classification in SmartNICs, owing to its reconfigurability and massive parallelism. As a promising algorithmic approach that can fully exploit the characteristics of FPGAs, decision-tree-based packet classification on FPGA has been actively investigated in the past decade. However, most existing designs suffer from unbalanced tree structures with unpredictable depth under certain rule sets, so the potential of the FPGA cannot be fully exploited. Worse still, few of them support efficient on-the-fly rule updates, which are essential in virtualized data centers. To address these issues, we design and implement an efficient hardware architecture based on the recently proposed KickTree algorithm, which consists of multiple balanced trees with bounded depth. Overall, a multi-PE (processing element), parallel-search, serial-update strategy is adopted to decouple the search and update processes. The parsing of multiple tree search results adopts a modular and hierarchical design, supporting architectures with different numbers of trees. Additionally, incremental rule updates can be achieved simply by traversing all PEs in one pass, with little and bounded impact on rule searching. Experimental results on FPGA show that our design can achieve an average classification throughput of 182.6 MPPS and an average update throughput of 3.1 MUPS for various 100k-scale rule sets.
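The multi-PE, parallel-search, serial-update idea can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the KickTree algorithm or the authors' hardware design: each PE holds a flat, capacity-limited rule list as a stand-in for a bounded-depth tree, a larger `priority` value wins, and an update places the rule in the first PE with free space. All class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    rule_id: int
    priority: int                        # assumption: larger value wins
    ranges: Tuple[Tuple[int, int], ...]  # inclusive (lo, hi) per header field

    def matches(self, packet: Tuple[int, ...]) -> bool:
        return all(lo <= value <= hi
                   for value, (lo, hi) in zip(packet, self.ranges))


@dataclass
class PE:
    """One processing element; a flat rule list stands in for a bounded tree."""
    capacity: int
    rules: List[Rule] = field(default_factory=list)

    def search(self, packet: Tuple[int, ...]) -> Optional[Rule]:
        hits = [r for r in self.rules if r.matches(packet)]
        return max(hits, key=lambda r: r.priority, default=None)


class Classifier:
    def __init__(self, num_pes: int, capacity: int) -> None:
        self.pes = [PE(capacity) for _ in range(num_pes)]

    def insert(self, rule: Rule) -> bool:
        """Serial update: visit the PEs once, in order, and stop at the first fit."""
        for pe in self.pes:
            if len(pe.rules) < pe.capacity:
                pe.rules.append(rule)
                return True
        return False  # no PE had room

    def classify(self, packet: Tuple[int, ...]) -> Optional[Rule]:
        """Parallel search (sequential here), then merge results by priority."""
        results = [pe.search(packet) for pe in self.pes]
        hits = [r for r in results if r is not None]
        return max(hits, key=lambda r: r.priority, default=None)


clf = Classifier(num_pes=2, capacity=1)
clf.insert(Rule(1, 10, ((0, 100), (0, 100))))
clf.insert(Rule(2, 20, ((50, 60), (0, 100))))
best = clf.classify((55, 5))  # both rules match; rule 2 has the higher priority
```

Because each PE answers independently and the merge only compares one result per PE, search cost in this sketch is bounded by the per-PE capacity, while an update touches each PE at most once.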