||Thursday, August 15 (Symposium)
||Breakfast and Registration
Don Draper & Eitan Zahavi, General Chairs
||Host Opening Remarks
Mike Zeile, Intel
From Microns to Miles - The Broad Spectrum of Intel's Interconnect Technology Strategy
Uri Cummings, CTO of DCG Connectivity Group, Intel
The First Supercomputer with HyperX Topology: A Viable Alternative to Fat-Trees?
J.Domke, S. Matsuoka, I. R. Ivanov*, Y. Tsushima*, T. Yuki*, A. Nomura*, S. Miura*, N. McDonald†, D. L. Floyd†,
and N. Dubé†
The state-of-the-art topology for modern supercomputers is the Folded Clos network, a.k.a. Fat-Tree. The node count in these massively parallel systems is
steadily increasing. This forces an increased path length, which limits gains for latency-sensitive applications, because the port count of modern switches
cannot be scaled accordingly. Additionally, the deployment methodology for today's Fat-Trees requires the extensive use of costly active optical cables. A novel,
yet only theoretically investigated, alternative is the low-diameter HyperX. To perform a fair side-by-side comparison between a 3-level Fat-Tree and a 12x8 HyperX,
we constructed the world's first 3 Pflop/s supercomputer with these two networks. We show through a variety of benchmarks that the HyperX, together with our novel
communication pattern-aware routing, can challenge the performance of traditional Fat-Trees.
RIKEN Center for Computational Science (R-CCS), Japan
Tokyo Institute of Technology*, Japan
Hewlett Packard Enterprise (HPE)†, USA
Path2SL: Optimizing Head-of-Line Blocking Reduction in InfiniBand-based Fat-tree Networks
G. Maglione-Mathey, J. Escudero-Sahuquillo, P. J. Garcia, F. J. Quiles, and J. Duato
The interconnection network is a key element in high-performance computing (HPC) and Datacenter (DC) systems, as it must support the communication among the endnodes,
whose number constantly increases. Hence, guaranteeing suitable network performance is crucial, as otherwise the network would become the bottleneck of the entire system. Network
performance depends on several design issues: topology and routing, switch architecture, interconnect technology, etc. Among the available interconnect technologies, InfiniBand is
a prominent one. InfiniBand components and control software allow efficient topologies and routing algorithms to be implemented, as well as queuing schemes that reduce the Head-of-Line (HoL)
blocking effect derived from congestion situations. In this paper we present a new queuing scheme called Path2SL, which optimizes the use of the InfiniBand Virtual Lanes (VLs) to reduce
HoL blocking in Fat-Tree network topologies. We have implemented Path2SL in the control software of a real InfiniBand-based cluster. The experimental results obtained from real workloads
run in this cluster show that Path2SL is a more efficient queuing scheme than others previously proposed to deal with HoL blocking in the analyzed network configurations.
Universidad de Castilla-La Mancha, Spain
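The mechanism behind schemes of this kind, spreading traffic across the available InfiniBand Virtual Lanes so that flows bound for different destinations do not queue behind one another, can be pictured with a small sketch. This is purely illustrative: the function and names below are invented, not the paper's actual Path2SL algorithm.

```python
# Hypothetical sketch: assign each destination a Service Level (and thus
# a Virtual Lane) so that hot destinations are spread across the VLs
# instead of sharing one queue, which is what causes HoL blocking.

def assign_sls(destinations, num_vls):
    """Round-robin destinations over SLs to balance queue sharing."""
    sl_of = {}
    for i, dest in enumerate(sorted(destinations)):
        sl_of[dest] = i % num_vls  # each SL maps 1:1 onto a VL here
    return sl_of

sl_map = assign_sls(["node%02d" % n for n in range(8)], num_vls=4)
# Flows to node00 and node04 share SL 0; node01 and node05 share SL 1; ...
```

A real scheme would use topology knowledge (which subtree a destination's down-path crosses) rather than a plain round-robin, but the goal is the same: no single VL queue carries all the congested flows.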
High-Quality Fault-Resiliency in Fat-Tree Networks
J. Gliksberg, A. Capra*, A. Louvet*, P. J. Garcia†, and D. Sohier
Coupling regular topologies with optimized routing algorithms is key in pushing the performance of interconnection networks of HPC systems. In this paper we present Dmodc, a
fast deterministic routing algorithm for Parallel Generalized Fat-Trees (PGFTs) which minimizes congestion risk even under massive topology degradation caused by equipment failure.
It applies a modulo-based computation of forwarding tables among switches closer to the destination, using only knowledge of subtrees for pre-modulo division. Dmodc allows complete re-routing
of topologies with tens of thousands of nodes in less than a second, which greatly helps centralized fabric management react to faults with high-quality routing tables and no impact on running
applications in current and future very large-scale HPC clusters. We compare Dmodc against routing algorithms available in the InfiniBand control software (OpenSM), first for routing execution time
to show feasibility at scale, and then for congestion risk under degradation to demonstrate robustness. The latter comparison is done using static analysis of routing tables under random permutation
(RP), shift permutation (SP), and all-to-all (A2A) traffic patterns. Under heavy degradation, Dmodc's A2A and RP congestion risks are similar to those of the most stable algorithms compared, and its
SP congestion risk remains near-optimal up to 1% random degradation.
Versailles Saint-Quentin-en-Yvelines University (UVSQ), France
Universidad de Castilla-La Mancha†, Spain
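The appeal of modulo-based forwarding-table computation is that each entry is a closed-form function of the destination index, so an entire table can be recomputed in microseconds after a fault. A deliberately simplified sketch of arithmetic routing in a fat-tree (invented names and topology, not Dmodc itself):

```python
# Loose sketch of arithmetic routing in a k-ary fat-tree: a switch picks
# its uplink for a destination from the destination index alone, with no
# path search, so full tables can be rebuilt very quickly after a fault.

def uplink_port(dest_id, level, radix):
    """Uplink choice at a switch `level` hops below the root switches."""
    return (dest_id // radix**level) % radix

# With radix 4, destinations 0..63 spread evenly over the 4 uplinks:
ports = [uplink_port(d, level=1, radix=4) for d in range(64)]
assert all(ports.count(p) == 16 for p in range(4))
```

Dmodc's pre-modulo division over subtrees generalizes this idea to degraded (non-ideal) PGFTs; the sketch only shows why a closed-form rule makes sub-second full re-routing plausible at tens of thousands of nodes.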
Versal Network-on-Chip (NoC)
I. Swarbrick, D. Gaitonde, S. Ahmad, B. Jayadev, J. Cuppett, A. Morshed,
B. Gaide, and Y. Arbel
Xilinx Versal Adaptable Compute Acceleration Platform (ACAP) is a new software-programmable heterogeneous compute platform. The slowing of Moore's
law and the ever-present need for higher levels of compute performance have spurred the development of many domain-specific accelerator architectures.
ACAP devices are well suited to take advantage of this trend. They provide a combination of hardened heterogeneous compute and IO elements and programmable
logic. Programmable logic allows the accelerator to be customized in order to accelerate the whole application. The Versal Network-on-Chip (NoC) is a
programmable resource that interconnects all of these elements. This paper outlines the motivation for a hardened NoC within a programmable accelerator
platform and describes the Versal NoC.
Xilinx Inc, USA
Compute Express Link
S. Van Doren
High Capacity On-Package Physical Link Considerations
G. Taylor, R. Farjadrad*, and B. Vinnakota†
Multi-chiplet designs implement ASICs and other integrated products across multiple die within a single package. The ODSA group aims to define an open logical
interface such that chiplets from multiple vendors can be composed to form domain-specific accelerators. As a part of this effort, the ODSA surveyed and analyzed a wide
range of new inter-chiplet PHY technologies. This paper reports the results of the survey. We develop a framework to evaluate these PHY technologies. Based on our analysis,
we propose the use of an abstraction layer so that multiple PHY technologies can present a common interface.
AI-engines for Real-Time Classification of Data Flows for Optimal Routing in IP Networks
As more and more interconnect technologies are developed with a range of diverse
capabilities, optimal workload management requires the intelligent matching of data
traffic types with network capabilities on a dynamic basis. This, in turn, requires fast,
real-time classification of data traffic into useful "buckets." This talk introduces the AI-based
classification of data flows (as opposed to applications or "intent") by examining
the first few IP packet headers. These AI-engines can be tuned differently in different
parts of the network, or under different circumstances. The talk will cover use cases in
data center networks as well as wireless or wireline carrier networks.
Hus Tigli founded Xaxar, his fifth start-up, in 2018 and serves as its Chairman and
CEO. His previous four start-ups emerged as leaders in photonics, optical networking,
and mixed-signal ICs, and were either acquired by, or had their technologies licensed
to, industry giants.
Before starting his entrepreneurial career in 2000, Tigli ran businesses with revenues
from $5 million to $900 million at Raychem Corporation, a publicly traded innovator of
materials-science-based components for the electronics and telecom markets.
Tigli received his BS and MS in engineering at Columbia University and an MBA from
Heterogeneous Compute Elasticity: Computing Beyond the Box with PCIe Networking
The slowing down of Moore's-law coupled with the explosion of new storage and
compute intensive applications like deep learning and data analytics have created a
severe challenge for data center networking in the core and at the edge. In the past,
these problems were the domain of High Performance Computing but now enterprise,
edge and cloud segments are all impacted. Existing legacy networking technologies like
Ethernet, Fibre Channel, and InfiniBand are unable to deliver the required bandwidth and
latency performance. Meanwhile, PCIe, the ubiquitous connectivity solution
for compute and storage inside a server, has remained trapped inside the box, until now.
GigaIO's FabreX is a PCIe standards-based network that addresses these challenges.
In addition to providing unparalleled latency and bandwidth performance, the fabric can
natively support NVMe-oF and GDR devices, removing the extra overhead associated
with transferring data over another transport. We will present our S/W and H/W
architecture and support for memory semantic capability for emerging SCM. Measured
data from internal testing and from San Diego Supercomputer Center (SDSC) will be
presented to demonstrate the performance and efficiency benefits of FabreX.
Scott Taylor, GigaIO Networks
Scott Taylor has an extensive background in high speed networking, accelerators and
security from working at companies like Cray Research and Sun Microsystems.
Leveraging this background, he created the FabreX software architecture supporting
Redfish Composability Service, NVMe-oF, GPU Direct RDMA, accelerators, MPI and
TCP/IP all with a single PCI-compliant interconnect. He has built the engineering team
at GigaIO from the ground up to implement a singular vision of FabreX as an open
source, standards-based ecosystem. Scott's previous experience includes Prisa
Networks, a Fibre Channel startup, where he helped drive the shift from arbitrated-loop
to switch-based topologies. His many years working as an expert consultant help
him drive key intellectual property development at GigaIO. Scott holds a BS in computer
science from UC Santa Barbara.
Building Large Scale Data Centers: Cloud Network Design Best Practices
The talk examines the network design principles of large-scale cloud networks that allow
Cloud Service Providers to achieve throughputs in excess of 10 Pbps in a single data center.
Andy Bechtolsheim, Arista Networks
As Chief Development Officer, Andy Bechtolsheim is responsible for the overall product
development and technical direction of Arista Networks.
Previously Andy was a Founder and Chief System Architect at Sun Microsystems,
where most recently he was responsible for industry standard server architecture. Andy
was also a Founder and President of Granite Systems, a Gigabit Ethernet startup
acquired by Cisco Systems in 1996. From 1996 until 2003 Andy served as VP/GM of
the Gigabit Systems Business Unit at Cisco that developed the very successful Catalyst
4500 family of switches. Andy was also a Founder and President of Kealia, a next
generation server company acquired by Sun in 2004.
Andy received an M.S. in Computer Engineering from Carnegie Mellon University in
1976 and was a Ph.D. student at Stanford University from 1977 until 1982. He was
co-awarded the prestigious EY 2015 National "Entrepreneur of the Year" award for the USA.
Moderator: Uri Cummings, Intel
Data Center Transformation — How will the DataCenter look Different in 5 Years?
- How will system architecture change towards connectivity
- Edge Networking
- Scalable compute workloads in the cloud
- Serverless computing and its impact on networking
- I/O wall and shift towards optics
- Competing technologies (CXL vs CCIX vs TileLink vs NVLink)
Andreas Bechtolsheim, Arista
Hong Liu, Google
Michael Kagan, Mellanox
Tom Tofigh, QCT
||Friday, August 16 (Symposium)
||Breakfast and Registration
Rosetta: A 64-port Switch for Cray's Slingshot Interconnect
Steve Scott, Cray
|Links and FPGAs
Enabling Standalone FPGA Computing
J. Lant, J. Navaridas, A. Attwood, M. Lujan, and J. Goodacre
One of the key obstacles to the advancement of large-scale distributed FPGA platforms is the ability of the accelerator to act autonomously from the CPU,
whilst maintaining tight coupling to system memory. This work details our efforts in decoupling the networking capabilities of the FPGA from CPU resources using
a custom transport layer and network protocol. We highlight the reasons that previous solutions are insufficient for the requirements of HPC, and we show the performance
benefits of offloading our transport into the FPGA fabric. Our results show promising throughput and latency benefits, and show that competitive FLOPS are achievable for
network-dependent computing in a distributed environment.
The University of Manchester, UK
A Bunch of Wires (BoW) Interface for Inter-Chiplet Communication
R. Farjadrad and B. Vinnakota*
Multi-Chiplet system-in-package designs have recently received a lot of attention as a mechanism to combat high SoC design costs and to economically manufacture
large ASICs. Multi-Chiplet designs require low-power area-efficient inter-Chiplet communication. Current technologies either extend on-chip high-wire count buses
using silicon interposers or off-package serial buses over organic substrates. The former approach leads to expensive packaging; the latter, to complex design. We propose a
simple Bunch of Wires (BoW) interface that combines the ease of development of parallel interfaces with the low cost of organic substrates.
Demonstration of a Single-Lane 80 Gbps PAM-4 Full-Duplex Serial Link
S. Goyal, P. Agarwal, and S. Gupta
Serial interface standards, such as Thunderbolt, SATA, USB, and Ethernet, are constantly being upgraded to achieve higher-speed bi-directional data connectivity for a
wide range of applications. To meet this requirement, such links use multiple lanes, and therefore have to deal with near-end and far-end crosstalk, which makes their
implementation challenging. In addition, multiple pairs of wires make the interface cables bulky, and the use of hybrid transformers for self-interference (SI) cancellation
(as in the case of Ethernet links) makes their form factors unattractive.
In this work, we propose full-duplexing using broadband SI cancellation techniques to demonstrate a single-lane bi-directional PAM-4 link with an aggregate data rate of 80 Gbps.
The demonstrated link does not require any hybrid transformer for SI cancellation, and achieves a raw bit-error-rate (BER) of < 10^-9 over a 1 m long coaxial cable.
Indian Institute of Technology Bombay, India
|MPI and HPC
Improved MPI Multi-threaded performance using OFI Scalable Endpoints
A. Gopalakrishnan, M. Cabral, J. Erwin, and R. B. Ganapathi
Message Passing Interface (MPI) applications are launched as a set of parallel homogeneous processes, commonly with one to one mapping between MPI processes and compute
cores. With the growing complexity of MPI applications and compute node processors consisting of large numbers of cores, launching a small number of MPI processes with several
lightweight threads per process is becoming popular. Task-based programming models in combination with MPI also provide several benefits for the application to exploit intra-node
parallelism. A naïve implementation of MPI_THREAD_MULTIPLE can be expensive, with minimal or no performance benefits. We demonstrate a high-performance, end-to-end multi-threading
solution across the MPI application and MPI runtime, with threads mapping to hardware resources. We demonstrate our solution with Open MPI using Libfabric (a.k.a. OpenFabrics
Interfaces, OFI) and its Intel Omni-Path Performance Scaled Messaging 2 (PSM2) provider. Our tests with the Intel® MPI Benchmarks Multi Thread set (IMB-MT) show bandwidth improvement for
large message sizes when running with multiple threads. We also demonstrate up to 2.5x performance improvements with Baidu All-Reduce. Even though the experiments were run on
an Intel® Omni-Path Architecture fabric, the solution can be applied to other fabrics with the capability of allocating resources among multiple threads.
Designing Scalable and High-performance MPI Libraries on Amazon Elastic Fabric Adapter
S. Chakraborty, S. Xu, H. Subramoni, and D. K. Panda
Amazon has recently announced a new network interface named Elastic Fabric Adapter (EFA) targeted towards tightly coupled HPC workloads. In this paper, we characterize the features,
capabilities and performance of the adapter. We also explore how its transport models such as UD and SRD (Scalable Reliable Datagram) impact the design of high-performance MPI
libraries. Our evaluations show that hardware level reliability provided by SRD can significantly improve the performance of MPI communication. We also propose a new zero-copy transfer
mechanism over unreliable and orderless channels that can reduce the communication latency of large messages. The proposed design also shows significant improvement in
collective and application performance against the vendor provided MPI library.
The Ohio State University, USA
A Study of Network Congestion in Two Supercomputing High-Speed Interconnects
S. Jha, A. Patke, J. Brandt*, A. Gentile*, M. Showerman, E. Roman, Z. Kalbarczyk, B. Kramer,
and R. K. Iyer.
Network congestion in high-speed interconnects is a major source of application runtime performance variation. Recent years have witnessed a surge of interest from
both academia and industry in the development of novel approaches for congestion control at the network level and in application placement, mapping, and scheduling at the
system-level. However, these studies are based on proxy applications and benchmarks that are not representative of field-congestion characteristics of high-speed interconnects.
To address this gap, we present (a) an end-to-end framework for monitoring and analysis to support long-term field-congestion characterization studies, and (b) an empirical study of
network congestion in petascale systems across two different interconnect technologies: (i) Cray Gemini, which uses a 3-D torus topology, and (ii) Cray Aries, which uses the Dragonfly
topology.
University of Illinois at Urbana-Champaign, USA
Sandia National Lab*, USA
|Efficient Network Design and Use
Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects
A. A. Awan, A. Jain, C.-H. Chu, H. Subramoni, and D. K. Panda
Heterogeneous HPC systems with GPUs are increasingly getting equipped with on-node interconnects like PCIe and NVLink and inter-node interconnects like InfiniBand and Omni-Path. However,
the efficient exploitation of these interconnects brings forth many challenges for MPI+CUDA applications. Little exists in the literature that captures the impact of these interconnects on
emerging application areas like distributed Deep Learning (DL). In this paper, we choose Horovod, a distributed training middleware, to analyze and profile high-level application workloads (e.g.,
training ResNet-50) instead of MPI microbenchmarks. It is challenging to use existing profilers like mpiP and nvprof, as they only offer a black-box approach and cannot profile emerging communication
libraries like NCCL. To address this, we developed a profiler for Horovod that enables profiling of various communication primitives, including MPI_Allreduce and ncclAllReduce for gradient exchange, as well
as for Horovod's communication threads and response caches. We analyze the following metrics to gain insights into network-level performance on different interconnects: 1) message size with tensor
fusion, 2) message size without tensor fusion, 3) number of MPI and NCCL calls made for each message size, and 4) time taken by each NCCL and/or MPI call. We also correlate these low-level statistics
to higher-level end-to-end training metrics like images per second. Three key insights we gained are: 1) Horovod tensor fusion offers slight performance gains (up to 5%) for CPU-based training on
InfiniBand systems, 2) for GPU-based training, disabling tensor fusion improved performance (up to 17%) for GPUs connected with PCIe, and 3) the allreduce latency profiles show some extreme performance
variations for non-power-of-two message sizes for both CPUs and GPUs on all interconnects when tensor fusion is enabled. To provide a comprehensive view of performance, we use a wide variety of systems
with CPUs like Intel Skylake, AMD EPYC, and IBM POWER9, GPUs like Volta V100, and interconnects like PCIe, NVLink, InfiniBand, and Omni-Path.
The Ohio State University, USA
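Tensor fusion, whose on/off setting dominates the results above, amounts to packing many small gradient tensors into one buffer so that a single allreduce amortizes per-call latency. A minimal plain-Python stand-in (not Horovod's implementation; the simulated allreduce is invented for illustration):

```python
# Minimal stand-in for tensor fusion: pack small gradient tensors into
# one flat buffer, perform a single (simulated) allreduce, then unpack.

def fused_allreduce(tensors, allreduce):
    sizes = [len(t) for t in tensors]
    flat = [x for t in tensors for x in t]      # pack into fusion buffer
    reduced = allreduce(flat)                   # one call, not len(tensors)
    out, i = [], 0
    for n in sizes:                             # unpack per-tensor slices
        out.append(reduced[i:i + n])
        i += n
    return out

# Simulated 2-rank sum-allreduce: every element is simply doubled.
result = fused_allreduce([[1.0, 2.0], [3.0]], lambda v: [2 * x for x in v])
# result == [[2.0, 4.0], [6.0]]
```

The trade-off the paper measures follows directly from this picture: fusion trades extra copy/synchronization work for fewer, larger collective calls, which pays off on some interconnects and hurts on others.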
Lightweight, Packet-centric Monitoring of Network Traffic and Congestion Implemented in P4
P. Taffet and J. Mellor-Crummey
Communication cost is an important factor for distributed applications running in data centers. To improve communication performance, developers need tools that enable them
to measure and to understand how their application's communication patterns interact with the network, especially when those interactions result in congestion. This paper describes
a lightweight sampling-based technique for monitoring communication, in which switches help a packet collect information about the path it takes from source to destination and the congestion
it encounters along the way. This scheme has essentially no bandwidth overhead, as it stores only a few bits of information in the header of a monitored IP packet, making it practical to monitor
every packet. In our prior work, network simulations of large-scale tightly coupled HPC applications showed this approach can provide detailed information about traffic and congestion that is useful
for diagnosing a problem's root cause. Here, we describe an implementation of this scheme in P4 for data center networks and demonstrate its functionality with a basic experiment.
Rice University, USA
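The few-bits-per-packet idea can be modeled abstractly: each switch on the path folds a small amount of congestion state into spare header bits as the packet passes through. A toy Python model (illustrative only; the paper's actual P4 encoding differs):

```python
# Toy model of in-band telemetry in a few header bits: each hop
# contributes one congestion bit, so the receiver learns which hops
# of the path were congested without any extra telemetry packets.

def traverse(path, header_bits=0):
    """Fold each switch's congestion flag into the packet header."""
    for hop, switch in enumerate(path):
        if switch["queue_depth"] > switch["threshold"]:
            header_bits |= 1 << hop  # mark this hop as congested
    return header_bits

path = [
    {"queue_depth": 2, "threshold": 10},
    {"queue_depth": 40, "threshold": 10},  # congested hop
    {"queue_depth": 1, "threshold": 10},
]
bits = traverse(path)
# bits == 0b010: only the middle hop exceeded its queue threshold
```

Because the marks ride in the monitored packet itself, the receiver aggregates them per flow with no added traffic, which is what keeps the bandwidth overhead of the real scheme essentially zero.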
OmniXtend: Direct to Caches over Commodity Fabric
M. Radi, W. Terpstra*, P. Loewenstein, and D. Vucinic
There is a dearth of interfaces for efficient attachment of new kinds of non-volatile memory and purpose-built compute accelerators to processor pipelines. Early
integrated microprocessors exposed an off-chip front-side bus to which discrete memory and peripheral controllers could attach in a standardized fashion. With the advent
of symmetric multiprocessing and deep caches, this direct connection, together with memory controllers, has been implemented primarily using proprietary on-die technology.
Proprietary interconnects and protocols hinder architectural innovation and are at odds with the open nature of the rapidly growing RISC-V movement.
In this paper we introduce OmniXtend, a fully open coherence protocol meant to restore unrestricted interoperability of heterogeneous compute engines with a wide variety
of memory and storage technologies. OmniXtend supports a four-hop MESI protocol and is designed to take advantage of a new wave of Ethernet switches with stateful and programmable
data planes to facilitate system scalability. Ethernet transport was selected as a starting point for its ubiquity and historic resilience, to reduce barriers to entry at modern bandwidths
and latencies. Moreover, it allows us to build upon a vibrant ecosystem of hardware and IP, and to provide a boost to architectural innovation through the use of field-reconfigurable
networking hardware. We briefly discuss the protocol operation and show performance measurements of the first-ever NUMA RISC-V system prototype.
Western Digital, USA
Latency Critical Operation in Network Processors
S. Roy, A. Kaushik, R. Agrawal, J. Gergen, W. Rouwet, and J. Arends
This paper presents the recent advancements made on the Advanced-IO-Processor (AIOP), a Network Processor (NPU) architecture designed by NXP Semiconductors.
The base architecture consists of multi-tasking PowerPC processor cores combined with hardware accelerators for common packet processing functions. Each core is equipped with
dedicated hardware for rapid task scheduling and switching on every hardware accelerator call, thus providing very high throughput. A hardware pre-emption controller snoops on the
accelerator completions and sends task pre-emption requests to the cores. This reduces the latency of real-time tasks by quickly switching to the high priority task on the core without
any performance penalty. A novel concept of priority-thresholding is further used to avoid latency uncertainty on lower-priority tasks. The paper shows that these features make the
AIOP architecture very effective in handling the conflicting requirements of high throughput and low latency for next-generation wireless applications like WiFi (802.11ax) and 5G.
In the presence of frequent pre-emptions, throughput drops by only 3% on the AIOP, compared to 25% on optimized present-day NPU architectures. Further, the absolute throughput and
latency numbers are 2X better.
NXP Semiconductors, Netherlands