Technical Paper Session C: Congestion Control
10:35 – 11:05: Improving Congestion Control through Fine-Grain Monitoring of InfiniBand Networks
Authors: Alberto Cascajo, Gabriel Gomez Lopez, Jesus Escudero-Sahuquillo, Pedro Javier Garcia, David E. Singh, Francisco Alfaro, Francisco J. Quiles and Jesus Carretero
Congestion is a serious threat to the performance of the interconnection networks of High-Performance Computing and Data-Center systems. Hence, the specifications of the main interconnect technologies, such as InfiniBand, define mechanisms to deal with congestion and its effects. However, these standard mechanisms may not be able to accurately detect or track the actual status of network congestion, as congestion dynamics can be complex and varied. Moreover, finding an optimal configuration of the parameters that drive the different functionalities of congestion-control mechanisms is often difficult, as a configuration that suits some traffic scenarios may not suit others.
In this paper, we propose combining an existing lightweight platform monitoring tool (LIMITLESS) with the InfiniBand control software (OpenSM), so that the metrics on network communication volumes provided by the former give the latter a more precise picture of congestion status, allowing it to react more efficiently in these situations. The main contributions of this paper are the methodology to link the monitor and OpenSM, as well as modifications to the InfiniBand standard congestion-control mechanism so that its reaction is modulated based on the enhanced knowledge about congestion provided by the monitor. These improvements are ready to be integrated into any InfiniBand-based system. According to the results of our experiments (performed in a real InfiniBand-based cluster running a widely used benchmark), the proposed improvements significantly reduce the number of false congestion detections, and therefore the number of times that the congestion-control mechanisms react unnecessarily, improving system performance.
11:05 – 11:35: Impact of RoCE Congestion Control Policies on Distributed Training of DNNs
Authors: Tarannum Khan, Saeed Rashidi, Pallavi Shurpali, Aditya Akella, Tushar Krishna and Srinivas Sridharan
RDMA over Converged Ethernet (RoCE) has gained significant traction in datacenter networks due to its compatibility with conventional Ethernet-based fabrics. However, the RDMA protocol is efficient only on (nearly) lossless networks, which makes congestion control especially important on RoCE networks. Unfortunately, the native RoCE congestion control scheme, which is based on Priority Flow Control (PFC), suffers from many drawbacks such as unfairness, head-of-line blocking, and deadlock. Therefore, in recent years many schemes have been proposed to provide additional congestion control for RoCE networks and minimize the drawbacks of PFC. However, these schemes were designed for general datacenter environments.
In contrast to general datacenters, which are built from commodity hardware and run general-purpose workloads, high-performance distributed training platforms deploy high-end accelerators and network components, and exclusively run training workloads that communicate through collective communication libraries (All-Reduce, All-To-All). Furthermore, these platforms usually have a private network that separates their communication traffic from the rest of the datacenter traffic. Scalable topology-aware collective algorithms are inherently designed to avoid incast patterns and to balance traffic in an optimal manner. These distinct features necessitate revisiting the congestion control schemes previously proposed for general-purpose datacenter environments. In this paper, we provide a thorough analysis of some of the state-of-the-art RoCE congestion control schemes (DCQCN, DCTCP, TIMELY, and HPCC) vs. PFC when running on distributed training platforms. Our results indicate that previously proposed RoCE congestion control schemes have little impact on the end-to-end performance of training workloads, motivating the design of an optimized, yet low-overhead, congestion control scheme based on the characteristics of distributed training platforms and workloads.
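The remark that topology-aware collectives avoid incast can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes the textbook ring All-Reduce (reduce-scatter followed by all-gather) and uses a hypothetical helper, `ring_all_reduce`, that also records the largest number of senders targeting a single receiver in any step.

```python
from collections import Counter


def ring_all_reduce(data):
    """Simulate a sum All-Reduce on a logical ring (illustrative sketch only).

    data[i] is node i's buffer, split into n chunks (one number per chunk).
    Returns (buffers after the collective, max fan-in observed in any step).
    """
    n = len(data)
    bufs = [list(row) for row in data]
    max_fan_in = 0

    # Phase 1, reduce-scatter: in each step node i sends one chunk to its
    # right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i - step) % n, bufs[i][(i - step) % n])
                 for i in range(n)]
        fan_in = Counter(dst for dst, _, _ in sends)
        max_fan_in = max(max_fan_in, max(fan_in.values()))
        for dst, chunk, value in sends:
            bufs[dst][chunk] += value

    # Phase 2, all-gather: fully reduced chunks travel around the ring and
    # overwrite the stale partial sums on each receiver.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i + 1 - step) % n, bufs[i][(i + 1 - step) % n])
                 for i in range(n)]
        fan_in = Counter(dst for dst, _, _ in sends)
        max_fan_in = max(max_fan_in, max(fan_in.values()))
        for dst, chunk, value in sends:
            bufs[dst][chunk] = value

    return bufs, max_fan_in


if __name__ == "__main__":
    buffers, fan_in = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
    print(buffers)  # every node ends with the element-wise sum
    print(fan_in)   # senders per receiver never exceed one
```

In this model every receiver has exactly one sender per step, whereas a naive gather to a single root would have n - 1 simultaneous senders, which is the incast pattern that general-purpose congestion control schemes are designed to handle.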