
Technical Paper Session D: Networks for Large Language Models

August 24 @ 1:30 pm - 3:00 pm UTC-7


Session Chair: Dave Ozog (Intel)

The Case for Domain-Specific Networks

Authors: Dennis Abts (NVIDIA) and John Kim (KAIST).

Abstract: The combination of 100-billion-parameter large language models and trillion-token datasets places intense computational pressure on modern supercomputing infrastructure. This paper examines the pivotal role of the interconnection network in the large-scale systems that drive progress in modern AI. Specifically, we argue for domain-specific networks and the need for flexible, low-latency interconnects that can deliver high throughput at large scale, with tens of thousands of endpoints, where reliability and resilience to errors are paramount to surviving long-running training workloads. The role of a domain-specific network is to support specific communication or data movement between the domain-specific accelerators at full data rate; an oversubscribed general-purpose network may therefore not be suitable. In this position paper, we make the case for domain-specific networks for systems built around domain-specific accelerators. We also provide a case study of recent domain-specific networks built in industry and describe the challenges and opportunities for domain-specific networks.

Performance Characterization of Large Language Models on High-Speed Interconnects

Authors: Hao Qi (UC Merced), Liuyao Dai (UC Merced), Weicong Chen (UC Merced), Zhen Jia (AWS) and Xiaoyi Lu (UC Merced).

Abstract: Large Language Models (LLMs) have recently gained significant popularity due to their ability to generate human-like text and perform a wide range of natural language processing tasks. Training these models usually requires a large amount of computational resources and is often done in a distributed manner. High-speed interconnects can significantly influence the efficiency of distributed training, so systematic studies are needed to explore the distributed training characteristics of these models on high-speed interconnects. This paper presents a comprehensive performance characterization of representative large language models: GPT, BERT, and T5. We evaluate their training performance in terms of iteration time, interconnect utilization, and scalability over different high-speed interconnects and communication protocols, including TCP/IP, IPoIB, and RDMA. We observe that interconnects play a vital role in training LLMs. Specifically, RDMA outperforms IPoIB and TCP/IP by an average of 2.51x and 4.79x in training performance and achieves the highest interconnect utilization (up to 60 Gbps) in both strong and weak scaling, compared with up to 20 Gbps for IPoIB and up to 9 Gbps for TCP/IP, leading to the fastest training time. We also observe that larger models tend to have higher communication-bandwidth requirements, especially for AllReduce during backward propagation, which can take up to 91.72% of training time. Through our evaluation, we envision opportunities to improve communication time for better training performance of LLMs, and we extensively explore and summarize the role communication plays in distributed LLM training.
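The bandwidth observation above centers on the AllReduce collective that sums gradients across workers during backward propagation. As an illustrative aside (not code from the paper), a minimal pure-Python simulation of the ring AllReduce algorithm commonly used for this collective shows its two phases; real frameworks perform the same steps over NCCL or MPI on the interconnects the paper measures.

```python
# Illustrative sketch: ring AllReduce over n simulated workers.
# data[i] is worker i's gradient vector, split into n single-element
# "chunks". After the call, every worker holds the element-wise sum.

def ring_allreduce(data):
    n = len(data)
    chunks = [list(d) for d in data]  # chunks[i][c]: worker i's copy of chunk c
    # Phase 1, reduce-scatter: in n-1 steps, each worker passes one chunk to
    # its right neighbour, which adds it in. Afterwards worker i holds the
    # fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n]) for i in range(n)]
        for i, c, val in sends:          # all sends happen "simultaneously"
            chunks[(i + 1) % n][c] += val
    # Phase 2, all-gather: circulate the reduced chunks so every worker ends
    # up with the complete summed vector.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n]) for i in range(n)]
        for i, c, val in sends:
            chunks[(i + 1) % n][c] = val
    return chunks

# Three workers, three-element gradients; every worker ends with the sum.
print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# → [[12, 15, 18], [12, 15, 18], [12, 15, 18]]
```

Each worker sends 2(n-1)/n of its data in total regardless of worker count, which is why this collective is bandwidth-bound and why the RDMA-vs-TCP/IP gap reported above translates directly into iteration time.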