Principles and Practice of Scalable and Distributed Deep Neural Networks Training and Inference @ HotI 2025

Abstract

Recent advances in Deep Learning (DL) have led to many exciting challenges and opportunities. Modern DL frameworks such as PyTorch and TensorFlow enable high-performance training, inference, and deployment for various types of Deep Neural Networks (DNNs). This tutorial provides an overview of recent trends in DL and the role of cutting-edge hardware architectures and interconnects in moving the field forward. We also survey different DNN architectures, DL frameworks, and DL training and inference, with a special focus on parallelization strategies for large models such as GPT, LLaMA, DeepSeek, and ViT. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures to efficiently support large-scale distributed training, and we present some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU/GPU/DPU architectures available on modern HPC clusters. Throughout the tutorial, we include several hands-on exercises that give attendees first-hand experience running distributed DL training on a modern GPU cluster.
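
To give a flavor of the hands-on portion, the sketch below shows minimal single-node, multi-GPU data-parallel training with PyTorch's DistributedDataParallel. It assumes a torchrun launch with an NCCL backend; the model, synthetic dataset, and hyperparameters are placeholders for illustration and are not the tutorial's actual exercise material.

    # Minimal data-parallel training sketch with PyTorch DDP (illustrative only).
    # Example launch: torchrun --nproc_per_node=4 ddp_sketch.py
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

    def main():
        # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Toy model and synthetic data stand in for a real DNN and dataset.
        model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda(local_rank)
        ddp_model = DDP(model, device_ids=[local_rank])

        data = TensorDataset(torch.randn(4096, 128), torch.randint(0, 10, (4096,)))
        sampler = DistributedSampler(data)      # shards the dataset across ranks
        loader = DataLoader(data, batch_size=64, sampler=sampler)

        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        for epoch in range(2):
            sampler.set_epoch(epoch)            # reshuffle the shards each epoch
            for x, y in loader:
                x, y = x.cuda(local_rank), y.cuda(local_rank)
                optimizer.zero_grad()
                loss = criterion(ddp_model(x), y)
                loss.backward()                 # gradients are all-reduced across ranks
                optimizer.step()
            if dist.get_rank() == 0:
                print(f"epoch {epoch}: loss {loss.item():.4f}")

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

In practice, the communication backend and launcher are cluster-specific; the tutorial's exercises also cover MPI-based approaches on HPC systems.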

Tutorial Outline

  1. Introduction
  2. Deep Learning Frameworks
  3. Deep Neural Network Training
  4. Distributed Data-Parallel Training
  5. Latest Trends in High-Performance Computing Architectures
  6. Challenges in Exploiting HPC Technologies for DL
  7. Advanced Distributed Training
  8. Distributed Inference Solutions
  9. Open Issues and Challenges
  10. Conclusion and Q&A

Presenters

  • Dhabaleswar K. (DK) Panda, Ohio State University
  • Nawras Alnaasan, Ohio State University

Presenter Bios

Dhabaleswar K. (DK) Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He serves as the Director of the newly established $20M NSF-AI Institute, ICICLE (https://icicle.ai). He has published over 500 papers in the area of high-end computing and networking. The MVAPICH (High-Performance MPI over InfiniBand, Omni-Path, iWARP, RoCE, and Slingshot) libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 3,450 organizations worldwide (in 92 countries). More than 1.91 million downloads of this software have taken place from the project’s site. This software is empowering several InfiniBand clusters (including the 21st, 67th, and 88th ranked ones) in the TOP500 list. High-performance and scalable solutions for AI frameworks (Deep Learning and Machine Learning) from his group are available from https://hidl.cse.ohio-state.edu. Similarly, scalable and high-performance solutions for Big Data and Data Science frameworks are available from https://hibd.cse.ohio-state.edu. Prof. Panda is a Fellow of ACM and IEEE. He is a recipient of the 2022 IEEE Charles Babbage Award and the 2024 IEEE TCPP Outstanding Service and Contributions Award. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.

Nawras Alnaasan is a Graduate Research Associate at the Network-Based Computing Laboratory, Columbus, OH, USA. He is currently pursuing a Ph.D. degree in computer science and engineering at The Ohio State University. His research interests lie at the intersection of deep learning and high-performance computing. He works on advanced parallelization techniques to accelerate the training of Deep Neural Networks and exploit underutilized HPC resources, covering a wide range of DL applications including supervised learning, semi-supervised learning, and hyperparameter optimization. He is actively involved in several research projects, including HiDL (High-performance Deep Learning) and ICICLE (Intelligent Cyberinfrastructure with Computational Learning in the Environment). Alnaasan received his B.S. degree in computer science and engineering from The Ohio State University.