Principles and Practice of Scalable and Distributed Deep Neural Networks Training and Inference
Presenter Names: Dhabaleswar K. (DK) Panda, Hari Subramoni, Aamir Shafi, Nawras Alnaasan
Abstract: Recent advances in Deep Learning (DL) have led to many exciting challenges and opportunities. Modern DL frameworks, including TensorFlow, PyTorch, Horovod, and DeepSpeed, enable high-performance training, inference, and deployment for various types of Deep Neural Networks (DNNs). This tutorial provides an overview of recent trends in DL and the role of cutting-edge hardware architectures and interconnects in moving the field forward. We also present an overview of different DNN architectures, DL frameworks, and DL training and inference, with a special focus on parallelization strategies for large models such as GPT, LLaMA, BERT, ViT, and ResNet. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures and efficiently support large-scale distributed training. We also describe some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU and GPU architectures available on modern HPC clusters. Throughout the tutorial, we include several hands-on exercises so that attendees can gain first-hand experience running distributed DL training and inference on a modern GPU cluster.
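To give a flavor of the parallelization strategies the abstract refers to, the sketch below shows data-parallel training with PyTorch's DistributedDataParallel (DDP). This is a minimal illustration, not the tutorial's actual exercise material: the model, data, and hyperparameters are synthetic placeholders, and the script assumes a CUDA-enabled PyTorch installation launched with torchrun.

```python
# Minimal data-parallel training sketch with PyTorch DDP.
# Assumes launch via torchrun, which sets RANK, LOCAL_RANK, and WORLD_SIZE.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL is the usual backend on GPU clusters; an MPI backend also exists.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    # Toy model standing in for GPT/LLaMA/BERT/ViT/ResNet-scale networks.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(),
                          nn.Linear(4096, 10)).to(device)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        # Each rank trains on its own shard of the (synthetic) batch;
        # DDP allreduces gradients across ranks during backward().
        x = torch.randn(32, 1024, device=device)
        y = torch.randint(0, 10, (32,), device=device)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=4 train_ddp.py`, each GPU runs a replica of the model and gradients are averaged across ranks in the backward pass. Frameworks such as DeepSpeed, also covered in the tutorial, go further for GPT/LLaMA-scale models by sharding the model and optimizer state themselves.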
Bio: Dhabaleswar K. (DK) Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at The Ohio State University. He serves as the Director of the ICICLE NSF-AI Institute (https://icicle.ai). He has published over 500 papers. The MVAPICH2 MPI libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 3,300 organizations worldwide (in 90 countries). The software has been downloaded more than 1.69 million times from the project’s site and powers many clusters in the TOP500 list. High-performance and scalable solutions for Deep Learning frameworks and Machine Learning applications from his group are available from https://hidl.cse.ohio-state.edu, and scalable, high-performance solutions for Big Data and Data Science frameworks are available from https://hibd.cse.ohio-state.edu. Prof. Panda is an IEEE Fellow and a recipient of the 2022 IEEE Charles Babbage Award. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.
Dr. Hari Subramoni is an Assistant Professor in the Department of Computer Science and Engineering at The Ohio State University. His current research interests include high-performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network-topology-aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, Big Data, Deep Learning, and cloud computing. He has published over 100 papers in international journals and conferences related to these research areas, and he has been actively involved in various professional activities for academic journals and conferences. Dr. Subramoni conducts research on the design and development of the MVAPICH2 (High-Performance MPI over InfiniBand, iWARP, and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC, and CAF)) software packages. He is a member of IEEE and ACM.
Dr. Aamir Shafi is currently a Research Scientist in the Department of Computer Science and Engineering at The Ohio State University, where he is involved in the High Performance Big Data project led by Dr. Dhabaleswar K. Panda. Dr. Shafi was a Fulbright Visiting Scholar at the Massachusetts Institute of Technology (MIT) in the 2010-2011 academic year, where he worked with Prof. Charles Leiserson on the award-winning Cilk technology. Dr. Shafi received his PhD in Computer Science from the University of Portsmouth, UK, in 2006 and his Bachelor’s degree in Software Engineering from NUST, Pakistan, in 2003. Dr. Shafi’s current research interests include architecting robust libraries and tools for Big Data computation, with an emphasis on Machine and Deep Learning applications. Dr. Shafi co-designed and co-developed a Java-based MPI-like library called MPJ Express. More details about Dr. Shafi are available from https://people.engineering.osu.edu/people/shafi.16.
Nawras Alnaasan is a Graduate Research Associate at the Network-Based Computing Laboratory, Columbus, OH, USA. He is currently pursuing a Ph.D. degree in Computer Science and Engineering at The Ohio State University. His research interests lie at the intersection of Deep Learning and high-performance computing. He works on advanced parallelization techniques to accelerate the training of Deep Neural Networks and exploit underutilized HPC resources, covering a wide range of DL applications including supervised learning, semi-supervised learning, and hyperparameter optimization. He is actively involved in several research projects, including HiDL (High-Performance Deep Learning) and ICICLE (Intelligent Cyberinfrastructure with Computational Learning in the Environment). Alnaasan received his B.S. degree in Computer Science and Engineering from The Ohio State University.