
ASTRA-sim and Chakra: Enabling Software-Hardware Co-design Exploration for Distributed Machine Learning Platforms

August 23 @ 3:00 pm - 5:00 pm UTC+0

Presenter Names: Tushar Krishna and William Won (Georgia Tech)

Abstract: As Artificial Intelligence (AI) models scale at an unprecedented rate, Machine Learning (ML) execution increasingly relies on Distributed ML over customized neural-accelerator (e.g., GPU or TPU)-based High-Performance Computing (HPC) platforms connected via high-speed interconnects (e.g., NVLink). Deep Neural Network (DNN) execution involves a complex interplay between the DNN architecture, parallelization strategy, scheduling strategy, collective communication, network topology, memory accesses, and the accelerator endpoint, as shown in Figure 1. There is therefore a need for a comprehensive methodology to understand and navigate this intertwined co-design space in order to (i) architect future platforms, (ii) develop novel parallelism schemes to support efficient training of future DNN models, and (iii) develop novel fabrics for AI systems. As an ongoing collaboration between Georgia Tech and several companies (Intel, Meta, AMD, NVIDIA, and HPE), we have been jointly developing a detailed cycle-accurate distributed AI simulator called ASTRA-sim. ASTRA-sim models the co-design space of distributed ML described above and schedules the compute-communication interactions over plug-and-play computation, network, and remote memory simulators. ASTRA-sim leverages the MLCommons Chakra format to describe arbitrary distributed ML workloads. To the best of our knowledge, ASTRA-sim is the first open-source simulator for modeling future distributed ML platforms.
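To make the compute-communication scheduling idea concrete, here is a minimal toy sketch. It is not the actual ASTRA-sim API or the Chakra schema; it simply illustrates, with made-up node names and durations, how a dependency graph of compute and communication nodes can be scheduled so that independent work on different resources (an accelerator and a network link) overlaps, which is the kind of interplay the simulator models in far greater detail.

```python
# Toy illustration (hypothetical; NOT the ASTRA-sim API or Chakra schema):
# schedule a topologically sorted dependency graph of compute and
# communication nodes, with one shared resource per kind.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str                       # "compute" or "comm" (illustrative labels)
    duration: float                 # cycles; made-up numbers
    deps: list = field(default_factory=list)


def simulate(nodes):
    """Return per-node finish times. A node starts once all of its
    dependencies have finished AND its resource (compute unit or
    network link) is free. Assumes `nodes` is topologically sorted."""
    finish = {}
    resource_free = {"compute": 0.0, "comm": 0.0}
    for n in nodes:
        ready = max((finish[d.name] for d in n.deps), default=0.0)
        start = max(ready, resource_free[n.kind])
        finish[n.name] = start + n.duration
        resource_free[n.kind] = finish[n.name]
    return finish


# Layer-wise data-parallel training example: the all-reduce of layer 2's
# gradients overlaps with the backprop compute of layer 1, since the two
# use different resources.
bp2 = Node("backprop_L2", "compute", 10)
ar2 = Node("allreduce_L2", "comm", 15, deps=[bp2])
bp1 = Node("backprop_L1", "compute", 10, deps=[bp2])
ar1 = Node("allreduce_L1", "comm", 15, deps=[bp1])

times = simulate([bp2, ar2, bp1, ar1])
print(times)  # overall finish is 40 cycles vs. 50 if run serially
```

The overlap is visible in the numbers: `allreduce_L2` (cycles 10-25) runs concurrently with `backprop_L1` (cycles 10-20), so the iteration finishes at cycle 40 rather than the 50 cycles a fully serialized schedule would take.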

Bio: Dr. Tushar Krishna (tushar@ece.gatech.edu) is an Associate Professor in the School of Electrical and Computer Engineering at Georgia Tech. He is currently also a Visiting Associate Professor in the Department of Electrical Engineering and Computer Science at MIT. He holds a Ph.D. in Electrical Engineering and Computer Science from MIT (2014), an M.S.E. in Electrical Engineering from Princeton University (2009), and a B.Tech. in Electrical Engineering from the Indian Institute of Technology (IIT) Delhi (2007). Before joining Georgia Tech in 2015, Dr. Krishna spent a year as a researcher in the VSSAD group at Intel, Massachusetts. Dr. Krishna's research spans computer architecture, interconnection networks, networks-on-chip (NoC), and AI/ML accelerator systems, with a focus on optimizing data movement in modern computing platforms. His research is funded via multiple awards from NSF, DARPA, IARPA, SRC (including JUMP 2.0), the Department of Energy, Intel, Google, Meta/Facebook, Qualcomm, and TSMC.

William Won (william.won@gatech.edu) is a Ph.D. student in the College of Computing at Georgia Tech. His research interests include distributed machine learning training and inference, simulation of distributed machine learning workloads, collective communication optimization, and machine learning algorithms.