[Photo: At NVIDIA GTC 2017, one of the most exciting conferences I have ever attended!]

[Photo: Cinque Terre]

Ammar Ahmad Awan


I am a Researcher at Microsoft working on the DeepSpeed library with Yuxiong He and the DeepSpeed team.


My contact information is here.

Research Statement (Concise Version)

My research work is geared towards the co-design of high-performance communication middleware (e.g., Message Passing Interface (MPI) libraries) and Deep Learning (DL) frameworks (e.g., TensorFlow, PyTorch, etc.). DL has proven to be one of the most disruptive technologies in recent times and has taken over virtually the entire spectrum of research; from systems to security, and from algorithms to computer architectures, DL is at the core of the modern Artificial Intelligence (AI) revolution and is increasingly being adopted for solving complex problems. From computer vision and speech understanding to machine translation, pattern matching, and even scientific simulations, DL is being incorporated across the board. While these applications are diverse, the key piece in this success story is Deep Neural Networks (DNNs) and their powerful prediction capabilities, which require a massive amount of computation to be fully trained; this need has even triggered a fundamental shift in the design philosophy of computer architectures.

Because DNN training is compute-intensive, we use fast hardware like GPUs to speed up the process, but there are limits to a single GPU’s capability. State-of-the-art DNNs can take weeks to train even on the latest and greatest GPUs. To address this, High Performance Computing (HPC) clusters with hundreds of GPUs offer a path forward, though not without their own set of research challenges. And what DNNs are to DL, MPI is to HPC. This leads to my main research philosophy: without co-designing MPI and DL, two highly important but vastly different disciplines, it is extremely hard to realize scalable, distributed DNN training systems. The “need for speed” in DL is what pushes the best in HPC, and this leads to my major research challenge: how to co-design MPI libraries and DL frameworks to achieve efficient, scalable, and distributed DNN training on HPC clusters with heterogeneous resources, including multi-core CPUs, GPUs, and network interfaces?
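
To make the data-parallel pattern concrete, here is a minimal sketch of the communication step that such training builds on: every MPI process computes gradients on its own shard of the mini-batch, and an allreduce averages them before an identical weight update on every rank. This is an illustrative toy using mpi4py and NumPy with made-up sizes, not the actual MVAPICH or DeepSpeed implementation, which fuses, buckets, and overlaps these operations.

    # Minimal sketch of synchronous data-parallel training with MPI.
    # Run with e.g. `mpirun -np 4 python dp_sketch.py`; requires mpi4py and NumPy.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Toy "model": a single weight matrix (hypothetical size).
    weights = np.ones((1024, 1024), dtype=np.float32)

    def compute_local_gradient(weights, rank, step):
        """Stand-in for forward/backward on this rank's shard of the mini-batch."""
        rng = np.random.default_rng(seed=rank * 1000 + step)
        return rng.standard_normal(weights.shape).astype(np.float32)

    for step in range(10):
        local_grad = compute_local_gradient(weights, rank, step)

        # The key collective: sum gradients across all ranks, then average.
        global_grad = np.empty_like(local_grad)
        comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
        global_grad /= size

        # Applying the same averaged update on every rank keeps the replicas in sync.
        weights -= 0.01 * global_grad

    if rank == 0:
        print(f"completed {step + 1} data-parallel steps on {size} ranks")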

Within this broad problem, there are several concrete challenges that need to be addressed:

  1. How to develop performance characterization and profiling infrastructures and strategies that offer a holistic view of distributed DNN training performance in HPC environments?
  2. How to accelerate data-parallel training of DNNs on hundreds of GPUs in an efficient fashion?
  3. How to co-design the MPI middleware and DL frameworks to extract the best performance from heterogeneous HPC systems?
  4. How to efficiently deal with very large, out-of-core DNNs that do not fit in a single GPU’s or CPU’s memory? (A simplified sketch of one staging idea follows this list.)
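
For the out-of-core challenge in item 4, the deliberately simplified sketch below shows one staging idea: keep a layer’s parameters in host memory and move them to the GPU only while that layer executes. It is a toy PyTorch illustration with made-up layer sizes, not the actual out-of-core designs I have worked on, which also cover the backward pass, prefetching, and optimizer state.

    # Toy sketch of staging layer parameters between CPU and GPU memory so a
    # model larger than GPU memory can still run. Requires PyTorch and a CUDA GPU.
    # Layer sizes are arbitrary; real out-of-core training also handles the
    # backward pass, prefetching, and optimizer state.
    import torch
    import torch.nn as nn

    def stage_in(module, inputs):
        # Copy this layer's parameters to the GPU just before its forward pass.
        module.to("cuda", non_blocking=True)

    def stage_out(module, inputs, output):
        # Evict the parameters back to host memory right after the layer finishes.
        module.to("cpu")

    # Parameters start (and mostly live) in host memory.
    model = nn.Sequential(
        nn.Linear(4096, 4096), nn.ReLU(),
        nn.Linear(4096, 4096), nn.ReLU(),
        nn.Linear(4096, 10),
    )

    for layer in model:
        layer.register_forward_pre_hook(stage_in)
        layer.register_forward_hook(stage_out)

    x = torch.randn(32, 4096, device="cuda")  # activations stay on the GPU
    with torch.no_grad():
        y = model(x)
    print(y.shape)  # torch.Size([32, 10])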

Instead of focusing on each of these challenges separately, I propose a new research subarea that we call High Performance Deep Learning (HiDL). The main idea is to solve the challenges mentioned above in a coherent fashion. My core philosophy is that co-design of HPC libraries like MVAPICH and DL frameworks like Caffe, TensorFlow, and others leads to the best performance one can achieve. Rather than designing and developing DL frameworks and MPI libraries in isolation, I take a holistic approach that spans performance introspection, communication profiling, co-design of MPI and DL to accelerate data-parallel DNN training, and effective strategies to deal with very large (out-of-core) DNNs. The published articles describing these solutions can be found on my publications page.
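
As a small illustration of what communication profiling looks like at its simplest, the sketch below times the compute phase and the allreduce phase of each training step separately; this is the kind of breakdown that motivates overlapping and co-designed collectives. The buffer size and step count are hypothetical, and real profiling infrastructures capture far more detail (per-collective histograms, timelines, GPU kernel activity, and so on).

    # Minimal sketch of splitting per-step time into "compute" and "communication".
    # Requires mpi4py and NumPy; run with e.g. `mpirun -np 4 python profile_sketch.py`.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Hypothetical 64 MB gradient buffer (16M float32 values).
    grad = np.zeros(16 * 1024 * 1024, dtype=np.float32)
    summed = np.empty_like(grad)

    compute_time = comm_time = 0.0
    for step in range(20):
        t0 = MPI.Wtime()
        grad[:] = rank + step                       # stand-in for forward/backward compute
        t1 = MPI.Wtime()
        comm.Allreduce(grad, summed, op=MPI.SUM)    # gradient synchronization
        t2 = MPI.Wtime()
        compute_time += t1 - t0
        comm_time += t2 - t1

    if rank == 0:
        total = compute_time + comm_time
        print(f"compute {compute_time:.4f}s, communication {comm_time:.4f}s "
              f"({100 * comm_time / total:.1f}% of step time)")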