Intelligence Augmented Computer Systems

Computer systems are rapidly evolving to meet the computational demands of emerging applications by incorporating innovations in hardware architecture, operating systems, network interconnects, and storage, leading to increased heterogeneity. The state of the art is to vertically integrate these systems with painstakingly built, handcrafted, average-case heuristics. Generating these heuristics is already a fundamental challenge, as variations across machine configurations, workloads, and deployment environments make the process painful and costly. Moreover, we are reaching the limits of conventional approaches to generating heuristics, which rely on recurring, human-expert-driven engineering effort.

My research addresses this challenge by providing intelligent control, management, and optimization of large-scale heterogeneous computer systems in a fundamental way, starting with mathematical models and ending with real software and hardware that provide efficient, scalable, and composable system management solutions. It does so through innovations at the intersection of systems, machine learning (ML), and computer architecture, developing computer systems that continuously monitor themselves and adapt both their behavior and their internal models to ensure that users' throughput, latency, and resilience goals are met in complex, dynamic environments.
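To make the monitor-and-adapt idea concrete, here is a minimal sketch of such a loop for a single service. It is not taken from any of the systems described on this page: the class `AdaptiveController`, the helper `run`, and the toy model "latency is roughly work divided by CPUs" are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class AdaptiveController:
    """Hypothetical monitor/model/adapt loop (illustrative only)."""
    target_latency_ms: float
    work_estimate: float = 100.0  # internal model: latency ~= work_estimate / cpus
    smoothing: float = 0.3        # weight given to each new observation
    cpus: int = 1
    max_cpus: int = 64

    def step(self, observed_latency_ms: float) -> int:
        # Adapt the internal model: under the assumed model, the
        # observation implies work = latency * cpus.
        implied_work = observed_latency_ms * self.cpus
        self.work_estimate += self.smoothing * (implied_work - self.work_estimate)
        # Adapt the behavior: pick the smallest allocation that the
        # updated model predicts will meet the latency goal.
        needed = math.ceil(self.work_estimate / self.target_latency_ms)
        self.cpus = max(1, min(self.max_cpus, needed))
        return self.cpus


def run(controller: AdaptiveController, true_work: float, steps: int) -> float:
    """Drive the controller against a simulated service; return final latency."""
    for _ in range(steps):
        latency = true_work / controller.cpus  # monitor the (simulated) system
        controller.step(latency)               # update model and allocation
    return true_work / controller.cpus


controller = AdaptiveController(target_latency_ms=50.0)
final_latency = run(controller, true_work=800.0, steps=40)
print(f"cpus={controller.cpus}, latency={final_latency:.1f} ms")
```

A real controller would replace the one-parameter model with a learned one and would have to cope with noisy telemetry, which is where the resource-management and resilience work listed below comes in.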

We have used those techniques to implement:

Policies for Automated Resource Management: We have built several systems for performance-oriented resource management in heterogeneous clusters.
- We have built Symphony to schedule data-flow graphs across heterogeneous clusters containing multiple types of CPUs as well as accelerators like GPUs and FPGAs.
- We have built FIRM for reallocating resources to microservices in order to minimize tail latency and sustain SLOs.
- We have built ML-LB to load-balance threads across multiple scheduling domains in Linux’s Completely Fair Scheduler.
Policies for Automated Resilience Management: We have built several systems for diagnosing and correcting errors in large heterogeneous systems.
- We have built BayesPerf for correcting measurement errors in the input telemetry data to the ML-controllers.
- We have built Kaleidoscope for the diagnosis and localization of failures in large disaggregated storage systems.
- We have built BFI for targeted test-case generation for fault injection campaigns to test the resilience of ML-controllers.
Enabling low-latency training and inference: We have developed several techniques to meet the tight latency constraints of these ML-controllers:
- We have designed sampling-based approximate inference methods for hybrid Bayesian-deep learning models.
- We have designed AcMC², a high-level synthesis compiler for FPGA-based Markov chain Monte Carlo accelerators to target sampling-based training and inference of ML-controllers.
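For readers unfamiliar with the sampling-based inference mentioned in the last two items, the following is a minimal software sketch of random-walk Metropolis-Hastings, one standard Markov chain Monte Carlo method. It is a generic textbook illustration, not the AcMC² compiler or its generated hardware; the function names, the Gaussian proposal, and the standard-normal target are assumptions chosen to keep the example self-contained.

```python
import math
import random


def metropolis_hastings(log_density, initial, n_samples, step=1.0, seed=0):
    """Sample a 1-D unnormalized density with random-walk Metropolis-Hastings."""
    rng = random.Random(seed)
    x = initial
    log_p = log_density(x)
    samples = []
    for _ in range(n_samples):
        # Symmetric Gaussian proposal centered on the current state.
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        samples.append(x)  # a rejected proposal repeats the current state
    return samples


def log_standard_normal(x):
    """Log-density of N(0, 1), up to an additive constant (illustrative target)."""
    return -0.5 * x * x


samples = metropolis_hastings(log_standard_normal, initial=0.0,
                              n_samples=20000, step=2.0)
kept = samples[2000:]  # discard an initial burn-in segment
mean = sum(kept) / len(kept)
variance = sum((s - mean) ** 2 for s in kept) / len(kept)
print(f"estimated mean={mean:.3f}, variance={variance:.3f}")
```

Each iteration depends on the previous state, so the chain is inherently sequential; that serial dependence is one reason low-latency sampling can benefit from specialized accelerators.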

Moreover, applications, their computational requirements, and their ease of programming have been essential to my research and a driving force behind the broader goal of managing, controlling, and optimizing computer system performance using ML. One example is my work on designing and implementing a workload-optimized computing system for computational genomics and precision medicine applications [1, 2, 3, 4].

Related Publications

Is Function-as-a-Service a Good Fit for Latency-Critical Services?
Haoran Qiu, Saurabh Jha, Subho S. Banerjee, Archit Patke, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
WoSC 2021 (Colocated with Middleware 2021).
Improved GPU Implementations of the Pair-HMM Forward Algorithm for DNA Sequence Alignment.
Enliang Li, Subho S. Banerjee, Sitao Huang, Ravishankar K. Iyer, Deming Chen.
ICCD 2021.
BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics.
Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
ASPLOS 2021.
- DOI
- arXiv
- Short Talk
- Abstract
- Paper
- Slides
Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems.
Saurabh Jha, Shengkun Cui, Subho S. Banerjee, Tianyin Xu, Jeremy Enos, Mike Showerman, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
Supercomputing 2020.
- Best Paper & Best Student Paper Finalist
- DOI
- Code
- Paper
FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices.
Haoran Qiu, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
OSDI 2020.
- DOI
- arXiv
- Code
- Data
- Paper
Machine Learning for Load Balancing in the Linux Kernel.
Jingde Chen, Subho S. Banerjee, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
APSys 2020.
- DOI
- Code (Kernel)
- Code (ML)
- Paper
Inductive-bias-driven Reinforcement Learning for Efficient Schedules in Heterogeneous Clusters.
Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
ICML 2020.
- DOI
- arXiv
- Paper
- Slides
- CSL News
AcMC²: Accelerated Markov Chain Monte Carlo for Probabilistic Models.
Subho S. Banerjee, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
ASPLOS 2019.
- DOI
- Lightning Talk
- Paper
- Slides
CAUDIT: Continuous Auditing of SSH-Servers To Mitigate Brute-Force Attacks.
Phuong M. Cao, Yuming Wu, Subho S. Banerjee, Justin Azoff, Alex Withers, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
NSDI 2019.
- DOI
- Data
- Paper
- CSL News
ASAP: Accelerated Short Read Alignment on Programmable Hardware.
Subho S. Banerjee, Mohamed el-Hadedy, Jong B. Lim, Steve Lumetta, Zbigniew T. Kalbarczyk, Deming Chen, and Ravishankar K. Iyer.
IEEE Transactions on Computers.
- DOI
- arXiv
- Paper
A ML-based Runtime System for Executing Dataflow Graphs on Heterogeneous Processors.
Subho S. Banerjee, Steve Lumetta, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
SoCC 2018.
- DOI
- Paper
Symphony: Leveraging Probabilistic Graphical Models to Schedule Tasks to Clusters of Heterogeneous Processors.
Subho S. Banerjee, Steve Lumetta, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
AISys 2017 (Colocated with SOSP 2017).
- Paper
- Poster
- Slides
On Accelerating Pair-HMM Computations in Programmable Hardware.
Subho S. Banerjee, Mohamed el-Hadedy, Ching Y. Tan, Zbigniew T. Kalbarczyk, Steve Lumetta, and Ravishankar K. Iyer.
FPL 2017.
- DOI
- Paper
- Slides
Data-Driven Longitudinal Modeling and Prediction of Symptom Dynamics in Major Depressive Disorder: Integrating Factor Graphs and Learning Methods.
Arjun Athreya, Subho S. Banerjee, Drew Neavin, Rima Kaddurah-Daouk, A. John Rush, Mark A. Frye, Liewei Wang, Richard M. Weinshilboum, William V. Bobo, and Ravishankar K. Iyer.
CIBCB 2017.
- Best Paper Award
- DOI
- Paper
- Forbes
Bringing Innovations in Systems and Analytics to the Bedside: Design of the CompGen Machine.
S. S. Banerjee, A. P. Athreya, Y. Varatharajah, M. Aly, C. Tan, Z. Stephens, Z. Kalbarczyk, S. Lumetta, L. Wang, R. Weinshilboum, and R. K. Iyer.
ECCB 2017.
- Web
- Paper
ASAP: Accelerated Short Read Alignment on Programmable Hardware.
Subho S. Banerjee, Mohamed el-Hadedy, Jong B. Lim, Daniel Chen, Zbigniew T. Kalbarczyk, Deming Chen, and Ravishankar K. Iyer.
FPGA 2017.
- DOI
- Poster
Efficient and Scalable Workflows for Genomic Analyses.
Subho S. Banerjee, Arjun P. Athreya, Liudmila S. Mainzer, C. Victor Jongeneel, Wen-Mei Hwu, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
DIDC 2016 (Colocated with HPDC 2016).
- DOI
- Paper
- Slides
IGen: The Illinois Genomics Execution Environment.
Subho S. Banerjee and Ravishankar K. Iyer.
Supercomputing 2015 (SRC).
- Web
- Paper
- Poster
Decomposing Genomics Algorithms: Core Computations for Accelerating Genomics Analyses.
Arjun P. Athreya, Subho S. Banerjee, C. Victor Jongeneel, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
Coordinated Science Laboratory Technical Report UILU-ENG-14-2201.
- Paper

Subho Sankar Banerjee
