Intelligence-Augmented Computer Systems
Computer systems are rapidly evolving to meet the demands of emerging applications by incorporating innovations in hardware architecture, operating systems, network interconnects, and storage, leading to increased heterogeneity. The state of the art is to vertically integrate these systems with painstakingly handcrafted, average-case heuristics. Generating such heuristics is a fundamental challenge: variations across machine configurations, workloads, and deployment environments make the process painful and costly. Moreover, we are reaching the limits of conventional approaches, which depend on recurring human-expert-driven engineering effort.
My research addresses this challenge by providing intelligent control, management, and optimization of large-scale heterogeneous computer systems in a fundamental way: starting with mathematical models and ending with real software and hardware that deliver efficient, scalable, and composable system-management solutions. It does so through innovations at the intersection of systems, machine learning (ML), and computer architecture, developing computer systems that continuously monitor themselves and adapt both their behavior and their internal models to ensure that users' throughput, latency, and resilience goals are met in complex, dynamic environments.
We have used those techniques to implement:
- Policies for Automated Resource Management: We have built several systems for performance-oriented resource management in heterogeneous clusters.
- We have built Symphony to schedule data-flow graphs across heterogeneous clusters containing multiple types of CPUs as well as accelerators like GPUs and FPGAs.
- We have built FIRM for reallocating resources to microservices in order to minimize tail latency and sustain SLOs.
- We have built ML-LB to load-balance threads across multiple scheduling domains in Linux’s Completely Fair Scheduler.
- Policies for Automated Resilience Management: We have built several systems for diagnosing and correcting errors in large heterogeneous systems.
- We have built BayesPerf for correcting measurement errors in the telemetry data fed to the ML controllers.
- We have built Kaleidoscope for the diagnosis and localization of failures in large disaggregated storage systems.
- We have built BFI for targeted test-case generation in fault-injection campaigns that test the resilience of ML controllers.
- Enabling Low-Latency Training and Inference: We have built several techniques to satisfy the tight latency constraints required by these ML controllers:
- We have designed sampling-based approximate inference methods for hybrid Bayesian-deep learning models.
- We have designed AcMC², a high-level synthesis compiler that generates FPGA-based Markov chain Monte Carlo accelerators for sampling-based training and inference in ML controllers.
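As an illustration only (not the actual AcMC² or BayesPerf implementation, whose details are beyond this statement), the sampling-based inference these accelerators target can be sketched as a random-walk Metropolis-Hastings sampler; the observation values and step size below are hypothetical:

```python
import math
import random

def metropolis_hastings(log_post, x0, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings: draws samples from a target
    density given only its unnormalized log-probability."""
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    samples = []
    for _ in range(n_samples):
        cand = x + rng.gauss(0.0, step)            # symmetric Gaussian proposal
        lp_cand = log_post(cand)
        if math.log(rng.random()) < lp_cand - lp:  # accept with prob min(1, ratio)
            x, lp = cand, lp_cand
        samples.append(x)
    return samples

# Toy posterior: Gaussian mean with a standard-normal prior and
# unit-variance likelihood over hypothetical observations.
data = [1.9, 2.1, 2.0, 1.8, 2.2]

def log_post(mu):
    log_prior = -0.5 * mu * mu
    log_lik = sum(-0.5 * (d - mu) ** 2 for d in data)
    return log_prior + log_lik

samples = metropolis_hastings(log_post, x0=0.0, n_samples=5000)
posterior_mean = sum(samples[1000:]) / len(samples[1000:])  # discard burn-in
```

The inner loop is a long chain of proposal, density evaluation, and accept/reject steps; it is this regular, sequential structure that makes MCMC a natural target for specialized FPGA pipelines.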
Moreover, applications, their computational requirements, and their ease of programming have been essential in my research and a driving force behind the broader goal of managing, controlling, and optimizing computer-system performance using ML. For example, I have designed and implemented a workload-optimized computing system for computational genomics and precision medicine applications [1, 2, 3, 4].
Related Publications
Is Function-as-a-Service a Good Fit for Latency-Critical Services?
WoSC 2021 (Colocated with Middleware 2021).
Improved GPU Implementations of the Pair-HMM Forward Algorithm for DNA Sequence Alignment.
ICCD 2021.
BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics.
ASPLOS 2021.
Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems.
Supercomputing 2020.
- Best Paper & Best Student Paper Finalist
FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices.
OSDI 2020.
Machine Learning for Load Balancing in the Linux Kernel.
ApSys 2020.
Inductive-bias-driven Reinforcement Learning for Efficient Schedules in Heterogeneous Clusters.
ICML 2020.
AcMC²: Accelerated Markov Chain Monte Carlo for Probabilistic Models.
ASPLOS 2019.
CAUDIT: Continuous Auditing of SSH-Servers To Mitigate Brute-Force Attacks.
NSDI 2019.
ASAP: Accelerated Short Read Alignment on Programmable Hardware.
IEEE Transactions on Computers.
A ML-based Runtime System for Executing Dataflow Graphs on Heterogeneous Processors.
SoCC 2018.
Symphony: Leveraging Probabilistic Graphical Models to Schedule Tasks to Clusters of Heterogeneous Processors.
AISys 2017 (Colocated with SOSP 2017).
On Accelerating Pair-HMM Computations in Programmable Hardware.
FPL 2017.
- Best Paper Award
Bringing Innovations in Systems and Analytics to the Bedside: Design of the CompGen Machine.
ECCB 2017.
ASAP: Accelerated Short Read Alignment on Programmable Hardware.
FPGA 2017.
Efficient and Scalable Workflows for Genomic Analyses.
DIDC 2016 (Colocated with HPDC 2016).
IGen: The Illinois Genomics Execution Environment.
Supercomputing 2015 (SRC).
Decomposing Genomics Algorithms: Core Computations for Accelerating Genomics Analyses.
Coordinated Science Laboratory Technical Report UILU-ENG-14-2201.