Intelligence Augmented Compter Systems
Computer systems are rapidly evolving to meet the computational demands of emerging application demands by incorporating innovations in hardware architecture, operating systems, network interconnects, and storage, leading to increased heterogeneity. The state of the art is to vertically integrate these systems with painstakingly built, handcrafted, average-case heuristics. Heuristic generation is already a fundamental challenge, as variations across machine configurations, workloads, and deployment environments can make heuristic generation painful and costly. Moreover, we are reaching the limits of conventional approaches of generating heuristics, which involve recurring human-expert-driven engineering efforts.
My research addresses the above challenge by providing intelligent control, management, and optimization of large-scale heterogeneous computer systems in a fundamental way, starting with mathematical models and ending with real software and hardware that provides efficient, scalable, and composable system management solutions. It does so by building innovations at the intersection of systems, machine learning (ML), and computer architecture to develop computer systems that continuously monitor themselves, adapting both their behavior and internal models to ensure that the users' throughput, latency, and resilience goals are met in complex, dynamic environments.
We have used those techniques to implement:
- Policies for Automated Resource Management: We have built several systems for performance-oriented
resource management in heterogeneous clusters.
- We have built Symphony to schedule data-flow graphs across heterogeneous clusters containing multiple types of CPUs as well as accelerators like GPUs and FPGAs.
- We have built FIRM for reallocating resources to microservices in order minimize tail latency and sustain SLOs.
- We have built ML-LB to load-balance threads across multiple scheduling domains in Linux’s Completely Fair Scheduler.
- Policies for Automated Resilience Management: We have built several systems for diagnosing and
correcting errors in large heterogeneous systems.
- We have built BayesPerf for correcting measurement errors in the input telemetry data to the ML-controllers.
- We have built Kaleidoscope for the diagnosis and localization of failures in large disaggregated storage systems.
- We have built BFI for targeted test-case generation for fault injection campaigns to test the resilience of ML-controllers.
- Enabling low-latency training and inference: We have built several techniques to satisfy the tight latency constraints required by these ML-controllers:
Moreover, applications, their computational requirements, and their ease of programming have been essential in my research and a driving force behind the broader goals of managing, controlling, and optimizing computer system performance using ML. For example, in my work on designing and implementing a workload optimized computing system for computational genomics and precision medicine applications [1, 2, 3, 4].
Is Function-as-a-Service a Good Fit for Latency-Critical Services?.
WoSC 2021 (Colocated with Middleware 2021).
Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems.
- Best Paper & Best Student Paper Finalist
ASAP: Accelerated Short Read Alignment on Programmable Hardware.
IEEE Transactions on Computers.
Symphony: Leveraging Probabilistic Graphical Models to Schedule Tasks to Clusters of Heterogeneous Processors.
AISys 2017 (Colocated with SOSP 2017).
- Best Paper Award
Efficient and Scalable Workflows for Genomic Analyses.
DIDC 2016 (Colocated with HPDC 2016).
IGen: The Illinois Genomics Execution Environment.
Supercomputing 2015 (SRC).
Decomposing Genomics Algorithms: Core Computations for Accelerating Genomics Analyses.
Coordinated Science Laboratory Technical Report UILU-ENG-14-2201.