About me
I am a tenure-track assistant professor in the Department of Computer Science at Tianjin University and a member of TANK Lab, led by Prof. Keqiu Li. I received my Ph.D. degree from the Networked Systems Lab at the University of Southern California, advised by Prof. Ramesh Govindan. I obtained my B.S. degree from Shanghai Jiao Tong University, advised by Prof. Xinbing Wang.
My research interests include large language model (LLM) systems, deep neural network (DNN) systems, performance analysis and optimization, and parallel and distributed computing. My recent work focuses on developing inference systems capable of deploying LLM and DNN models in large-scale cloud clusters, aiming for peak performance, efficiency and scalability through techniques such as computational acceleration, parallel optimization, and resource orchestration. In collaboration with research institutions such as IBM Watson, Samsung Research and Microsoft Research, I have published tens of papers at leading conferences and journals, including SoCC, Ubicomp, INFOCOM, IWQoS, ASPLOS, SIGCOMM and TPDS. My research has been funded by NSFC, among others. I have received honors such as Chun-Tsung Scholar from Shanghai Jiao Tong University and Qiming Scholar from Tianjin University.
I am currently developing Twen.ai, the very first university Q&A large language model. Empowered by RAG techniques, Twen answers everyday questions from students and faculty in areas such as daily life, scholarship selection, and further studies. Twen was officially released in April 2024 and has served thousands of requests each day since then.
I am looking for self-motivated students interested in building systems for large language models and deep neural networks. Feel free to drop me an email if you want to join us!
Research
My research aims to build inference systems capable of deploying LLM and DNN models in large-scale cloud clusters with peak performance, efficiency and scalability.
Large Language Model System
- Serving Classic LLM: Serving LLM applications brings new challenges due to their huge memory consumption and unpredictable output length. We designed novel LLM inference systems (qLLM, tgLLM) to minimize job completion time across LLM requests and to maximize model throughput and resource utilization. We also built various inference systems (InferRAG, InferMM) to manage computation resources in scenarios such as RAG and multi-modal inference.
- Serving Specialized LLM: Recent innovations in LLM architecture also bring new challenges. We designed specialized inference systems (SpecInfer, ParaMoE) to optimize the inference pipeline for speculative decoding and mixture-of-experts models. We have also investigated related topics such as lookahead decoding, LoRA serving, and KV-cache optimization.
Deep Neural Network System
- Latency-Sensitive Inference: To guarantee a good user experience, DNN-based applications are usually associated with a latency objective. We designed various model orchestration systems (Harpagon, DeepLat, TopInfer) to minimize the serving cost under a latency objective via techniques such as dynamic batching, request dispatching and configuration decoupling. We also built various resource scaling systems (SLOpt, DeepChain) to maximize system goodput under bursty workloads via techniques such as AoT compilation and model pre-warmup.
- Complex Scenarios: Depending on the use case, DNN-based applications face various deployment requirements. We have designed multi-stage inference systems (Scrooge, Rim, Olympian) to manage DNN models in edge/cloud GPU clusters via techniques such as model co-location and model promotion. We also built specialized systems (ALPS, HRL) to handle complex scenarios such as multi-modal input and heterogeneous hardware.
Selected Publications
- [SoCC 24] Pre-Warming is Not Enough: Accelerating Serverless Inference With Opportunistic Pre-Loading (CCF-B)
- [SIGCOMM 24] PPT: A Pragmatic Transport for Datacenters (CCF-A)
- [ASPLOS 24] FUYAO: DPU-enabled Direct Data Transfer for Serverless Computing (CCF-A)
- [IWQoS 23] High-throughput Sampling, Communicating and Training for Reinforcement Learning Systems (CCF-B)
- [TPDS 23] Accelerating Data Delivery of Latency-Sensitive Applications in Container Overlay Network (CCF-A)
- [SoCC 21] Scrooge: A Cost-Effective Deep Learning Inference System (CCF-B)
- [Middleware 18] Olympian: Scheduling GPU Usage in a Deep Neural Network Model Serving System (CCF-B)
- [Ubicomp 16] ALPS: Accurate Landmark Positioning at City Scales (CCF-A)
- [INFOCOM 14] Critical Sensing Range for Mobile Heterogeneous Camera Sensor Networks (CCF-A)
Honors and Awards
- Qiming Scholar, Tianjin University, 2023
- Chun-Tsung Scholar (1st at SJTU), Shanghai Jiao Tong University, 2014
- Valedictorian at SEIEE, Shanghai Jiao Tong University, 2014
Teaching
- Computer Systems, TJU, 23Spring, 24Spring
- Design and Analysis of Algorithms, TJU, 23Fall
- Introduction to Internetworking, USC, 16Spring
Students
- Zhixin Zhao (PhD, 2022 - Now) [1]
- Liang Zheng (PhD, 2024 - Now) [2]
- Jiaheng Gao (MS, 2022 - Now)
- Linxuan Li (MS, 2022 - Now)
- Guotao Yang (MS, 2023 - Now) [1]
- Ziqi Gong (MS, 2023 - Now)
- Chen Shen (MS, 2023 - Now)
- Jingyuan Xiao (MS, 2024 - Now)
- Jinjun Yi (MS, 2024 - Now)
- Zhengchao Wang (MS, 2024 - Now)
- Tao Wang (MS, 2024 - Now)
- Wenxin Zhu (BS, 2023 - Now)
- Mingfang Ji (BS, 2023 - Now)
- Kai Zeng (BS, 2023 - Now)
- Zhenyi Zhong (BS, 2024 - Now)
- Ke Wang (BS, 2024 - Now)
- Junhao Li (BS, 2024 - Now)
- Hao Ding (BS, 2024 - Now)
Alumni
- Yingqin Chen (MS, 2024) [2] -> China Mobile
- Jingyuan Xiao (BS, 2024) -> MS at TJU
- 1. co-advised with Prof. Wenyu Qu
- 2. co-advised with Prof. Keqiu Li