Towards High-Performance, Privacy-Preserving AI Computing
State-of-the-Art Research & Applications
System for Machine Learning
Scalable and Efficient GNN Training for Large Graphs
Authors: Xinchen Wan, Kaiqiang Xu, Xudong Liao, Yilun Jin, Kai Chen, Xin Jin
We present G3, a distributed system that efficiently trains GNNs with near-linear scalability over billion-edge graphs. G3 introduces GNN hybrid parallelism to scale out GNN training by sharing intermediate results peer-to-peer at fine granularity. It leverages locality-aware iterative partitioning and multi-level pipeline scheduling to exploit acceleration opportunities.
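The fine-grained peer-to-peer sharing above can be illustrated with a toy sketch: each partition aggregates neighbor features for the destination nodes it owns using only its local edges, and partial sums for the same node are then merged across partitions. All function and variable names here are illustrative, not G3's actual API.

```python
def local_partial_agg(part_nodes, edges, feats):
    """Toy per-partition step: aggregate (sum) neighbor features for the
    destination nodes this partition owns, using only locally-stored edges."""
    agg = {}
    for src, dst in edges:
        if dst in part_nodes:
            agg[dst] = agg.get(dst, 0.0) + feats[src]
    return agg

def merge_partials(partials):
    """Merge per-partition partial aggregations for the same node -- the
    peer-to-peer exchange step, collapsed to a local sum in this sketch."""
    out = {}
    for p in partials:
        for node, val in p.items():
            out[node] = out.get(node, 0.0) + val
    return out
```

Because each partition ships only small partial sums for boundary nodes rather than full neighbor feature sets, communication stays fine-grained and can be overlapped with computation.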
Egeria: An Efficient DNN Training System with Knowledge-Guided Layer Freezing
Authors: Yiding Wang, Decang Sun, Fan Lai, Mosharaf Chowdhury, Kai Chen
We design Egeria, a knowledge-guided DNN training system that employs semantic knowledge from a reference model to accurately evaluate each layer's training plasticity and safely freeze the converged ones, saving their corresponding backward computation and communication. Our reference model is generated on the fly using quantization techniques and runs forward operations asynchronously on available CPUs to minimize the overhead.
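A minimal sketch of knowledge-guided freezing, under simplifying assumptions (a toy mean-absolute-gap plasticity metric and front-to-back prefix freezing; the names and threshold are illustrative, not Egeria's actual algorithm):

```python
def layer_plasticity(layer_out, ref_out):
    """Toy plasticity signal: mean absolute gap between a training layer's
    output and the reference model's output for the same layer."""
    return sum(abs(a - b) for a, b in zip(layer_out, ref_out)) / len(layer_out)

def freezing_plan(train_acts, ref_acts, threshold=0.05):
    """Return per-layer freeze decisions. Layers freeze front-to-back: once
    a layer is still plastic, it and all deeper layers keep training, so
    only a converged prefix skips its backward pass."""
    plan, frozen_prefix = [], True
    for t_out, r_out in zip(train_acts, ref_acts):
        converged = layer_plasticity(t_out, r_out) < threshold
        frozen_prefix = frozen_prefix and converged
        plan.append(frozen_prefix)
    return plan
```

Freezing a prefix of layers lets the trainer skip both their backward computation and the gradient synchronization traffic those layers would otherwise generate.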
Privacy-Preserving Computing
Sphinx: Enabling Privacy-Preserving Online Learning over the Cloud
Authors: Han Tian, Chaoliang Zeng, Zhenghang Ren, Di Chai, Kai Chen, Qiang Yang
We present Sphinx, a privacy-preserving online learning system that strikes a balance between model performance, computational efficiency, and privacy preservation. At its core, Sphinx combines homomorphic encryption and differential privacy reciprocally, so that most of the model's parameters remain plaintext, enabling fast training and inference protocol designs.
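To illustrate the differential-privacy side of this split, here is a toy DP update on the plaintext parameters: the gradient is clipped to bound sensitivity, then Gaussian noise is added. This is a generic DP-SGD-style sketch, not Sphinx's actual protocol, and the HE-protected portion of the model is omitted entirely.

```python
import math
import random

def dp_update(weights, grads, lr=0.1, clip=1.0, sigma=0.5, rng=None):
    """One differentially-private update on plaintext parameters:
    clip the gradient norm to `clip`, add Gaussian noise scaled by
    `sigma * clip`, then take a gradient step."""
    rng = rng or random.Random(0)
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, clip / norm) if norm > 0 else 1.0
    noisy = [g * scale + rng.gauss(0.0, sigma * clip) for g in grads]
    return [w - lr * g for w, g in zip(weights, noisy)]
```

Keeping these parameters plaintext avoids the heavy cost of homomorphic operations on every weight, while the DP noise bounds what the released updates reveal.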
FedEval: A Benchmark System with a Comprehensive Evaluation Model for Federated Learning
Authors: Di Chai, Leye Wang, Junxue Zhang, Kai Chen, Qiang Yang
We propose a comprehensive evaluation framework for FL systems. Specifically, we first introduce the PRACT model, which defines five metrics essential to FL evaluation: Privacy, Robustness, Accuracy, Communication, and Time efficiency. Then we design and implement a benchmarking system called FedEval, which enables the systematic evaluation and comparison of existing works under consistent experimental conditions.
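The five PRACT metrics can be sketched as a simple evaluation record, with a helper that checks whether one system dominates another on all five axes. The field names and the `dominates` helper are illustrative assumptions, not FedEval's actual API.

```python
from dataclasses import dataclass

@dataclass
class PractReport:
    """One FL system's scores on the five PRACT metrics."""
    privacy_leakage: float    # lower is better (e.g. attack recovery rate)
    robustness: float         # higher is better (accuracy under non-IID data)
    accuracy: float           # higher is better (final test accuracy)
    communication_mb: float   # lower is better (total bytes exchanged)
    time_seconds: float       # lower is better (wall-clock to target accuracy)

def dominates(a: PractReport, b: PractReport) -> bool:
    """True if `a` is at least as good as `b` on every PRACT metric."""
    return (a.privacy_leakage <= b.privacy_leakage
            and a.robustness >= b.robustness
            and a.accuracy >= b.accuracy
            and a.communication_mb <= b.communication_mb
            and a.time_seconds <= b.time_seconds)
```

Reporting all five metrics together prevents the common pitfall of comparing FL systems on accuracy alone while hiding their privacy or communication costs.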
Network for Machine Learning
Domain-Specific Communication Optimization for Distributed DNN Training
Authors: Hao Wang, Jingrong Chen, Xinchen Wan, Han Tian, Jiacheng Xia, Gaoxiong Zeng, Kai Chen, Wei Bai, Junchen Jiang
We present DLCP, a network substrate that speeds up DNN training by fully embracing several unique characteristics of deep learning. Compared to prior work in this space, DLCP integrates three simple-yet-effective techniques to form a multi-layered protection against long-tail latency caused by transient packet drops and queueing.
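One reason this works is that gradient aggregation tolerates a bounded amount of loss: a dropped gradient packet can be skipped rather than retransmitted, avoiding the tail-latency stall. A toy sketch of that idea (illustrative only, not DLCP's actual mechanism):

```python
def drop_tolerant_average(worker_grads, grad_len):
    """Average gradients from workers, skipping packets lost in transit
    (represented as None) instead of stalling on retransmissions --
    a toy version of bounded-loss-tolerant aggregation."""
    received = [g for g in worker_grads if g is not None]
    if not received:
        return [0.0] * grad_len
    return [sum(g[i] for g in received) / len(received)
            for i in range(grad_len)]
```

Because SGD averages noisy gradients anyway, occasionally averaging over fewer workers perturbs the update only slightly, while waiting for a retransmission timeout would stall the whole training iteration.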