Towards High-Performance, Privacy-Preserving AI Computing
State-of-the-Art Research & Applications
System for Machine Learning
Scalable and Efficient GNN Training for Large Graphs
Authors: Xinchen Wan, Kaiqiang Xu, Xudong Liao, Yilun Jin, Kai Chen, Xin Jin
We present G3, a distributed system that efficiently trains GNNs with near-linear scalability over billion-edge graphs. G3 introduces GNN hybrid parallelism to scale out GNN training by sharing intermediate results peer-to-peer at fine granularity. It leverages locality-aware iterative partitioning and multi-level pipeline scheduling to exploit acceleration opportunities.
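The fine-grained peer-to-peer sharing above can be illustrated with a toy sketch: each partition aggregates neighbor features for the destination nodes it owns using only its local edges, and partial sums for the same node are then merged across partitions. All function and variable names here are illustrative, not G3's actual API.

```python
def local_partial_agg(part_nodes, edges, feats):
    """Toy per-partition step: aggregate (sum) neighbor features for the
    destination nodes this partition owns, using only locally-stored edges."""
    agg = {}
    for src, dst in edges:
        if dst in part_nodes:
            agg[dst] = agg.get(dst, 0.0) + feats[src]
    return agg

def merge_partials(partials):
    """Merge per-partition partial aggregations for the same node -- the
    peer-to-peer exchange step, collapsed to a local sum in this sketch."""
    out = {}
    for p in partials:
        for node, val in p.items():
            out[node] = out.get(node, 0.0) + val
    return out
```

Because each partition ships only small partial sums for boundary nodes rather than full neighbor feature sets, communication stays fine-grained and can be overlapped with computation.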
Egeria: An Efficient DNN Training System with Knowledge-Guided Layer Freezing
Authors: Yiding Wang, Decang Sun, Fan Lai, Mosharaf Chowdhury, Kai Chen
We design Egeria, a knowledge-guided DNN training system that employs semantic knowledge from a reference model to accurately evaluate each layer's training plasticity and safely freeze the converged ones, saving their corresponding backward computation and communication. Our reference model is generated on the fly using quantization techniques and runs forward operations asynchronously on available CPUs to minimize the overhead.
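A minimal sketch of knowledge-guided freezing, under simplifying assumptions (a toy mean-absolute-gap plasticity metric and front-to-back prefix freezing; the names and threshold are illustrative, not Egeria's actual algorithm):

```python
def layer_plasticity(layer_out, ref_out):
    """Toy plasticity signal: mean absolute gap between a training layer's
    output and the reference model's output for the same layer."""
    return sum(abs(a - b) for a, b in zip(layer_out, ref_out)) / len(layer_out)

def freezing_plan(train_acts, ref_acts, threshold=0.05):
    """Return per-layer freeze decisions. Layers freeze front-to-back: once
    a layer is still plastic, it and all deeper layers keep training, so
    only a converged prefix skips its backward pass."""
    plan, frozen_prefix = [], True
    for t_out, r_out in zip(train_acts, ref_acts):
        converged = layer_plasticity(t_out, r_out) < threshold
        frozen_prefix = frozen_prefix and converged
        plan.append(frozen_prefix)
    return plan
```

Freezing a prefix of layers lets the trainer skip both their backward computation and the gradient synchronization traffic those layers would otherwise generate.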
Privacy-Preserving Computing
Sphinx: Enabling Privacy-Preserving Online Learning over the Cloud
Authors: Han Tian, Chaoliang Zeng, Zhenghang Ren, Di Chai, Kai Chen, Qiang Yang
We present Sphinx, a privacy-preserving online learning system that strikes a balance between model performance, computational efficiency, and privacy preservation. At its core, Sphinx combines homomorphic encryption and differential privacy reciprocally, so that most of the model's parameters remain plaintext, enabling fast training and inference protocol designs.
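To illustrate the differential-privacy side of this split, here is a toy DP update on the plaintext parameters: the gradient is clipped to bound sensitivity, then Gaussian noise is added. This is a generic DP-SGD-style sketch, not Sphinx's actual protocol, and the HE-protected portion of the model is omitted entirely.

```python
import math
import random

def dp_update(weights, grads, lr=0.1, clip=1.0, sigma=0.5, rng=None):
    """One differentially-private update on plaintext parameters:
    clip the gradient norm to `clip`, add Gaussian noise scaled by
    `sigma * clip`, then take a gradient step."""
    rng = rng or random.Random(0)
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, clip / norm) if norm > 0 else 1.0
    noisy = [g * scale + rng.gauss(0.0, sigma * clip) for g in grads]
    return [w - lr * g for w, g in zip(weights, noisy)]
```

Keeping these parameters plaintext avoids the heavy cost of homomorphic operations on every weight, while the DP noise bounds what the released updates reveal.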
FedEval: A Benchmark System with a Comprehensive Evaluation Model for Federated Learning
Authors: Di Chai, Leye Wang, Junxue Zhang, Kai Chen, Qiang Yang
We propose a comprehensive evaluation framework for FL systems. Specifically, we first introduce the PRACT model, which defines five metrics essential to FL evaluation: Privacy, Robustness, Accuracy, Communication, and Time efficiency. Then we design and implement a benchmarking system called FedEval, which enables the systematic evaluation and comparison of existing works under consistent experimental conditions.
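The five PRACT metrics can be sketched as a simple evaluation record, with a helper that checks whether one system dominates another on all five axes. The field names and the `dominates` helper are illustrative assumptions, not FedEval's actual API.

```python
from dataclasses import dataclass

@dataclass
class PractReport:
    """One FL system's scores on the five PRACT metrics."""
    privacy_leakage: float    # lower is better (e.g. attack recovery rate)
    robustness: float         # higher is better (accuracy under non-IID data)
    accuracy: float           # higher is better (final test accuracy)
    communication_mb: float   # lower is better (total bytes exchanged)
    time_seconds: float       # lower is better (wall-clock to target accuracy)

def dominates(a: PractReport, b: PractReport) -> bool:
    """True if `a` is at least as good as `b` on every PRACT metric."""
    return (a.privacy_leakage <= b.privacy_leakage
            and a.robustness >= b.robustness
            and a.accuracy >= b.accuracy
            and a.communication_mb <= b.communication_mb
            and a.time_seconds <= b.time_seconds)
```

Reporting all five metrics together prevents the common pitfall of comparing FL systems on accuracy alone while hiding their privacy or communication costs.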
Network for Machine Learning
Domain-Specific Communication Optimization for Distributed DNN Training
Authors: Hao Wang, Jingrong Chen, Xinchen Wan, Han Tian, Jiacheng Xia, Gaoxiong Zeng, Kai Chen, Wei Bai, Junchen Jiang
We present DLCP, a network substrate that speeds up DNN training by fully embracing several unique characteristics of deep learning. Compared to prior work in this space, DLCP integrates three simple-yet-effective techniques to form a multi-layered protection against long-tail latency caused by transient packet drops and queueing.
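One reason this works is that gradient aggregation tolerates a bounded amount of loss: a dropped gradient packet can be skipped rather than retransmitted, avoiding the tail-latency stall. A toy sketch of that idea (illustrative only, not DLCP's actual mechanism):

```python
def drop_tolerant_average(worker_grads, grad_len):
    """Average gradients from workers, skipping packets lost in transit
    (represented as None) instead of stalling on retransmissions --
    a toy version of bounded-loss-tolerant aggregation."""
    received = [g for g in worker_grads if g is not None]
    if not received:
        return [0.0] * grad_len
    return [sum(g[i] for g in received) / len(received)
            for i in range(grad_len)]
```

Because SGD averages noisy gradients anyway, occasionally averaging over fewer workers perturbs the update only slightly, while waiting for a retransmission timeout would stall the whole training iteration.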