TACC is an efficient AI computing infrastructure optimized for machine learning applications in large-scale GPU clusters

TACC supports research on a wide range of machine learning applications with high-performance, scalable infrastructure at both the software and hardware levels. TACC is tailored to the workflow of machine learning applications, providing a more efficient process for managing, deploying, and scaling compute-intensive machine learning jobs in a computing cluster.

Unified Interface

The unified interface processes user job profiles for the machine learning cluster and offers tools for job monitoring, output retrieval, and log streaming. Jobs can be submitted directly from local environments, and environment provisioning scripts are applied before submission to the scheduler.
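As an illustration of this workflow (the actual TACC client, field names, and commands are not shown here and are assumptions for this sketch), a user-side job profile might bundle the job command, its resource request, and the provisioning steps that run before the job reaches the scheduler:

```python
# Hypothetical sketch of a job profile; the real TACC interface
# and its field names may differ.
from dataclasses import dataclass, field


@dataclass
class JobProfile:
    name: str
    command: str
    gpus: int
    # Environment provisioning steps, applied before scheduler submission.
    provision: list = field(default_factory=list)

    def to_submission(self) -> dict:
        # Provisioning runs first, then the user's command.
        script = self.provision + [self.command]
        return {"name": self.name, "gpus": self.gpus, "script": script}


job = JobProfile(
    name="resnet-train",
    command="python train.py --epochs 90",
    gpus=4,
    provision=["pip install -r requirements.txt"],
)
submission = job.to_submission()
```

In this sketch, the interface expands the profile into a self-contained submission, so the scheduler receives an already-provisioned job description rather than a bare command.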

ML Systems

ML frameworks optimize model development with state-of-the-art parallelization and distributed training techniques. These strategies improve data processing and algorithm execution, enabling efficient handling of complex computations and large datasets.

Cluster Scheduler

The cluster scheduler optimizes resource allocation in a machine learning cluster by analyzing job performance factors like completion time, resource utilization, and efficiency. This critical component boosts cluster efficiency and throughput.
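To make the idea concrete, here is a toy scheduling rule, not TACC's actual policy: among queued jobs that fit the free GPU budget, pick the one with the shortest estimated completion time. The job fields and the scoring rule are assumptions for illustration only.

```python
# Illustrative toy scheduler: shortest-estimated-job-first among
# jobs that fit the currently free GPUs. Not TACC's actual policy.

def pick_next(jobs, free_gpus):
    """jobs: list of dicts with 'name', 'gpus', and 'est_time' keys."""
    feasible = [j for j in jobs if j["gpus"] <= free_gpus]
    if not feasible:
        return None
    # Shortest-job-first reduces average completion time on a single queue.
    return min(feasible, key=lambda j: j["est_time"])


queue = [
    {"name": "a", "gpus": 8, "est_time": 120},
    {"name": "b", "gpus": 2, "est_time": 30},
    {"name": "c", "gpus": 4, "est_time": 10},
]
print(pick_next(queue, free_gpus=4)["name"])  # -> c
```

A production scheduler would weigh more factors (fairness across users, GPU utilization, placement locality), but the structure is the same: score feasible jobs against the cluster state and dispatch the best candidate.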

AI-centric Networking

Our research enhances AI workloads by efficiently managing the transport of large models and utilizing SmartNICs for compute offloading. This setup optimizes data flow and reduces latency, ensuring high performance and scalability for AI applications.

Research publications
ASPLOS '25: Design and Operation of Shared Machine Learning Clusters on Campus
Kaiqiang Xu, Decang Sun, Hao Wang, Zhenghang Ren, Xinchen Wan, Xudong Liao, Zilong Wang, Junxue Zhang, Kai Chen
TACC: A Full-stack Cloud Computing Infrastructure for Machine Learning Tasks
Kaiqiang Xu, Xinchen Wan, Hao Wang, Zhenghang Ren, Xudong Liao, Decang Sun, Chaoliang Zeng, Kai Chen

View More Publications


TACC Clusters at HKUST

2

TACC-managed clusters

397

Active TACC users since 2021/05

24,303

Tasks processed on TACC clusters

235,739

GPU hours used for ML tasks

TACC at HKUST

At HKUST, TACC clusters provide over 160 GPU cards for machine learning research and education, with open access to the research community.

Compared to the beginning of 2023, TACC's active users grew 84% over the year to 397, and processed ML tasks rose 115% to 24,303.

TACC supports over 40 research projects and has so far received 22 citations at top conferences, including SIGMOD, KDD, CVPR, and UbiComp.

By researchers, for researchers

TACC is powered by the SING AI Cloud infrastructure solution, an advanced approach to cluster management and task handling that enhances efficiency and reliability.
Our research-backed solution includes comprehensive hardware monitoring, streamlined maintenance, and efficient job scheduling and execution.

High Stability

TACC enhances operational reliability by providing continuous 24/7 hardware monitoring, robust containerization options, and equitable and efficient job scheduling mechanisms.

Enhanced Usability

Task provisioning and management are effortless with TACC. Users can submit and monitor tasks via command line, web UI, or API, making the system highly accessible.

Maximized Performance

TACC incorporates state-of-the-art ML systems and network technologies from the academic sector, tailored for maximizing performance and scalability in AI applications.

TACC Team

We are a group of system and networking PhDs dedicated to advancing the efficiency, scalability, and sustainability of large-scale computing infrastructures, with a focus on AI workloads, GPU cluster management, and cloud optimization.