TACC is an efficient AI computing infrastructure optimized for machine learning applications in large-scale GPU clusters

TACC supports research on a wide range of machine learning applications with high-performance, scalable infrastructure at both the software and hardware levels. TACC is tailored to the workflow of machine learning applications, providing a more efficient process for managing, deploying, and scaling compute-intensive machine learning jobs in a computing cluster.

Unified Interface

The unified interface processes user job profiles for the machine learning cluster and offers tools for job monitoring, output retrieval, and log streaming. Jobs can be submitted directly from local environments, and environment provisioning scripts are applied before submission to the scheduler.
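As an illustration of this workflow (the actual TACC client, field names, and commands are not shown here and are assumptions for this sketch), a user-side job profile might bundle the job command, its resource request, and the provisioning steps that run before the job reaches the scheduler:

```python
# Hypothetical sketch of a job profile; the real TACC interface
# and its field names may differ.
from dataclasses import dataclass, field


@dataclass
class JobProfile:
    name: str
    command: str
    gpus: int
    # Environment provisioning steps, applied before scheduler submission.
    provision: list = field(default_factory=list)

    def to_submission(self) -> dict:
        # Provisioning runs first, then the user's command.
        script = self.provision + [self.command]
        return {"name": self.name, "gpus": self.gpus, "script": script}


job = JobProfile(
    name="resnet-train",
    command="python train.py --epochs 90",
    gpus=4,
    provision=["pip install -r requirements.txt"],
)
submission = job.to_submission()
```

In this sketch, the interface expands the profile into a self-contained submission, so the scheduler receives an already-provisioned job description rather than a bare command.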

ML Systems

ML frameworks optimize model development with state-of-the-art parallelization and distributed training techniques. These strategies improve data processing and algorithm execution, enabling efficient handling of complex computations and large datasets.

Cluster Scheduler

The cluster scheduler optimizes resource allocation in a machine learning cluster by analyzing job performance factors like completion time, resource utilization, and efficiency. This critical component boosts cluster efficiency and throughput.
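To make the idea concrete, here is a toy scheduling rule, not TACC's actual policy: among queued jobs that fit the free GPU budget, pick the one with the shortest estimated completion time. The job fields and the scoring rule are assumptions for illustration only.

```python
# Illustrative toy scheduler: shortest-estimated-job-first among
# jobs that fit the currently free GPUs. Not TACC's actual policy.

def pick_next(jobs, free_gpus):
    """jobs: list of dicts with 'name', 'gpus', and 'est_time' keys."""
    feasible = [j for j in jobs if j["gpus"] <= free_gpus]
    if not feasible:
        return None
    # Shortest-job-first reduces average completion time on a single queue.
    return min(feasible, key=lambda j: j["est_time"])


queue = [
    {"name": "a", "gpus": 8, "est_time": 120},
    {"name": "b", "gpus": 2, "est_time": 30},
    {"name": "c", "gpus": 4, "est_time": 10},
]
print(pick_next(queue, free_gpus=4)["name"])  # -> c
```

A production scheduler would weigh more factors (fairness across users, GPU utilization, placement locality), but the structure is the same: score feasible jobs against the cluster state and dispatch the best candidate.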

AI-centric Networking

Our research enhances AI workloads by efficiently managing the transport of large models and utilizing SmartNICs for compute offloading. This setup optimizes data flow and reduces latency, ensuring high performance and scalability for AI applications.

Research publications
ASPLOS '25: Design and Operation of Shared Machine Learning Clusters on Campus
Kaiqiang Xu, Decang Sun, Hao Wang, Zhenghang Ren, Xinchen Wan, Xudong Liao, Zilong Wang, Junxue Zhang, Kai Chen
TACC: A Full-stack Cloud Computing Infrastructure for Machine Learning Tasks
Kaiqiang Xu, Xinchen Wan, Hao Wang, Zhenghang Ren, Xudong Liao, Decang Sun, Chaoliang Zeng, Kai Chen

View More Publications


TACC Clusters at HKUST

2

TACC-managed clusters

397

Active TACC users since 2021/05

24,303

Tasks processed on TACC clusters

235,739

GPU hours used for ML tasks

TACC at HKUST

At HKUST, TACC clusters provide over 160 GPU cards for machine learning research and education, with open access to the research community.

Compared to the beginning of 2023, TACC's active users grew 84% over the year to 397, and processed ML tasks rose 115% to 24,303.

TACC supports over 40 research projects and has so far received 22 citations at top conferences, including SIGMOD, KDD, CVPR, and UbiComp.

By researchers, for researchers

TACC is powered by the SING AI Cloud infrastructure solution, an advanced approach to cluster management and task handling that enhances efficiency and reliability.
Our research-backed solution includes comprehensive hardware monitoring, streamlined maintenance, and efficient job scheduling and execution.

High Stability

TACC enhances operational reliability by providing continuous 24/7 hardware monitoring, robust containerization options, and equitable and efficient job scheduling mechanisms.

Enhanced Usability

Task provisioning and management are effortless with TACC. Users can submit and monitor tasks via command line, web UI, or API, making the system highly accessible.

Maximized Performance

TACC incorporates state-of-the-art ML systems and network technologies from the academic sector, tailored for maximizing performance and scalability in AI applications.

TACC Team

We are a group of system and networking PhDs dedicated to advancing the efficiency, scalability, and sustainability of large-scale computing infrastructures, with a focus on AI workloads, GPU cluster management, and cloud optimization.