TACC is an efficient cluster management solution optimized for machine learning applications in large-scale GPU clusters

TACC supports research on a wide range of machine learning applications with high-performance, scalable infrastructure at both the software and hardware levels. TACC is tailored to the workflow of machine learning applications, giving you a more efficient way to manage, deploy, and scale compute-intensive machine learning jobs in a computing cluster.

Unified Interface

The unified interface processes user job profiles for the machine learning cluster and offers tools for job monitoring, output retrieval, and log streaming. It enables job submission from local environments and incorporates environment provisioning scripts before scheduler submission.

ML Systems

ML frameworks optimize model development with state-of-the-art parallelization and distributed training techniques. These strategies improve data processing and algorithm execution, enabling efficient handling of complex computations and large datasets.

Cluster Scheduler

The cluster scheduler optimizes resource allocation in a machine learning cluster by analyzing job performance factors like completion time, resource utilization, and efficiency. This critical component boosts cluster efficiency and throughput.

AI-centric Networking

Our research enhances AI workloads by efficiently managing the transport of large models and utilizing SmartNICs for compute offloading. This setup optimizes data flow and reduces latency, ensuring high performance and scalability for AI applications.

Research publications
TACC: A Full-stack Cloud Computing Infrastructure for Machine Learning Tasks
Kaiqiang Xu, Xinchen Wan, Hao Wang, Zhenghang Ren, Xudong Liao, Decang Sun, Chaoliang Zeng, Kai Chen
Towards Domain-Specific Network Transport for Distributed DNN Training (NSDI '24)
Hao Wang, Han Tian, Jingrong Chen, Xinchen Wan, Jiacheng Xia, Gaoxiong Zeng, Wei Bai, Junchen Jiang, Yong Wang, Kai Chen



TACC Clusters at HKUST

2

TACC-managed clusters

397

Active TACC users since 2021/05

24,303

Tasks processed on TACC clusters

235,739

GPU hours used for ML tasks

TACC at HKUST

At HKUST, TACC manages clusters totaling more than 160 GPU cards for machine learning research and education, with open access for the research community.

Since the beginning of 2023, TACC has seen an 84% increase in active users, to 397, and a 115% rise in processed ML tasks, to 24,303.
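
The growth figures above imply approximate starting values for the beginning of 2023. The short sketch below works them out. The function name `implied_baseline` is illustrative only, and because the published percentages are rounded, the results are estimates rather than reported numbers.

```python
def implied_baseline(current: int, percent_increase: float) -> int:
    """Estimate the starting value that, after the given percentage
    increase, yields the current value (rounded to the nearest integer)."""
    return round(current / (1 + percent_increase / 100))


# Figures quoted on this page: 397 active users after an 84% increase,
# and 24,303 processed ML tasks after a 115% rise.
baseline_users = implied_baseline(397, 84)
baseline_tasks = implied_baseline(24_303, 115)

print(f"Estimated active users at the start of 2023: {baseline_users}")
print(f"Estimated processed ML tasks at the start of 2023: {baseline_tasks}")
```

Under these assumptions, the estimates come to roughly 216 active users and roughly 11,300 processed tasks at the start of 2023.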

TACC supports over 40 research projects and has so far received 22 citations at top conferences, including SIGMOD, KDD, CVPR, and UbiComp.

By researchers, for researchers

Adopt TACC in your cluster for an advanced approach to cluster management and task handling that improves efficiency and reliability.
Our research-backed solution includes comprehensive hardware monitoring, streamlined maintenance, and efficient job scheduling and execution.

High Stability

TACC improves operational reliability with 24/7 hardware monitoring, robust containerization options, and fair, efficient job scheduling.

Enhanced Usability

Task provisioning and management are effortless with TACC. Users can submit and monitor tasks via command line, web UI, or API, making the system highly accessible.

Maximized Performance

TACC incorporates state-of-the-art ML systems and network technologies from academic research, tailored to maximize performance and scalability in AI applications.