Data Center and Machine Learning Research

Report on Current Developments in Data Center and Machine Learning Research

General Direction of the Field

The recent advancements in data center management and machine learning (ML) operations reflect a concerted effort to optimize performance, enhance security, and streamline resource allocation. The field is moving towards more intelligent, measurement-based approaches to resource management in data centers, leveraging network monitoring and scheduling frameworks to ensure performance guarantees for cloud applications. This is particularly evident in the integration of Software-Defined Networking (SDN) and programmable dataplanes, which facilitate more dynamic and responsive network control.

In the realm of distributed machine learning, there is a growing emphasis on standardizing collective algorithms and improving communication efficiency. This includes the development of frameworks that optimize small message communication and exploit shared memory interfaces like Compute Express Link (CXL) to reduce latency and enhance security. The trend also extends to the scaling of Deep Neural Networks (DNNs) on heterogeneous GPU clusters, where systems like Poplar are designed to maximize computational efficiency across diverse hardware configurations.
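The latency benefit of shared-memory interfaces comes from passing references instead of copies: the payload lives in memory both endpoints can map, and only a small descriptor crosses the channel. The sketch below illustrates that idea, using Python's multiprocessing.shared_memory as a stand-in for a CXL shared-memory window; the payload and segment layout are hypothetical, and this is an illustration of the general technique, not any specific framework's implementation.

```python
# Sketch of shared-memory message passing: place the payload in a region both
# endpoints can map, then exchange only a small descriptor (name, offset,
# length) instead of serializing and copying the payload itself.
# multiprocessing.shared_memory stands in for a CXL shared-memory window here.
from multiprocessing import shared_memory

PAYLOAD = b"request: scale cluster to 42 nodes"

# Sender: write the payload once into a shared segment.
shm = shared_memory.SharedMemory(create=True, size=len(PAYLOAD))
shm.buf[: len(PAYLOAD)] = PAYLOAD

# The message that actually crosses the channel is just this tiny descriptor.
descriptor = (shm.name, 0, len(PAYLOAD))

# Receiver: attach to the same segment and read in place; the payload itself
# was never copied or serialized, only the descriptor moved.
name, offset, length = descriptor
view = shared_memory.SharedMemory(name=name)
received = bytes(view.buf[offset : offset + length]).decode()
print(received)

view.close()
shm.close()
shm.unlink()
```

In a real CXL deployment the descriptor would carry a pointer into the shared coherent address space rather than a named OS segment, but the copy-avoidance principle is the same.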

Machine Learning Operations (MLOps) continues to evolve, focusing on automating the ML model lifecycle from experimentation through deployment and monitoring. The discipline aims to integrate development and production environments more seamlessly, so that models are efficiently managed and monitored throughout their lifecycle.


Noteworthy Developments

  • Telepathic Datacenters: Fast RPCs using Shared CXL Memory: Introduces RPCool, a framework that exploits CXL's shared-memory capabilities to reduce RPC round-trip latency, a 1.93x improvement over state-of-the-art RDMA mechanisms and 7.2x over CXL-based ones.
  • PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters: By accounting for performance variability in ML workloads, the PAL scheduler improves job completion time by 42%, cluster utilization by 28%, and makespan by 47%, while enhancing load balancing in GPU clusters.
  • Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters: Poplar improves training throughput by 1.02-3.92x by better utilizing heterogeneous GPUs, a significant advance in distributed DNN training efficiency.
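The heterogeneous-cluster gains reported by Poplar-style systems come from keeping fast and slow GPUs equally busy rather than letting the slowest device pace each step. A minimal sketch of one such balancing rule, throughput-proportional batch splitting, is shown below; the device names and throughput numbers are hypothetical, and this is not Poplar's published algorithm.

```python
# Illustrative sketch: split a global batch across heterogeneous GPUs in
# proportion to each GPU's measured throughput, so faster devices receive
# proportionally more samples per step and all devices finish together.

def partition_batch(global_batch: int, throughputs: dict[str, float]) -> dict[str, int]:
    """Assign per-GPU micro-batch sizes proportional to throughput (samples/s)."""
    total = sum(throughputs.values())
    shares = {gpu: int(global_batch * tput / total) for gpu, tput in throughputs.items()}
    # Hand any remainder from integer truncation to the fastest GPUs first.
    remainder = global_batch - sum(shares.values())
    for gpu in sorted(throughputs, key=throughputs.get, reverse=True)[:remainder]:
        shares[gpu] += 1
    return shares

# Example: an A100 roughly 2x faster than a V100 gets ~2x the micro-batch.
print(partition_batch(96, {"A100": 400.0, "V100": 200.0}))
# → {'A100': 64, 'V100': 32}
```

Real systems also have to account for memory limits and communication costs when partitioning, but proportional splitting captures the core load-balancing intuition.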

These developments highlight the innovative strides being made in optimizing data center operations and machine learning workflows, setting new benchmarks for performance and efficiency in these critical areas.
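The variability-aware scheduling idea highlighted above can be made concrete with a toy greedy placer: each GPU carries a measured slowdown factor, and jobs go to the device with the earliest predicted finish time. This is a sketch under assumed slowdown factors, not PAL's published policy, and all runtimes and device names are hypothetical.

```python
# Hedged sketch of variability-aware placement (not PAL's actual policy):
# each GPU has a slowdown factor (1.0 = nominal) capturing performance
# variability; jobs are greedily placed on the GPU with the earliest
# predicted finish time, so slow outliers stop dragging down ML jobs.
import heapq

def schedule(jobs: list[float], slowdowns: dict[str, float]) -> dict[str, list[float]]:
    """Greedily place jobs (nominal runtimes) to minimize predicted finish time."""
    # Min-heap of (predicted_free_time, gpu); variability inflates runtime.
    heap = [(0.0, gpu) for gpu in sorted(slowdowns)]
    heapq.heapify(heap)
    placement: dict[str, list[float]] = {gpu: [] for gpu in slowdowns}
    for runtime in sorted(jobs, reverse=True):  # place longest jobs first
        free_at, gpu = heapq.heappop(heap)
        placement[gpu].append(runtime)
        heapq.heappush(heap, (free_at + runtime * slowdowns[gpu], gpu))
    return placement

# A GPU with a 1.5x slowdown receives less nominal work than its twin.
print(schedule([4.0, 3.0, 2.0, 1.0], {"gpu0": 1.0, "gpu1": 1.5}))
# → {'gpu0': [4.0, 2.0], 'gpu1': [3.0, 1.0]}
```

Ignoring the slowdown factors here would split the work evenly and leave the slow GPU finishing last, which is exactly the variability penalty such policies aim to remove.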

Sources

Measurement-based Resource Allocation and Control in Data Centers: A Survey

Demystifying the Communication Characteristics for Distributed Transformer Models

Towards a Standardized Representation for Deep Learning Collective Algorithms

Experimentation, deployment and monitoring Machine Learning models: Approaches for applying MLOps

Telepathic Datacenters: Fast RPCs using Shared CXL Memory

Security Evaluation in Software-Defined Networks

PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters

KS+: Predicting Workflow Task Memory Usage Over Time

Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters

Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI

CloudSim 7G: An Integrated Toolkit for Modeling and Simulation of Future Generation Cloud Computing Environments