
Zware-AICloud Intelligent Computing Control and Scheduling Platform


Technology White Paper

1. Background

With the rapid development of artificial intelligence technology, large model computing power infrastructure has become a key pillar in digital transformation, greatly empowering the digital economy.


To support AI large model training and similar workloads, large-scale intelligent computing centers built from clusters of thousands, tens of thousands, or even hundreds of thousands of GPUs are needed to meet the computing power demand. These accelerator cards must work collaboratively to deliver enough computing power to process and update the massive parameters in a model. Facing challenges of ultra-large scale, complex configuration, high performance, and fine granularity, the core innovations center on efficiently managing and utilizing computing power resources to complete large model training tasks with high quality. An AI large model training scheduling platform faces the following pain points:
  1. Ultra-large Scale Computing Power Scheduling: The key to providing large model training services lies in the scheduling of ultra-large scale and diverse heterogeneous computing power.
  2. Extreme Computing Power Utilization Efficiency: Keeping tasks running and improving computing power utilization through technologies such as elastic fault tolerance, resource preemption, and checkpoint resumption.
  3. Automatic Fault Detection and Alerts: Identifying and alerting on software and hardware anomalies through node- and cluster-level monitoring, and assessing task health and resource utilization.
  4. Innovative Service Model Development: Providing an open architecture, diverse computing environments, and refined operation functions to achieve efficient collaboration of computing power, data, algorithms, and models, meeting the needs of multi-level customers.

2. Product Introduction

The Zware-AICloud Intelligent Computing Control and Scheduling Platform is designed and developed for the control and scheduling of AI large model pre-training, ensuring that large model training runs efficiently. It consolidates ultra-large-scale computing power infrastructure with end-to-end intelligent computing capabilities.

The platform adopts end-to-end integrated heterogeneous distributed computing and RDMA communication frameworks, equipped with high-performance task scheduling engines, heterogeneous GPU adaptation, intelligent monitoring, and operation and maintenance control capabilities. The platform includes one-click deployment of components, status observability, network card and switch control, enhanced GPU communication libraries, and hybrid scheduling of heterogeneous devices compatible with various CPUs and GPUs. Additionally, driven by data, the platform achieves comprehensive, integrated, and intelligent operation, scheduling, monitoring, and maintenance of ultra-large-scale AI computing network clusters.

The product architecture diagram is shown below:
[Figure: Zware-AICloud product architecture]

3. Product Features

3.1 Core Features

1. Task Submission and Related Services

The user side provides a convenient visual interface that supports custom configuration when submitting distributed tasks, with built-in support for common computing frameworks such as PyTorch and MPI; it also offers task management, storage management, and image management services.
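For illustration only, the sketch below shows what submitting a distributed PyTorch task through a programmatic interface could look like. The `TaskSpec` structure, its field names, the image path, and the command are hypothetical assumptions for this example, not the platform's actual API.

```python
# Hypothetical task-submission sketch; field names and values are
# illustrative assumptions, not the platform's real interface.
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    name: str
    framework: str                 # e.g. "pytorch" or "mpi"
    image: str                     # container image from image management
    command: str
    replicas: int                  # number of distributed workers
    gpus_per_replica: int
    storage_mounts: list = field(default_factory=list)


spec = TaskSpec(
    name="llm-pretrain-demo",
    framework="pytorch",
    image="registry.example.com/train/llm:latest",
    command="torchrun --nnodes=32 train.py",
    replicas=32,
    gpus_per_replica=8,
    storage_mounts=["/datasets/corpus", "/checkpoints"],
)
print(f"submit {spec.name}: {spec.replicas} x {spec.gpus_per_replica} GPUs")
```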

2. Large-Scale Distributed Scheduling

Equipped with a powerful distributed scheduling engine, the platform supports the scheduling and management of computing power resources at the scale of thousands to tens of thousands of cards. Through scheduling methods such as priority scheduling, reclamation strategies, preemptive scheduling, and fault-tolerant task restarts, it meets the complex needs of different application scenarios, ensuring high-quality, stable task completion and improving computing efficiency.
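A minimal sketch of how priority-based preemptive scheduling can work over a single GPU pool is shown below. The pool size, task names, and the re-queue-on-preemption policy are simplifying assumptions for illustration, not the engine's actual implementation.

```python
# Minimal priority scheduler with preemption over one GPU pool.
# All sizes and policies are illustrative assumptions.
import heapq

TOTAL_GPUS = 10_000
free_gpus = TOTAL_GPUS
running = []   # (priority, name, gpus) for currently running tasks
pending = []   # max-heap of pending tasks via negated priority


def submit(name: str, priority: int, gpus: int) -> None:
    heapq.heappush(pending, (-priority, name, gpus))


def schedule() -> None:
    """Start the highest-priority pending tasks, preempting lower-priority
    running tasks when the pool cannot satisfy the request."""
    global free_gpus
    while pending:
        neg_prio, name, gpus = pending[0]
        # Preempt the lowest-priority running tasks until the request fits.
        while free_gpus < gpus and running and min(running)[0] < -neg_prio:
            victim = min(running)
            running.remove(victim)
            free_gpus += victim[2]
            submit(victim[1], victim[0], victim[2])  # re-queue for restart
        if free_gpus < gpus:
            break  # head task cannot be placed even after preemption
        heapq.heappop(pending)
        running.append((-neg_prio, name, gpus))
        free_gpus -= gpus


submit("pretrain-70b", priority=9, gpus=8192)
submit("eval-job", priority=3, gpus=4096)
schedule()
print(running)  # only the high-priority job fits; eval-job stays pending
```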

3. Heterogeneous and Long-Distance Control Scheduling

It enables unified cluster management of heterogeneous computing power, letting users easily view the real-time usage of resources within the cluster, such as accelerators (GPUs, etc.), CPUs, and memory. It supports long-distance computing power scheduling, enabling cross-data-center large model training. Through an intuitive interface, users gain a comprehensive view of the cluster's load status, which informs resource optimization. Based on the cluster's real-time load, users can flexibly perform management operations such as adding and removing nodes and adjusting queues, keeping resource utilization efficient and dynamically balanced.

4. Automatic Fault Detection and Alerts

It supports linkage with the control and maintenance system, automatically capturing operational status data from network devices and comprehensively monitoring the network's operational status to achieve automatic fault detection. It supports automatic anomaly alarms and automated equipment inspection, helping maintenance personnel quickly discover, locate, and resolve faults.
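The toy loop below illustrates the idea of automated inspection and alerting. The metric names, thresholds, and stub collector are assumptions for demonstration; a real deployment would read device telemetry such as GPU health counters and switch statistics.

```python
# Toy inspection loop; metrics, thresholds, and the stub collector are
# illustrative assumptions, not the platform's real probes.
THRESHOLDS = {
    "gpu_ecc_errors": 0,   # any uncorrected ECC error counts as a fault
    "nic_link_flaps": 3,   # link flaps tolerated per polling window
    "gpu_temp_c": 90,      # degrees Celsius
}


def poll_node(node: str) -> dict:
    # Placeholder: a real collector would query GPU and switch telemetry.
    return {"gpu_ecc_errors": 0, "nic_link_flaps": 5, "gpu_temp_c": 78}


def alert(node: str, metric: str, value) -> None:
    print(f"[ALERT] {node}: {metric}={value} exceeds threshold")


def inspect(nodes: list[str]) -> None:
    for node in nodes:
        metrics = poll_node(node)
        for metric, limit in THRESHOLDS.items():
            if metrics.get(metric, 0) > limit:
                alert(node, metric, metrics[metric])


inspect(["node-001", "node-002"])  # both stub nodes trip the link-flap alarm
```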


4. User Value

Through the Zware-AICloud Intelligent Computing Control and Scheduling Platform, users gain control and scheduling of ultra-large-scale intelligent computing clusters, automatic fault tolerance, and related functions. The platform supports deep learning frameworks such as PyTorch, Megatron, and vLLM, and provides strong support for product development, critical business decisions, and scientific breakthroughs across many industries. It has already been deployed in multiple large-scale intelligent computing clusters, with single-cluster computing power exceeding 2000P.


4.1 Large-Scale Distributed Scheduling

The platform's built-in distributed scheduling engine supports the scheduling and management of computing power resources at the scale of thousands to tens of thousands of cards. Through scheduling methods such as priority scheduling, reclamation strategies, preemptive scheduling, and fault-tolerant task restarts, it meets the complex needs of different application scenarios, ensuring high-quality, stable task completion and improving computing efficiency.

4.2 Multidimensional Heterogeneous Computing Power Scheduling

The platform supports unified scheduling of heterogeneous computing power. It provides a unified primitive interface for the AI collective communication library and performs targeted adaptation to different manufacturers' GPUs inside the library. AI models can therefore be trained on different GPU cards without per-vendor adaptation work, and GPUs from multiple manufacturers can train collaboratively, making full use of existing computing power resources.
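Conceptually, a unified collective primitive can be sketched as a single entry point that dispatches to vendor-specific backends. The registry and backend names below are assumptions used to illustrate the idea, not the platform's actual communication library.

```python
# Conceptual sketch: one all-reduce primitive, many vendor backends.
# Backend names and the registry are illustrative assumptions.
from typing import Callable, Dict

_BACKENDS: Dict[str, Callable] = {}


def register_backend(vendor: str):
    """Register a vendor-specific all-reduce implementation."""
    def wrap(fn: Callable) -> Callable:
        _BACKENDS[vendor] = fn
        return fn
    return wrap


@register_backend("vendor_a")
def _allreduce_a(tensor):
    # Would call vendor A's collective library (its NCCL equivalent).
    return tensor


@register_backend("vendor_b")
def _allreduce_b(tensor):
    # Would call vendor B's collective library.
    return tensor


def all_reduce(tensor, vendor: str):
    """Single primitive the training framework calls; the library
    selects the matching vendor implementation underneath."""
    return _BACKENDS[vendor](tensor)


all_reduce([1.0, 2.0], vendor="vendor_a")  # same call for any GPU vendor
```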
For long-distance scenarios, the platform uses software-defined methods to automatically adapt collaborative training across AI intelligent computing clusters at different distances. For long-distance congestion control, it links congestion-aware PFC (Priority Flow Control) with targeted deceleration of offending flows, effectively improving overall transmission performance over long distances and raising the overall efficiency of large model training.

4.3 Automatic Fault Prediction and Recovery

The platform combines real-time monitoring with predictive maintenance, enabling the system to detect potential problems in advance and take preventive measures, reducing failures and downtime and improving service reliability and stability. Based on historical data and real-time performance indicators, the system predicts and identifies potential fault points and acts before business is interrupted. Once an issue is detected, an automated fault-resolution and business-recovery process activates quickly, reducing system downtime and ensuring business continuity and data integrity.
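As a toy illustration of prediction from historical data, the sketch below flags a reading that deviates sharply from its recent rolling statistics. The window size, deviation threshold, and sample readings are assumptions; production systems would use far richer models and telemetry.

```python
# Toy predictive check: flag readings that deviate strongly from the
# rolling mean of recent history. All parameters are illustrative.
from collections import deque

WINDOW = 8
SIGMA = 3.0


def is_anomalous(history: deque, value: float) -> bool:
    """Flag a reading far outside the recent rolling mean and spread."""
    if len(history) < WINDOW:
        return False
    mean = sum(history) / len(history)
    var = sum((x - mean) ** 2 for x in history) / len(history)
    return abs(value - mean) > SIGMA * max(var ** 0.5, 1e-6)


history: deque = deque(maxlen=WINDOW)
readings = [62, 63, 61, 64, 62, 63, 62, 63, 88]  # last reading spikes
for r in readings:
    if is_anomalous(history, r):
        print(f"predicted fault: reading {r} deviates from trend")
    history.append(r)
```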

4.4 High-Efficiency Congestion Control Capability

The platform automatically tunes DCQCN congestion-control parameters, adopting a distributed architecture that supports dynamic scaling: it automatically collects data from each node and issues parameter configurations back to each node. An improved annealing algorithm meets the requirements of automatic parameter tuning for large-scale networks. The platform also applies load-balancing encoding in congestion control, orchestrating the synchronized data flows of large model training across the entire network so that traffic is balanced among all nodes, network bandwidth is fully utilized, overall data-synchronization performance improves, and large model training becomes more efficient.
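A hedged sketch of parameter search with simulated annealing (read here as the likely meaning of "annealing algorithm") is shown below. The parameter names, ranges, and stand-in objective are assumptions for illustration; real tuning would score candidate configurations against measured network telemetry.

```python
# Simulated-annealing sketch for congestion-control parameter tuning.
# Parameter names and the objective are illustrative assumptions.
import math
import random


def score(params: dict) -> float:
    # Stand-in objective; a real deployment would measure throughput
    # and latency under the candidate configuration.
    return -((params["rate_step"] - 50) ** 2) - 100 * params["dcqcn_g"]


def neighbor(params: dict) -> dict:
    p = dict(params)
    p["rate_step"] = max(1, p["rate_step"] + random.randint(-5, 5))
    p["dcqcn_g"] = min(1.0, max(0.0, p["dcqcn_g"] + random.uniform(-0.02, 0.02)))
    return p


def anneal(params: dict, steps: int = 2000, temp: float = 10.0) -> dict:
    best = params
    for i in range(steps):
        t = temp * (1 - i / steps) + 1e-9   # cooling schedule
        cand = neighbor(params)
        delta = score(cand) - score(params)
        # Accept improvements always; accept regressions with a
        # probability that shrinks as the temperature cools.
        if delta > 0 or random.random() < math.exp(delta / t):
            params = cand
            if score(params) > score(best):
                best = params
    return best


tuned = anneal({"rate_step": 80, "dcqcn_g": 0.25})
print(tuned)  # rate_step drifts toward 50, dcqcn_g toward 0
```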
