Technology White Paper
With the rapid development of artificial intelligence technology, large-model computing infrastructure has become a key pillar of digital transformation and a major enabler of the digital economy.
The user-facing layer provides a convenient visual interface for submitting distributed tasks with custom settings, includes built-in support for common computing frameworks such as PyTorch and MPI, and offers task management, storage management, and image management services.
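For illustration, a distributed PyTorch task submitted through the platform would typically initialize torch.distributed from environment variables injected by the launcher. The sketch below assumes a standard torchrun-style launch and uses only standard PyTorch APIs; it is not tied to any platform-specific interface.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # A torchrun-style launcher injects RANK, WORLD_SIZE, MASTER_ADDR,
    # MASTER_PORT, and LOCAL_RANK into each worker's environment.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    # Wrap the model so gradients are synchronized across all workers.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... training loop over a DistributedSampler-backed DataLoader ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```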
Equipped with a powerful distributed scheduling engine, the platform supports scheduling and management of computing resources at the scale of thousands to tens of thousands of accelerator cards. Scheduling methods such as priority scheduling, resource reclamation, preemptive scheduling, and fault-tolerant task restart address the complex needs of different application scenarios, ensuring stable, high-quality task completion and improving computing efficiency.
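The scheduling behaviors described above can be sketched as follows. This is a minimal, hypothetical model of priority scheduling with preemption, reclamation, and fault-tolerant restart, written for illustration only; it is not the platform's actual engine.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    priority: int                        # lower value = higher priority
    name: str = field(compare=False)
    cards: int = field(compare=False)    # accelerator cards requested
    retries_left: int = field(compare=False, default=2)

class Scheduler:
    def __init__(self, total_cards):
        self.free = total_cards
        self.queue = []                  # priority heap of pending tasks
        self.running = []

    def submit(self, task):
        heapq.heappush(self.queue, task)
        self.dispatch()

    def dispatch(self):
        # Launch pending tasks in priority order while cards remain.
        while self.queue and self.queue[0].cards <= self.free:
            task = heapq.heappop(self.queue)
            self.free -= task.cards
            self.running.append(task)

    def preempt_for(self, task):
        # Reclaim cards from the lowest-priority running tasks until
        # the incoming higher-priority task fits; victims are re-queued.
        for victim in sorted(self.running, reverse=True):
            if task.cards <= self.free:
                break
            if victim.priority > task.priority:
                self.running.remove(victim)
                self.free += victim.cards
                heapq.heappush(self.queue, victim)
        self.submit(task)

    def on_failure(self, task):
        # Fault-tolerant restart: re-queue a failed task while retries remain.
        self.running.remove(task)
        self.free += task.cards
        if task.retries_left > 0:
            task.retries_left -= 1
            self.submit(task)
```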
It enables unified management of heterogeneous computing power across the cluster, letting users view the real-time utilization of resources such as accelerators (GPUs, etc.), CPUs, and memory. It also supports long-distance scheduling of computing power, enabling cross-data-center large-model training. Through an intuitive interface, users gain a comprehensive view of the cluster's load status, which provides data support for resource optimization; based on the cluster's real-time load, they can flexibly add or remove nodes and adjust queues, keeping resource utilization efficient and dynamically balanced.
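A simplified view of the real-time utilization data involved might look like the sketch below. The node-metric fields and thresholds are assumptions made for illustration, not the platform's actual schema.

```python
from dataclasses import dataclass

@dataclass
class NodeMetrics:
    node: str
    gpu_util: float   # 0.0-1.0, averaged over the node's accelerators
    cpu_util: float
    mem_util: float

def cluster_summary(nodes):
    """Aggregate per-node metrics into a cluster-level load view."""
    n = len(nodes)
    return {
        "gpu": sum(m.gpu_util for m in nodes) / n,
        "cpu": sum(m.cpu_util for m in nodes) / n,
        "mem": sum(m.mem_util for m in nodes) / n,
        # Candidates for queue adjustment or node add/remove decisions:
        "hot_nodes": [m.node for m in nodes if m.gpu_util > 0.9],
        "idle_nodes": [m.node for m in nodes if m.gpu_util < 0.1],
    }
```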
It supports linkage with the control and maintenance system, automatically collecting operational status data from network devices and comprehensively monitoring the network's operational status to detect faults automatically. Automatic anomaly alarms and automatic equipment inspection help maintenance personnel quickly discover, locate, and resolve faults.
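As a minimal sketch of threshold-based fault detection on device status data: the status fields and thresholds below are illustrative assumptions, not the monitoring system's real rules.

```python
def check_device(status):
    """Return a list of alarm strings for one network device's status record.

    `status` is an illustrative dict, e.g.:
      {"device": "leaf-01", "link_up": True, "crc_errors": 3, "temp_c": 47.0}
    """
    alarms = []
    if not status["link_up"]:
        alarms.append(f'{status["device"]}: link down')
    if status["crc_errors"] > 100:      # assumed inspection threshold
        alarms.append(f'{status["device"]}: excessive CRC errors')
    if status["temp_c"] > 70.0:         # assumed thermal threshold
        alarms.append(f'{status["device"]}: over-temperature')
    return alarms
```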
Through the Zware-AICloud Intelligent Computing Control and Scheduling Platform, users gain control and scheduling of ultra-large-scale intelligent computing clusters, automatic fault tolerance, and related functions. The platform supports deep learning frameworks and toolchains such as PyTorch, Megatron, and vLLM, and provides strong support for product development, critical business decisions, and scientific breakthroughs across many industries. It has already been deployed in multiple large-scale intelligent computing clusters, with a single cluster's computing power exceeding 2000P.
Support for Unified Scheduling of Heterogeneous Computing Power
The platform provides a unified primitive interface for the AI collective communication library and adapts it, inside the library, to GPU computing power from different manufacturers. AI models can therefore be trained on different GPU cards without vendor-specific adaptation, and GPUs from multiple manufacturers can train collaboratively, making full use of existing computing power resources.
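The idea of a unified primitive interface can be sketched as a thin dispatch layer over vendor-specific collective libraries. The backend names below mirror real PyTorch process-group backends, but the wrapper itself, its vendor table, and its registration mechanism are hypothetical.

```python
import torch.distributed as dist

# Hypothetical mapping from accelerator vendor to the collective backend
# that the unified primitive layer selects behind a single interface.
_VENDOR_BACKENDS = {
    "nvidia": "nccl",   # NVIDIA GPUs via NCCL
    "cpu":    "gloo",   # CPU fallback
    # other vendors' collective libraries would be registered here
}

class UnifiedCollectives:
    """One allreduce primitive; vendor-specific implementation underneath."""

    def __init__(self, vendor):
        # Assumes the process group has not been initialized elsewhere.
        self.backend = _VENDOR_BACKENDS[vendor]
        dist.init_process_group(backend=self.backend, init_method="env://")

    def allreduce(self, tensor):
        # Callers see one primitive regardless of which vendor's
        # collective library actually performs the reduction.
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        return tensor
```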
For long-distance deployments, the platform uses software-defined methods to automatically adapt collaborative training across AI intelligent computing clusters at different distances. For long-distance congestion control, it combines congestion-aware PFC (Priority Flow Control) with targeted per-flow rate reduction, which effectively improves overall transmission performance over long distances and raises the overall efficiency of large-model training.
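As a rough illustration of the per-flow reaction idea (not the platform's actual mechanism), a sender that slows only the implicated flow on a congestion signal might behave as follows; all constants are placeholders.

```python
def adjust_rate(flow, pfc_paused, ecn_marked_fraction,
                min_rate=1.0, decrease=0.5, increase=1.05):
    """Toy congestion reaction: sharply slow a flow implicated by a PFC
    pause, gently slow on ECN marks, otherwise probe the rate upward.

    `flow` is a dict with a 'rate_gbps' key; thresholds are illustrative.
    """
    if pfc_paused:
        # Targeted deceleration: only the offending flow is slowed,
        # rather than pausing the whole priority class at once.
        flow["rate_gbps"] = max(min_rate, flow["rate_gbps"] * decrease)
    elif ecn_marked_fraction > 0.1:
        flow["rate_gbps"] = max(min_rate, flow["rate_gbps"] * 0.9)
    else:
        flow["rate_gbps"] *= increase   # multiplicative probe upward
    return flow["rate_gbps"]
```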
The platform uses real-time monitoring and predictive maintenance so the system can detect potential problems in advance, reducing failures and downtime and improving service reliability and stability. Based on historical data and real-time performance indicators, the system predicts and identifies potential fault points and acts before they cause business interruptions. Once an issue is detected, an automated fault-resolution and business-recovery process activates quickly, reducing system downtime and preserving business continuity and data integrity.
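A minimal sketch of the predict-before-failure idea, using rolling statistics over a single health metric; the window size and sigma threshold are assumptions, and real predictive maintenance would combine many such signals with historical fault data.

```python
from collections import deque

class FailurePredictor:
    """Flag a component when a health metric drifts beyond its recent norm."""

    def __init__(self, window=100, sigmas=3.0):
        self.history = deque(maxlen=window)
        self.sigmas = sigmas

    def observe(self, value):
        """Return True if `value` is anomalous versus the rolling baseline."""
        h = self.history
        anomalous = False
        if len(h) >= 10:                 # wait for a minimal baseline
            mean = sum(h) / len(h)
            var = sum((x - mean) ** 2 for x in h) / len(h)
            std = var ** 0.5
            anomalous = std > 0 and abs(value - mean) > self.sigmas * std
        h.append(value)
        return anomalous
```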
The platform performs automatic parameter tuning for DCQCN congestion control, using a distributed architecture that supports dynamic scaling: it automatically collects data from every node and automatically pushes parameter configurations back to every node. An improved simulated annealing algorithm meets the auto-tuning requirements of large-scale networks. In addition, load-balancing encoding technology in congestion control orchestrates the synchronized data flows of large-model training at the whole-network level, balancing traffic across all nodes, fully utilizing network bandwidth, improving overall data-synchronization performance, and ultimately raising the efficiency of large-model training.
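The role of the annealing-based tuner can be illustrated with a generic simulated-annealing loop over a DCQCN-style parameter vector. The parameter names and the objective function below are placeholders; the platform's improved algorithm is not described in detail here.

```python
import math
import random

def simulated_annealing(params, evaluate, steps=500, t0=1.0, cooling=0.99):
    """Generic simulated annealing over a dict of numeric parameters.

    `evaluate(params)` should return a cost to minimize, e.g. a measured
    flow-completion-time statistic under the candidate DCQCN settings.
    """
    current, best = dict(params), dict(params)
    cur_cost = best_cost = evaluate(current)
    t = t0
    for _ in range(steps):
        candidate = dict(current)
        key = random.choice(list(candidate))
        candidate[key] *= random.uniform(0.8, 1.2)   # local perturbation
        cost = evaluate(candidate)
        # Accept improvements always; accept regressions with a
        # probability that shrinks as the temperature cools.
        if cost < cur_cost or random.random() < math.exp((cur_cost - cost) / t):
            current, cur_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
        t *= cooling
    return best

# Illustrative DCQCN-style knobs (names are placeholders):
# best = simulated_annealing({"kmin": 100.0, "kmax": 400.0, "rate_ai": 5.0},
#                            evaluate=my_cost_function)
```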