Technical White Paper - Zettabyte Holdings, Inc.
The Zware-AIOC intelligent operation and maintenance control platform adopts a heterogeneous distributed computing framework, providing data-driven, comprehensive, and integrated intelligent operations, maintenance, and monitoring for intelligent computing centers.
The Zware-AIOC intelligent operation and maintenance control platform offers a rich feature set and efficient operations in intelligent computing center deployments. Through fine-grained hardware resource management, integrated end-to-end control, equipment status visualization, flexible business deployment, and efficient fault management and business recovery, the platform provides a stable, reliable, and efficient operating environment for intelligent computing centers, supporting AI large model training and other services. The platform has been deployed in production in nearly 10 GPU intelligent computing centers with more than 1,000 cards.
End-to-end integrated control technology can significantly improve the operational efficiency and security of intelligent computing centers. By unifying the management of computing, storage, and network resources, the platform can achieve optimized resource allocation and efficient scheduling, reducing the complexity of maintenance and operations. Additionally, end-to-end integrated technology helps enhance data transmission efficiency and reduce network latency, providing stable and reliable support for AI large model training and other services.
The platform has efficient fault prediction and automated fault recovery capabilities. Based on historical data and real-time performance indicators, the platform can predict and identify potential fault points and take proactive measures to avoid business interruptions. Once an issue is detected, the automated fault resolution and business recovery processes will be quickly initiated, reducing system downtime and ensuring business continuity and data integrity.
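As an illustration of how historical data and a real-time indicator can be combined to flag a potential fault, the following is a minimal sketch only, not the platform's actual prediction method. It assumes a simple z-score test against a historical baseline; the function name `is_anomalous`, the threshold, and the temperature samples are all hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Return True when `latest` deviates from the historical baseline
    by more than `threshold` sample standard deviations."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical GPU temperature history in degrees Celsius (made-up values).
history = [61.0, 62.5, 60.8, 61.7, 62.1, 61.3]

normal_reading = is_anomalous(history, 61.9)   # close to the baseline
suspect_reading = is_anomalous(history, 78.0)  # far above the baseline
```

A production system would likely use richer models and many indicators at once; the point here is only that a deviation from learned history can trigger proactive handling before a hard failure occurs.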
Optimization technologies include AI-DCQCN and Deep-Routing.
AI-DCQCN involves the analysis and configuration of over 30 parameters for RoCE/NCCL, achieving intelligent parameter tuning. With principal component analysis and intelligent optimization algorithms at its core, combined with microsecond-level connection status, it utilizes human-machine collaboration to obtain the most suitable and interpretable configuration for business scenarios.
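To make the role of principal component analysis more concrete, the sketch below estimates the dominant principal component of a small table of sampled configurations using power iteration. It is an illustration under stated assumptions, not the AI-DCQCN implementation: the column names, the sample values, and the function `first_principal_component` are hypothetical, and only three parameters are shown instead of the more than 30 mentioned above.

```python
import math


def first_principal_component(samples, iterations=200):
    """Estimate the dominant principal component of `samples`
    (rows = observations, columns = parameters) by power iteration
    on the sample covariance matrix."""
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Uniform start vector; this sketch assumes it is not orthogonal
    # to the dominant eigenvector.
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # no variance in the data
        vec = [x / norm for x in nxt]
    return vec


# Hypothetical column names and made-up sample values.
columns = ["param_a", "param_b", "param_c"]
samples = [
    [1.0, 2.0, 5.0],
    [2.0, 4.1, 5.0],
    [3.0, 6.0, 5.1],
    [4.0, 7.9, 5.0],
    [5.0, 10.0, 4.9],
]

pc = first_principal_component(samples)
# Rank parameters by the magnitude of their loading on the component.
ranking = sorted(zip(columns, pc), key=lambda item: abs(item[1]), reverse=True)
```

Parameters with large loadings account for most of the observed variation, which is one way to narrow a large parameter space to a few interpretable directions before optimization. In practice the columns would normally be standardized first, because parameters measured in different units would otherwise dominate by scale alone.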
The static ECMP (Equal-Cost Multi-Path) routing of traditional networks is poorly matched to the traffic of large models, resulting in link conflicts, traffic imbalance, and low utilization. Deep-Routing, a deep load balancing technology, enhances GPU topology awareness through end-to-end integration. It partitions both the logical links of the endpoint communication library and the data granularity at switch nodes, achieving multidimensional deep load balancing.
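The imbalance caused by static ECMP can be illustrated with a small simulation. The sketch below is not Deep-Routing itself; it only contrasts hash-based path selection, which ignores flow size and current load, with a simple greedy load-aware placement used here as a stand-in. The function names, addresses, ports, and flow sizes are hypothetical; the destination port 4791 is the UDP port used by RoCEv2.

```python
import zlib


def static_ecmp(flows, num_paths):
    """Assign each flow to a path by hashing its 5-tuple, ignoring flow size."""
    load = [0] * num_paths
    for five_tuple, size in flows:
        key = "|".join(map(str, five_tuple)).encode()
        load[zlib.crc32(key) % num_paths] += size
    return load


def load_aware(flows, num_paths):
    """Greedy stand-in: place the largest flows first, each on the
    currently least-loaded path."""
    load = [0] * num_paths
    for _, size in sorted(flows, key=lambda f: f[1], reverse=True):
        idx = load.index(min(load))
        load[idx] += size
    return load


# Eight hypothetical long-lived flows of equal size, as might arise
# from collective communication between GPU servers.
flows = [
    ((f"10.0.0.{i}", f"10.0.1.{i}", 50000 + i, 4791, "UDP"), 100)
    for i in range(1, 9)
]

hashed = static_ecmp(flows, num_paths=4)
balanced = load_aware(flows, num_paths=4)

print("static ECMP load per path:", hashed)
print("load-aware load per path: ", balanced)
```

With only a handful of large flows, hashing has too little entropy to spread the load evenly, so some paths may carry several flows while others sit idle. The load-aware placement gives every path the same load in this example because it accounts for flow size and existing load.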
In the large-scale intelligent computing center case, the intelligent operation and maintenance control platform delivers the following value for rapidly developing intelligent computing infrastructure: