
Zware-AIOC Intelligent Operation and Maintenance Control Platform


Technical White Paper - Zettabyte Holdings, Inc.

1. Background

With the rapid development of artificial intelligence (AI) technology, especially generative AI, intelligent computing centers, which serve as centralized hubs for computing resources and data processing, are becoming key infrastructure that drives technological innovation and supports digital transformation.

Intelligent computing centers are responsible not only for large-scale data processing and highly complex computational tasks but also for integrating advanced machine learning and deep learning algorithms to provide strong intelligent support for various industries. On October 8, 2023, the Ministry of Industry and Information Technology, together with five other departments, released the "High-Quality Development Action Plan for Computing Power Infrastructure". The plan proposes that new computing power infrastructure integrate information computing power, network carrying capacity, and data storage capacity. Such infrastructure should not only centralize the computation, storage, and transmission of information but also be intelligent, secure, reliable, green, and low-carbon.

To support AI large model training and other services, large-scale intelligent computing centers with GPU clusters of thousands, tens of thousands, or even hundreds of thousands of GPUs are needed to meet the demand for computing power. The control and operation of such centers is moving in a cloud-native and intelligent direction. Addressing challenges such as ultra-large configurations, ultra-fine granularity, ultra-large scale, and ultra-long distances requires key capabilities such as end-to-end collaborative resource management, automated deployment, performance optimization, and fault monitoring, which together overcome the island effect caused by operating computing and networking separately.

Against this background, the control, operation, and maintenance system of intelligent computing centers becomes crucial to ensure the stable, efficient, and secure operation of these centers.

2. Product Introduction

2.1 Product Overview

Zware-AIOC is an intent-based intelligent operation and maintenance control platform, developed through research on network automation and network computing technology systems. It supports one-click deployment, self-test data collection, and operational status monitoring, and it identifies abnormal states of services and equipment for intelligent analysis and dynamic adjustment. This enables comprehensive automated and intelligent control and operations for intelligent computing centers.
The system monitors the operational status, network traffic, network topology, and other relevant information of equipment such as computing nodes, storage nodes, switches, and servers in the intelligent computing center. This allows continuous tracking of key indicators such as DCQCN (Data Center Quantized Congestion Notification) congestion-control behavior, packet error rate, GPU drop rate, and network congestion. The system also notifies maintenance personnel of anomalies by email so that they can respond quickly and limit the ongoing impact of network failures.
The system can be managed and monitored through a web interface or a command-line interface (CLI). The web interface supports network topology discovery and verification, anomaly alerts, automated configuration and configuration verification, super-visualized monitoring, automatic inspection policy settings, error warnings, and application tools. User account management and detailed views of individual network devices are also available through the web interface.
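The indicator tracking and anomaly notification described above can be pictured with a minimal sketch. The indicator names, threshold values, and the `check_indicators` and `format_notification` helpers are all hypothetical and chosen only for illustration; the white paper does not specify how Zware-AIOC evaluates indicators or builds its email messages.

```python
from dataclasses import dataclass

# Hypothetical limits; real values depend on the deployment.
THRESHOLDS = {
    "packet_error_rate": 1e-6,
    "gpu_drop_rate": 0.0,
    "congestion_ratio": 0.2,
}


@dataclass
class Alert:
    device: str
    indicator: str
    value: float
    limit: float


def check_indicators(device, sample, thresholds=THRESHOLDS):
    """Return one Alert per indicator whose sampled value exceeds its limit."""
    alerts = []
    for name, limit in thresholds.items():
        value = sample.get(name)
        if value is not None and value > limit:
            alerts.append(Alert(device, name, value, limit))
    return alerts


def format_notification(alerts):
    """Render alerts as a plain-text body, e.g. for an email to operators."""
    return "\n".join(
        f"{a.device}: {a.indicator}={a.value} exceeds limit {a.limit}"
        for a in alerts
    )
```

A collector would call `check_indicators` for each device sample and pass any resulting alerts to whatever mail mechanism the deployment uses.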

2.2 System Architecture

The system architecture from top to bottom includes the user interface layer, the core logic layer, and the network data access layer. The southbound interface of the core logic layer connects to the data access layer of device nodes such as switches and servers. The northbound interface provides unified standard services and interfaces, supporting both CLI command line and WEB user interface interactions. The overall architecture of the system is shown in the diagram below:
Figure: System Architecture
The Zware-AIOC intelligent control platform consists of two main components: ZwareMaster, the main software of the core logic layer, and ZwareAgent, the data acquisition agent module that runs on switches and servers. In an intelligent computing network, the most common deployment method for the Zware-AIOC platform is to install ZwareMaster on the management server, which connects to the switches and servers through the management network. ZwareAgent resides on nodes such as GPU servers, enabling network-wide information acquisition and monitoring as well as coordinated management of servers.

3. Product Features

3.1 Core Features

The Zware-AIOC intelligent operation and maintenance control platform adopts a heterogeneous distributed computing framework, providing data-driven, comprehensive, and integrated intelligent operations, maintenance, and monitoring for intelligent computing centers.

Figure: Core features

Core Features:

  1. Network Topology Discovery and Verification: Automatically discovers network devices, including switches and servers, and their topology connections, generating a network topology view. It can automatically compare the generated network topology with the planned topology according to specified policies, and provide anomaly alerts for inconsistencies.
  2. Centralized Management of Network Infrastructure: The control and maintenance system covers detailed information and operational status of all network devices, providing one-stop query services to achieve centralized management of network infrastructure.
  3. Automated Configuration: The system offers embedded device configuration templates, automatically adapting to device functions. The generated configurations can be deployed to all devices with one click, ensuring configuration accuracy and reducing manual workload.
  4. Super-Visualized Monitoring: Automatically captures operational status data and dynamic traffic data of network devices, displaying highly visualized network operation views and dynamic traffic views. It provides comprehensive monitoring of network operational status and effectively predicts network traffic trends to support decision-making.
  5. Error Alerts: By setting automatic inspection policies, the control and maintenance system can automatically inspect and diagnose controlled devices, report errors of various severity levels in a timely manner, and promptly recover from faults.
  6. Host-Side Integrated Network Functions: Host servers integrate high-performance communication libraries and embedded intelligent traffic control mechanisms to achieve ultra-lossless and ultra-balanced traffic. The platform also supports RoCE-based high-performance storage on the storage side.
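As an illustration of the topology verification feature in the list above, the following sketch compares a planned topology with a discovered one and reports the differences. The link representation (pairs of `device:port` strings) and the `verify_topology` helper are assumptions made for this example, not the platform's actual data model.

```python
def normalize(links):
    """Treat each link as undirected: (a, b) and (b, a) are the same cable."""
    return {frozenset(link) for link in links}


def verify_topology(planned, discovered):
    """Return links that are missing from, or unexpected in, the discovered topology."""
    planned_set = normalize(planned)
    discovered_set = normalize(discovered)
    return {
        "missing": sorted(tuple(sorted(l)) for l in planned_set - discovered_set),
        "unexpected": sorted(tuple(sorted(l)) for l in discovered_set - planned_set),
    }


# Hypothetical example: one cable is plugged into the wrong server.
planned = [("leaf1:eth1", "gpu1:nic0"), ("leaf1:eth2", "gpu2:nic0")]
discovered = [("gpu1:nic0", "leaf1:eth1"), ("leaf1:eth2", "gpu3:nic0")]
report = verify_topology(planned, discovered)
```

Any non-empty `missing` or `unexpected` entry would correspond to an inconsistency that warrants an anomaly alert.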

4. Application Scenarios

Currently, intelligent computing centers are characterized primarily by GPU clusters. These GPU clusters include in-rack interconnections, data center interconnections, and cross-data center interconnections.
Figure: Application scenario
Large-scale intelligent computing centers are characterized by having over 1,000 GPU cards. The diagram below shows a typical network layout of a large-scale intelligent computing center that supports 2,048 GPU cards, which can be used for AI large model training.
Figure: Maintenance control platform

The intelligent operation and maintenance control platform is applied to intelligent computing centers, connecting to computing power servers/storage servers, network equipment, etc., to achieve automated and intelligent control and maintenance of computing nodes, storage nodes, and networks.

5. User Value

The Zware-AIOC intelligent operation and maintenance control platform offers powerful features and efficient operation in intelligent computing center deployments. Through fine-grained hardware resource management, integrated end-to-end control, equipment status visualization, flexible service deployment, and efficient fault management and service recovery, the platform provides a stable, reliable, and efficient operating environment for intelligent computing centers, supporting AI large model training and other services. The platform has been deployed in nearly 10 GPU intelligent computing centers, each with more than 1,000 cards, where it runs in production to support AI large model training and other services.

5.1 Innovation Points

1. End-to-End Integrated Control Technology

End-to-end integrated control technology can significantly improve the operational efficiency and security of intelligent computing centers. By unifying the management of computing, storage, and network resources, the platform can achieve optimized resource allocation and efficient scheduling, reducing the complexity of maintenance and operations. Additionally, end-to-end integrated technology helps enhance data transmission efficiency and reduce network latency, providing stable and reliable support for AI large model training and other services.

2. Efficient Fault Prediction and Automated Fault Recovery Technology

The platform has efficient fault prediction and automated fault recovery capabilities. Based on historical data and real-time performance indicators, the platform can predict and identify potential fault points and take proactive measures to avoid business interruptions. Once an issue is detected, the automated fault resolution and business recovery processes will be quickly initiated, reducing system downtime and ensuring business continuity and data integrity.
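The white paper does not describe the prediction algorithm itself. As one simple way to flag potential fault points from historical and real-time indicators, the sketch below uses an exponentially weighted moving average (EWMA) with a running variance estimate; this technique is swapped in purely for illustration, and the `ewma_anomalies` helper and its parameters are hypothetical.

```python
def ewma_anomalies(samples, alpha=0.3, tolerance=3.0, warmup=5):
    """Return (index, value) pairs that deviate strongly from the running EWMA.

    A sample is flagged when it lies more than `tolerance` standard
    deviations from the current mean, once `warmup` samples have been seen.
    """
    if not samples:
        return []
    mean = samples[0]
    var = 0.0
    flagged = []
    for i, x in enumerate(samples[1:], start=1):
        deviation = x - mean
        std = var ** 0.5
        if i >= warmup and std > 0 and abs(deviation) > tolerance * std:
            flagged.append((i, x))
        # Update the running mean and variance after the check.
        mean += alpha * deviation
        var = (1 - alpha) * (var + alpha * deviation ** 2)
    return flagged


# Hypothetical link latency readings with one sudden spike.
readings = [10, 11, 10, 9, 10, 11, 10, 50, 10]
suspects = ewma_anomalies(readings)
```

In a real deployment, a flagged sample would be one input to the proactive measures and recovery processes described above, not a decision in itself.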

3. Network Optimization Capability

Optimization technologies include AI-DCQCN and Deep-Routing.
AI-DCQCN analyzes and configures more than 30 RoCE/NCCL parameters to achieve intelligent parameter tuning. With principal component analysis and intelligent optimization algorithms at its core, combined with microsecond-level connection status, it uses human-machine collaboration to obtain the most suitable and interpretable configuration for each business scenario.

Figure: Optimization diagram

The static ECMP (Equal-Cost Multi-Path) routing of traditional networks is poorly matched to large-model workloads, resulting in link conflicts, traffic imbalance, and low utilization. Deep-Routing, a deep load balancing technology, enhances the topology awareness of GPUs through end-to-end integration. It partitions both the logical links of the endpoint communication library and the data granularity at switch nodes, achieving multidimensional deep load balancing.

Figure: Equal-Cost Multi-Path (ECMP)
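The imbalance caused by static ECMP can be illustrated with a small simulation. The sketch below is not Deep-Routing; it only contrasts hash-based path selection, which ignores load, with a simple least-loaded assignment. The flow names, bandwidth figures, and both helper functions are hypothetical.

```python
import zlib


def static_ecmp(flows, n_paths):
    """Pick a path per flow by hashing its identifier, ignoring current load."""
    load = [0.0] * n_paths
    for flow_id, gbps in flows:
        path = zlib.crc32(flow_id.encode()) % n_paths
        load[path] += gbps
    return load


def least_loaded(flows, n_paths):
    """Place each flow on the path that currently carries the least traffic."""
    load = [0.0] * n_paths
    for _flow_id, gbps in flows:
        path = min(range(n_paths), key=load.__getitem__)
        load[path] += gbps
    return load


# Eight equal-sized flows, as in a ring-style collective, over four uplinks.
flows = [(f"gpu{i}->gpu{(i + 1) % 8}", 100.0) for i in range(8)]
hashed = static_ecmp(flows, 4)
balanced = least_loaded(flows, 4)
```

With a small number of large, long-lived flows, the hashed assignment can place several flows on one uplink while another stays idle, whereas the load-aware assignment spreads these equal-sized flows evenly at two per uplink.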

4. Heterogeneous Control Capability

Heterogeneous control is significant for the construction of domestic intelligent computing centers. The platform has already integrated with a range of GPUs and AI accelerators, including NVIDIA series GPUs and domestic products from Huawei, TianShu, and Cambricon. It also interfaces with commercial data center switches from vendors such as Cisco, H3C, and Ruijie, and has integrated with SONiC-based data center switches built on both international and domestic switching chips.

5.2 Application Value

In the case of the large-scale intelligent computing center, the intelligent operation and maintenance control platform demonstrates the following application values for the rapidly developing intelligent computing infrastructure:

