Frequently Asked Questions

Products

zPLATFORM™

What is zPLATFORM?

zPLATFORM is like the cockpit of a high-performance vehicle, with all the buttons and knobs required to operate it. While zWARE monitors and optimizes everything happening under the hood, zPLATFORM is what turns that raw capability into a GUI that is usable, measurable, and monetizable. It transforms complex GPU infrastructure into a business-ready platform by providing a clean, intuitive interface for accessing resources, managing workloads, and running AI services. With built-in user management, metering, and billing, zPLATFORM allows organizations to operate, govern, and monetize GPU resources with precision, so teams can move faster, reduce friction, and extract more value from the same infrastructure.

Why do customers need zPLATFORM?

Managing GPUs and AI workloads across resource groups can quickly become complex and inefficient. zPLATFORM simplifies this by unifying resource management, monitoring, and deployment into a single, intuitive interface. It allows teams to easily procure or access GPU clusters, allocate resources dynamically, and run AI services with full visibility and control. The result: higher utilization, smoother operations, and faster time-to-insight across your AI infrastructure. Overall, sovereigns and enterprises choose Zettabyte because we deliver true neutrality, full ownership, and complete operational independence.

How does zPLATFORM fit into my existing infrastructure?

zPLATFORM typically sits above the Kubernetes layer, acting as the PaaS and business-logic layer. Customers can adopt zPLATFORM in two ways: (1) by integrating with their existing Kubernetes environment or (2) by deploying zPLATFORM bundled with Kubernetes for a turnkey setup. No changes to the customer's underlying GPU hardware are required. If customers do not have an infrastructure stack, our zWARE product can be deployed to bridge the gap; it is suited for on-premises and multisite environments where control, efficiency, and operational reliability are required.

How does zPLATFORM improve usability?

zPLATFORM improves usability by providing guided workflows, job templates, clean dashboards, and simplified access controls, reducing onboarding time and enabling teams to use GPU clusters without deep infrastructure knowledge. zPLATFORM and Zettabyte allow organizations to bring systems online quickly while maintaining control and operational continuity.

How does zPLATFORM monetize my resources better?

Through zPLATFORM’s built-in usage metering, billing integrations, quota tools, and tenant management features, customers can unlock the full value of their resources. Operators can monetize GPU as a service with precise tracking and billing, creating new revenue opportunities, while enterprises gain clear, transparent control to allocate and optimize resources across teams and projects, ensuring maximum efficiency, visibility, and productivity across the organization.

Can enterprises use zPLATFORM without GPU expertise?

Yes. zPLATFORM provides a cloud-like experience: simply log in, select your workload, and run. No specialized GPU team is required, as the platform handles hardware management, orchestration, and performance optimization behind the scenes. zPLATFORM and Zettabyte's full product offerings allow organizations to bring systems online quickly while maintaining control and operational continuity.

Is zPLATFORM suitable for regulated or sovereign environments?

Yes. zPLATFORM is commonly adopted in environments that require (1) complete data sovereignty, (2) on-premises or in-country deployment, and (3) full ownership of the AI stack. Overall, zPLATFORM and Zettabyte are designed for sovereign-grade deployments, where data isolation, auditability, and operational control are mandatory, not optional.

zWARE™

What is zWARE?

zWARE functions like the advanced electrical systems of a high-performance race car, delivering essential real-time data that allows the driver to make the best decisions. Without this visibility, even the most skilled driver cannot perform optimally. In the same way, AIDC operators controlling multi-million-dollar GPU clusters need zWARE to see, respond, and optimize in real time.

Is zWARE an AI scheduler?

zWARE is not merely an AI scheduler; it is a comprehensive, converged control plane for GPU-centric AI infrastructure. It extends well beyond job scheduling to operate as a full AI Digital Command Center, integrating orchestration, observability, and operational control across the entire AI stack. zWARE incorporates ultra-fine-grained telemetry, including adjacent signals spanning compute, networking, power, cooling, and environmental conditions. zWARE also supports multi-cluster federation across heterogeneous GPU domains, performs intelligent workload-to-hardware matching based on real-time system state, and delivers continuous operator alerting and feedback loops. zWARE is designed for sovereign-grade deployments, where data isolation, auditability, and operational reliability are mandatory, not optional.

Who is zWARE designed for?

zWARE is designed for anyone who operates GPU-based AI infrastructure. This includes sovereigns, enterprises, data center operators, telcos, research institutions, and AI-focused companies running training, fine-tuning, or large-scale inference workloads. It is suited for on-premises and multisite environments where control, efficiency, and operational reliability are required.

How do customers adopt zWARE as a GPU owner?

zWARE integrates directly with existing hardware and networking environments, without changing GPU ownership or data control. GPU owners can adopt zWARE in three ways: by deploying it on existing bare-metal or virtualized GPU clusters; by bundling it with new GPU infrastructure through Zettabyte’s OEM partners; or by operating it as a managed service supported by Zettabyte’s AI Network Operations Center. This flexibility allows organizations to bring systems online quickly while maintaining control and operational continuity.

Does zWARE replace Kubernetes?

zWARE includes Zettabyte’s own optimized Kubernetes. It incorporates: (1) an optimized Kubernetes distribution hardened and tuned for GPU-intensive AI workloads; (2) a custom, AI-aware scheduler and orchestrator that operates beyond native Kubernetes abstractions; and (3) GPU-level resource awareness with fine-grained control over allocation, performance states, and operational constraints. Together, these capabilities allow zWARE to operate large-scale, high-density GPU clusters, enabling deterministic workload placement, sustained performance, and reliable operations in environments where standard container orchestration alone is insufficient.

How does zWARE improve utilization?

zWARE increases effective GPU utilization by ensuring that available compute is used consistently and predictably. By reducing idle capacity, avoiding fragmentation, and responding quickly to operational issues, zWARE enables more tokens to be produced from the same infrastructure. Customers typically achieve 30–40% higher effective GPU utilization, allowing them to deliver results faster while lowering the cost per training run or inference job.

How is zWARE different from hyperscaler or open-source stacks?

zWARE is designed for organizations that need to operate AI infrastructure on their own terms. Unlike hyperscaler platforms, it preserves GPU and data ownership while providing full visibility and control across the infrastructure. Compared to open-source stacks, zWARE is production-ready by design, delivering consistent performance, integrated operations, and measurable improvements in GPU utilization and throughput. This allows teams to run AI workloads more efficiently, scale with confidence, and reduce the long-term cost of operating high-density GPU environments. zWARE also allows on-premises deployments to scale and obtain additional capacity or resources through the zSUITE ecosystem.

zFABRIC™

What is zFABRIC?

zFABRIC is a high-performance RDMA networking solution purpose-built for AI and GPU clusters. zFABRIC is like using high-quality, non-dealer performance parts in your race car, enabling AI clusters to scale efficiently across racks and data centers without relying on closed or vendor-specific networking. zFABRIC delivers the performance required for distributed AI training while giving operators flexibility in hardware sourcing, avoiding vendor lock-in and allowing faster deployments and lower long-term operating costs.

How does zFABRIC improve total cost of ownership (TCO)?

zFABRIC lowers CAPEX and OPEX for our customers by enabling mixed hardware generations, supporting multiple network vendors, and reducing downtime through automated recovery. Customers who deploy zFABRIC avoid vendor lock-in, extend hardware lifespan, and reduce operational overhead, significantly improving TCO.

How does zFABRIC improve reliability and uptime?

zFABRIC is designed to keep AI systems productive even when underlying components fail. Through automated failover, continuous link health monitoring, intelligent rerouting, and rapid recovery, zFABRIC minimizes disruption to training and inference workloads. This reduces GPU hang time, protects delivery timelines, and allows operators to meet SLA expectations while minimizing manual intervention, resulting in more predictable operations, fewer costly interruptions, and a lower mean time to recovery (MTTR). Overall, zFABRIC and Zettabyte's full product offerings allow organizations to bring systems online quickly while maintaining control and operational continuity.

Is zFABRIC limited to NVIDIA GPUs?

No, zFABRIC is vendor agnostic and supports heterogeneous GPU and accelerator environments based on open RDMA standards such as RoCEv2. This allows organizations to deploy and operate AI infrastructure using NVIDIA, AMD, or other accelerators without being locked into a single vendor ecosystem. As a result, customers can source hardware more flexibly, extend the usable life of existing assets, adapt faster to supply or pricing changes, and reduce long-term infrastructure costs while maintaining consistent performance at scale.

Which networking protocol does zFABRIC use and why?

zFABRIC primarily uses RoCEv2 (RDMA over Converged Ethernet) to deliver high-performance GPU networking on standard Ethernet infrastructure. This enables near-InfiniBand performance while using widely available switches, optics, and cabling. As a result, customers can deploy AI clusters more quickly, scale across vendors and sites with less friction, and achieve high performance without the cost and constraints of proprietary networking stacks.

How many GPUs does zFABRIC support?

zFABRIC is designed to scale from thousands to hundreds of thousands of GPUs within a single AI environment. Scaling limits are determined by physical factors such as optics speed, switch capacity, and data center power and cooling, not by the zFABRIC software itself. This allows organizations to start at practical cluster sizes and expand over time without redesigning the network, thus reducing deployment delays, protecting existing investments, and avoiding premature infrastructure replacement.

Can zFABRIC support cross data center AI clusters?

Yes, zFABRIC enables AI training and inference to run across geographically distributed data centers, allowing organizations to scale beyond a single site without redesigning their network. This makes it possible to bring capacity online faster, use existing facilities more effectively, and avoid costly overbuild in one location. By supporting long distance interconnection with production ready designs, zFABRIC allows teams to operate distributed AI systems reliably while improving utilization and lowering the total cost of scaling AI infrastructure.

zCLOUD™

What is zCLOUD?

zCLOUD is Zettabyte’s on-demand GPU cloud service built on the full Zettabyte software stack and deployed across GPU infrastructure worldwide. It provides immediate access to high-performance GPUs through on-demand and reserved capacity, allowing customers to start workloads quickly, scale predictably, and achieve high performance without the time and cost of building or overprovisioning their own infrastructure.

Why should customers choose zCLOUD?

zCLOUD allows customers to access high-performance GPU capacity without the delays, commitments, or overhead of traditional cloud models. It provides immediate availability for AI workloads while maintaining predictable performance and transparent cost structures. In addition, organizations running the zSUITE stack can use zCLOUD to monetize excess GPU capacity, improving infrastructure utilization and offsetting operating costs. This makes zCLOUD both a faster way to deploy AI workloads and a more efficient way to manage GPU infrastructure.

What level of reliability does zCLOUD provide?

zCLOUD is operated and managed by Zettabyte across sovereign-grade AI data centers built for high availability. Its architecture is designed to deliver consistent performance, predictable uptime, and clear service-level guarantees. For customers, this means fewer disruptions, faster time to usable compute, and reduced operational burden, allowing teams to focus on delivering results rather than managing infrastructure risk or downtime.

Does zCLOUD support heterogeneity?

Yes, zCLOUD is designed to operate across heterogeneous GPU environments and multiple data center locations. This allows organizations to use available hardware efficiently rather than waiting for a single GPU type or vendor. For leadership teams, this means faster access to compute, lower capital and procurement risk, and the ability to scale AI programs without being constrained by supply cycles or vendor lock-in. As hardware evolves, workloads can move seamlessly across environments.

How can I list my excess compute and/or available GPUs on zCLOUD?

Organizations using Zettabyte’s zSUITE can opt in to list available GPU capacity on zCLOUD with minimal additional integration. This allows idle infrastructure to be monetized quickly while remaining under the owner’s control. For organizations not yet on zSUITE, Zettabyte can support onboarding and integration to bring existing hardware onto the platform. The result is faster time to revenue, higher asset utilization, and improved return on existing infrastructure investments.

How many GPUs does zCLOUD currently manage?

zCLOUD manages more than 5,000 GPUs actively committed to the platform, providing customers with immediate access to production-ready capacity. Availability is visible at sign-up, allowing teams to move quickly without long procurement cycles. When specific configurations are not immediately available, customers can reserve capacity or join the queue, ensuring access as resources come online. For larger or time-sensitive requirements, dedicated clusters and expedited provisioning options are available.

Who is zCLOUD designed for?

zCLOUD is built for teams that need real GPU performance without enterprise cloud pricing. Primary customers include AI startups and small teams that need stable, affordable infrastructure to ship quickly; research labs and academic programs running experiments, coursework, and publications on limited budgets; and independent ML engineers and open-source contributors who value reliable, cost-effective compute.

GPU Compute Infra

Liquid Cooling

Why is direct liquid cooling needed for next-generation high-performance GPUs and servers?

In today’s AI world, GPUs are becoming liquid cooled. Current high-end GPUs like the GB300 and the upcoming Rubin generation are designed to be fully liquid cooled. Liquid cooling will be the standard because these GPUs draw significantly more energy, which generates high heat density. Zettabyte is an expert in the most effective ways to cool next-generation AI systems through Direct Liquid Cooling. We offer complete system deployment, from design through testing and monitoring. Regardless of the cooling challenge, Zettabyte provides expert guidance to develop the right solution for our customers based on their specific data center needs.

What type of liquid cooling options are available?

Currently, there are two primary types of liquid cooling solutions being deployed.

  • Direct to Chip Cooling (Direct Liquid Cooling): This technique involves taking coolant directly to the source by attaching cold plates to heat-generating chips. These cold plates connect to coolant-filled tubes that extract the heat from the chips.
  • Liquid to Rack Cooling: Coils containing chilled water are placed in the rear of the cabinet. These coils connect to the chilled water loop that runs through the facility, absorbing heat and transporting it out of the cabinet.

What considerations should I take into account prior to implementing liquid cooling?

When implementing liquid cooling, several considerations must be taken into account to ensure a proper rollout:

  • Pump and flow rate: maintain consistent circulation across variable workloads without frequent maintenance.
  • Heat exchanger capacity: ensure the exchanger can handle the thermal load of the system.
  • Leak detection and prevention: ensure immediate detection of potential problems through advanced sensors and high-quality seals and fittings.

Zettabyte, with its proven track record of successful deployments, helps customers design, install, and maintain liquid cooling systems tailored to their needs, de-risking this critical deployment while also providing neutral sourcing and unmatched procurement speed.

What are the benefits of deploying liquid cooling in data centers?

There are several benefits to deploying a liquid cooling solution, including i) noise reduction compared to air-cooled solutions, ii) a smaller system footprint due to the removal of fans, iii) increased reliability and performance, and iv) most importantly, the ability to support high-density computing, which air cooling cannot match.

How does Zettabyte help customers choose the correct cooling system?

Zettabyte’s strategic partners and investors produce more than 60% of the world’s data center equipment. With this forward-looking knowledge, Zettabyte advises customers on the best cooling system for them, considering both the current thermal needs of the existing equipment deployed and the higher cooling requirements needed for future growth and HPC workloads.

Thermal Systems

What coolant temperature differential minimizes pump energy while maximizing heat transfer for liquid cooling loops?

A coolant temperature rise between 7.5°C and 12°C is recommended. A common design target is a 10°C differential, which balances heat exchanger efficiency and pump workload, optimizing for a flow rate-to-heat dissipation ratio of approximately 1.5 LPM/kW.
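
For a rough sense of where that ratio comes from, the minimal Python sketch below applies the standard sensible-heat relation (heat load = mass flow × specific heat × temperature rise) with water-like coolant properties; the rack size, specific heat, and density values are illustrative assumptions, and actual figures depend on the coolant blend and loop design.

    # Rough flow-rate estimate for a direct liquid cooling loop.
    # Assumes water-like coolant properties; real coolants (e.g., PG25 mixes) differ slightly.

    def required_flow_lpm(heat_load_kw: float, delta_t_c: float = 10.0,
                          cp_j_per_kg_k: float = 4186.0, density_kg_per_l: float = 1.0) -> float:
        """Volumetric flow (L/min) needed to absorb heat_load_kw at a given coolant temperature rise."""
        mass_flow_kg_s = (heat_load_kw * 1000.0) / (cp_j_per_kg_k * delta_t_c)
        return mass_flow_kg_s / density_kg_per_l * 60.0

    # A 120 kW rack at a 10°C rise needs roughly 172 L/min, i.e., about 1.4-1.5 LPM per kW.
    print(round(required_flow_lpm(120.0), 1))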

How many backup Coolant Distribution Units (CDUs) per rack row prevent single-point cooling failures?

An N+1 redundancy model is the standard. This means for 'N' required CDUs to handle the operational load, one additional backup CDU is installed per row. This ensures seamless failover without service interruption if a primary unit fails.

What floor drain capacity handles worst-case coolant leak from a fully loaded liquid cooled rack?

Floor drains should be sized based on the maximum potential flow rate of a leak. For high-density racks, this means a capacity of at least 200-400 L/min (approx. 50-100 GPM) to rapidly remove coolant and prevent pooling during a major line breach.

How do you pressure-test liquid cooling manifolds before connecting $2M+ GPU servers?

Manifolds should be isolated and pressure-tested on-site at 1.5 times the maximum allowable operating pressure (MAOP) using an inert gas such as dry nitrogen. The pressure should be held for a minimum of 30-60 minutes to check for any drops, which would indicate a leak. Zettabyte’s TITAN, a next-generation AI Data Center (AIDC) initiative designed to transform the availability, scalability, and efficiency of GPU-based computing worldwide, embeds these controls into its data hall designs and acceptance processes.

What's the maximum acceptable vibration level for coolant pumps near sensitive GPU equipment?

The maximum acceptable vibration level for pumps and adjacent piping should not exceed 0.3 g RMS (root mean square). Vibration monitoring should be implemented to detect and address any excursion beyond this threshold to prevent long-term fatigue on fittings.

How close can you route liquid cooling lines to high-voltage electrical without EMI issues?

A minimum separation of 50 mm (approximately 2 inches) should be maintained between liquid cooling lines and high-voltage power cables. If this is not possible, EMI shielding (e.g., metallic conduit or raceways) must be used to prevent electromagnetic interference.

What coolant flow rate indicates impending pump failure before GPU thermal throttling occurs?

A sustained drop in flow rate of 10-15% below the established baseline for the given heat load is a critical early indicator of pump degradation or a blockage. This should trigger an alert for investigation well before temperature alarms are activated.
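
A minimal sketch of how that rule might be expressed as a monitoring check; the baseline, readings, and threshold values below are illustrative assumptions, and a production system would correlate many more signals before raising an alert.

    # Flag sustained flow-rate degradation before thermal alarms fire.
    # Baseline and readings are illustrative; thresholds should be tuned per cooling loop.

    def flow_alert(baseline_lpm: float, readings_lpm: list, warn_fraction: float = 0.10,
                   crit_fraction: float = 0.15) -> str:
        """Return 'ok', 'warn', or 'critical' based on sustained deviation from baseline flow."""
        avg = sum(readings_lpm) / len(readings_lpm)
        drop = (baseline_lpm - avg) / baseline_lpm
        if drop >= crit_fraction:
            return "critical"   # 15%+ sustained drop: likely pump degradation or blockage
        if drop >= warn_fraction:
            return "warn"       # 10%+ drop: open an investigation ticket
        return "ok"

    print(flow_alert(170.0, [152.0, 151.5, 150.8]))  # ~11% sustained drop -> 'warn'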

How do you maintain 24/7 cooling during planned maintenance on primary cooling infrastructure?

Temporary or portable cooling units can be used, or a bypass loop with quick-disconnect fittings can be implemented. This allows maintenance teams to isolate a section of the primary cooling system while maintaining coolant flow to the critical IT load, ensuring zero downtime.

What's the fastest way to isolate a leaking cooling line without shutting down adjacent racks?

The fastest method is to use quarter-turn shut-off valves installed at the inlet and outlet of each rack's manifold. This allows technicians to isolate a single rack from the main cooling loop in under a minute without affecting adjacent equipment. Zettabyte’s experience delivering high-density, liquid-cooled AI infrastructure helps customers reduce risk by ensuring cooling systems are designed for routine maintenance and component replacement, allowing issues to be resolved quickly without broader impact to cluster operations.

How will the liquid cooling system handle a failure scenario such as a pump or coolant distribution unit (CDU) outage at full 4MW load?

An N+1 or 2N redundant pump configuration within the CDU is essential. Upon a primary pump failure, a backup pump automatically starts, maintaining coolant flow and pressure with no interruption to cooling delivery.

What measures are in place to detect and contain coolant leaks rapidly in liquid-cooled racks?

Leak detection ropes or cables are placed at the bottom of each rack and along coolant piping paths. They are connected to a monitoring system that triggers an alarm and can automatically shut off coolant flow to the affected zone. Zettabyte, with its proven track record of successful deployments, helps customers design, install, and maintain liquid cooling systems tailored to their needs, de-risking this critical deployment.

How is coolant water quality maintained to prevent corrosion, bio-growth, or scaling in cooling loops?

Coolant quality is maintained through continuous water treatment, including particle filters, UV sterilizers, and the use of pre-mixed, industry-approved coolants with specific anti-corrosion and biocide additives. Regular testing is required to verify coolant chemistry and detect degradation over time.

What coolant supply pressure maintains leak-tight operation while ensuring adequate flow through liquid cooled cold plates?

The optimal supply pressure is typically between 1.5 and 2.5 bar (approx. 22-36 PSI). This range is high enough to ensure adequate flow through the complex microchannels of the cold plates but low enough to minimize long-term stress on seals and fittings.

Compute & Rack Systems

What minimum aisle width allows two technicians to simultaneously service adjacent liquid cooled racks?

The minimum recommended width for a cold aisle is 1.2 meters (approx. 4 feet). For high-density racks requiring frequent access or simultaneous servicing, a width of 1.5 to 1.8 meters (5-6 feet) is preferable to ensure safety and efficiency.

How much floor load rating is needed for 8x liquid cooled servers plus liquid cooling equipment per rack?

The floor must be rated for a minimum of 2,500 kg (approx. 5,500 lbs) per rack footprint. A structural engineer must certify the point load capacity of the raised floor pedestals and tiles to prevent structural failure.

What rack spacing prevents thermal interference between neighboring high-density GPU clusters?

A minimum of 1.2 meters (4 feet) of separation between rows is recommended. In hot aisle/cold aisle containment setups, this ensures that exhaust heat from one row does not recirculate into the cold air intake of an adjacent row.

What lifting equipment safely moves 800+ pound liquid cooled servers into 42U racks?

A data center-rated server lift with a minimum capacity of 500 kg (1,100 lbs) is required. The lift provides a stable platform and precise height adjustment to safely install or remove heavy servers without risking injury or equipment damage.

How is rack installation typically sequenced to preserve construction access while protecting installed equipment?

Installation should follow a "center-out" or "end-to-end" sequence, establishing main access aisles first. Protective coverings should be used on installed racks, and clear, temporary pathways must be maintained for moving equipment and personnel.

How is a failed liquid cooled server replaced without disrupting liquid cooling to other units?

Replacement is performed using drip-free quick-disconnect couplings. The typical approach is to: 1) Isolate the server electrically. 2) Close shut-off valves on the manifold for that server's loop. 3) Disconnect the quick-disconnect couplings. 4) Un-rack the failed server and replace it. 5) Reconnect and open valves. Zettabyte’s experience delivering high-density, liquid-cooled AI infrastructure helps customers reduce risk by ensuring cooling systems are designed for routine maintenance and component replacement, allowing issues to be resolved quickly without broader impact to cluster operations.

What's the maximum server extraction force before rack structural damage occurs?

The maximum extraction force should not exceed 500 Newtons (approx. 112 lbf). If a server is stuck, technicians should check for obstructions or binding in the rails rather than applying excessive force that could damage the rack frame.

How many spare rack positions are reserved for emergency server relocations during repairs?

A best practice is to reserve 5-10% of rack space as "hot spares". This provides empty, pre-cabled, and pre-plumbed slots to quickly relocate a workload or replace a failed server chassis without waiting for repairs.

Can the floor and power distribution safely support a rack with 250kW load and 800+ kg weight during seismic events?

Yes, when properly designed and installed, the floor system and power distribution can safely support a rack with a 250 kW load and an installed weight exceeding 800 kg during seismic events. The entire installation, including the floor, rack anchoring, and overhead power/cooling distribution, must be certified by a structural engineer to withstand the specified seismic load (e.g., 0.3g) for the total weight of the fully loaded rack. Zettabyte, with its battle-tested approach, builds future-proof, supply-chain-neutral AI data centers for sovereigns and enterprises around the world.

What’s the standard procedure for replacing a failed liquid-cooled server with minimal downtime or coolant loss?

Replacement of a failed liquid-cooled server is typically performed by a trained two-person team following a documented procedure. The process includes 1) verification of correct server, 2) closing the isolation valves serving the affected cooling loop, 3) disconnecting the quick-disconnect couplings in a controlled manner, 4) using absorbent pads to manage any residual coolant drips, 5) removing and replacing the server, 6) reconnecting the cooling interfaces securely, and 7) gradually reopening the isolation valves to restore coolant flow.

What is the preventive maintenance schedule for liquid-cooling quick-disconnects and seals to ensure <0.1% leak rate annually?

Seals and O-rings on quick-disconnects should be inspected visually during every server maintenance event and undergo a full replacement every 2-3 years or as specified by the manufacturer to prevent material degradation and ensure a reliable seal.

Networking & Fabric

What cable management capacity handles 400G connections for 2000+ GPUs plus 50% growth?

Cable trays and pathways should be designed with a 50% "day one" fill rate. This provides capacity for future growth and ensures adequate space to maintain proper cable bend radius and airflow.

How many network switch tiers are needed to minimize latency for distributed AI training across 64+ nodes?

A single-tier (leaf-only) or two-tier (leaf-spine) architecture is optimal. For up to ~64 nodes, a single tier of high-radix switches can provide all-to-all connectivity. Beyond that, a two-tier design minimizes hops and maintains low latency. In Zettabyte deployments, network topology decisions are guided by workload scale, communication patterns, and growth expectations. zFABRIC, Zettabyte’s high-performance RDMA networking solution, implements these architectures using vendor-agnostic fabrics that maintain low latency as clusters expand, allowing teams to scale training environments without re-architecting the network or introducing unnecessary latency overhead.
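
As a back-of-the-envelope illustration of why a two-tier leaf-spine design covers clusters well beyond 64 nodes, the sketch below computes non-blocking host capacity from switch radix; the radix values are examples, and real designs also weigh port speeds, rail count, and oversubscription targets.

    # Host capacity of a non-blocking two-tier leaf-spine fabric.
    # With switch radix R and a 1:1 uplink/downlink split per leaf,
    # capacity is (R/2) server ports per leaf * R leaves = R^2 / 2 hosts.

    def leaf_spine_capacity(radix: int) -> int:
        downlinks_per_leaf = radix // 2   # half the leaf ports face servers, half face spines
        max_leaves = radix                # each spine switch has `radix` ports, one per leaf
        return downlinks_per_leaf * max_leaves

    print(leaf_spine_capacity(64))   # 64-port switches  -> 2,048 hosts
    print(leaf_spine_capacity(128))  # 128-port switches -> 8,192 hosts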

What is the optimal InfiniBand spine-leaf ratio for AI workloads requiring all-to-all communication?

A 1:1 or 1:2 non-blocking or low-oversubscription ratio is optimal for AI/ML. For InfiniBand, this means the number of uplink ports from leaf switches to spine switches should be equal to or half the number of downlinks to servers.

How can 100+ high-speed cables per rack be routed without exceeding bend radius limits?

High cable densities are managed through the use of horizontal and vertical cable managers with rounded edges. Cable routing is designed to maintain a minimum bend radius of four to eight times the cable’s outer diameter, with planned pathways that avoid sharp turns, compression, or pinching along the route.

What cable labeling approach reduces the risk of misconnection during compressed network deployment windows (for example, 48-hour installs)?

Misconnection risk is reduced through the use of unique, machine readable labels such as QR codes or barcodes, applied at both ends of every cable. Each label encodes source and destination information (for example, rack, server, and port) and is linked to a centralized reference database to support verification during installation and troubleshooting.
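
A minimal sketch of the kind of machine-readable payload such a label could encode; the field names, identifiers, and format are hypothetical, and any consistent schema tied to the central cable database works equally well.

    # Generate a machine-readable label payload for both ends of a cable.
    # The schema (fields and compact JSON encoding) is illustrative, not a standard.
    import json

    def cable_label(cable_id: str, src: dict, dst: dict) -> str:
        payload = {
            "cable": cable_id,
            "src": f"{src['rack']}/{src['device']}/{src['port']}",
            "dst": f"{dst['rack']}/{dst['device']}/{dst['port']}",
        }
        return json.dumps(payload, separators=(",", ":"))  # compact string to encode as a QR code

    print(cable_label(
        "CBL-000123",
        {"rack": "R12", "device": "leaf-03", "port": "33"},
        {"rack": "R14", "device": "gpu-node-117", "port": "eth2"},
    ))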

How is cable integrity validated before connection to high-value networking equipment?

Cable integrity is validated by testing and certifying every high speed cable, particularly DACs and AOCs, using a network cable analyzer (for example, a Fluke DSX CableAnalyzer) prior to installation. Testing verifies key performance metrics such as insertion loss and return loss to confirm the cable meets required specifications before it is connected to production equipment.

How is a failed cable identified quickly within a large-scale InfiniBand fabric (for example, a 2,000-port deployment)?

In large scale InfiniBand environments, failed cables are identified using network management software with port-mapping capabilities that correlates switch port status to the physical cable infrastructure database. When a link fails, the system instantly identifies the exact cable and its physical location. In Zettabyte deployments, this correlation is implemented through fabric-aware tooling that maintains alignment between switch ports, cabling layouts, and rack topology. By preserving this mapping as part of day-one design and ongoing operations, failures can be isolated rapidly even at multi-thousand-port scale, reducing mean time to repair and minimizing disruption to distributed training workloads.
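
A simplified sketch of that port-to-cable correlation; the switch names, port numbers, and inventory records below are illustrative placeholders, and real deployments keep this mapping in a fabric manager or CMDB rather than in code.

    # Map a failed switch port to the physical cable and its location.
    # The inventory dictionary is a stand-in for a fabric management database.

    CABLE_DB = {
        ("ib-spine-02", 17): {"cable": "CBL-004412", "tray": "T-B07", "far_end": "leaf-11 port 5"},
        ("leaf-11", 5):      {"cable": "CBL-004412", "tray": "T-B07", "far_end": "ib-spine-02 port 17"},
    }

    def locate_failed_link(switch: str, port: int) -> str:
        entry = CABLE_DB.get((switch, port))
        if entry is None:
            return f"{switch} port {port}: no cable record found"
        return (f"{switch} port {port}: cable {entry['cable']} in tray {entry['tray']}, "
                f"far end {entry['far_end']}")

    print(locate_failed_link("ib-spine-02", 17))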

How is rolling network maintenance performed without breaking AI training jobs?

Rolling maintenance is enabled through the use of redundant network fabrics, such as dual-rail architectures. Maintenance activities are carried out on one fabric while traffic is failed over to the alternate fabric. This approach relies on network protocols and workload schedulers that support graceful failover to maintain job continuity during maintenance operations.

What cable slack is maintained to support emergency server repositioning during failure scenarios?

A service loop of approximately 0.5 to 1 meter is maintained for all cables connected to servers. This provides sufficient slack to slide a server out for servicing or reposition it into a spare slot without requiring cable disconnection or re-routing.

How is network fabric designed to ensure fault tolerance and minimize downtime in clusters exceeding 2,000 GPUs?

Fault tolerance is achieved through a dual-fabric (or “dual-rail”) network architecture. Each server is connected to two fully independent network fabrics, allowing traffic to automatically reroute through the alternate fabric if a component in one fabric fails. In Zettabyte deployments, this architecture is implemented and operated using zFABRIC, which is designed to keep AI systems productive even when underlying components fail. Through automated failover, continuous link health monitoring, intelligent rerouting, and rapid recovery, zFABRIC minimizes disruption to training and inference workloads. This reduces GPU hang time, protects delivery timelines, and allows operators to meet SLA expectations while minimizing manual intervention, resulting in more predictable operations and fewer costly interruptions.

What telemetry is used to monitor congestion and error conditions in high-performance GPU interconnects in real time?

Traditional network monitoring tools such as SNMP provide useful baseline health and status information but lack the real-time granularity required to detect transient congestion and error conditions in high-performance GPU interconnects. In Zettabyte deployments, in-band network telemetry (INT) and hardware-level counters from the switches and NICs are used. Zettabyte also incorporates ultra-fine-grained adjacent telemetry spanning compute, networking, power, cooling, and environmental signals. This provides real-time, granular data on packet buffers, latency, and congestion points, which is fed into a central monitoring platform, giving sovereigns and enterprises full ownership and complete operational independence.

Power Delivery & Electrical Systems

What circuit breaker sizing and characteristics are used to avoid nuisance trips during liquid cooled server power-on inrush?

Circuit breakers with high inrush tolerance—such as Type C or Type D—are used and sized at approximately 125% of the server’s maximum continuous load. The high-inrush characteristic accommodates brief startup current surges during power-on, while the breaker sizing continues to provide protection against sustained overcurrent conditions.
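
A quick arithmetic sketch of that sizing rule; the load figure and the list of standard breaker ratings are illustrative, and actual selection must follow local electrical codes and the manufacturer's trip curves.

    # Size a breaker at ~125% of maximum continuous load, then round up
    # to the next standard rating. The rating list below is an example set.

    STANDARD_BREAKER_A = [20, 25, 32, 40, 50, 63, 80, 100, 125]

    def breaker_size(continuous_load_a: float, margin: float = 1.25) -> int:
        required = continuous_load_a * margin
        for rating in STANDARD_BREAKER_A:
            if rating >= required:
                return rating
        raise ValueError("load exceeds the largest rating in this example list")

    # A server drawing 25 A continuous needs at least 31.25 A -> a 32 A Type C/D breaker.
    print(breaker_size(25.0))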

How many independent power feeds are used per rack to avoid single-point power failures?

A minimum of two (A/B) independent power feeds from separate Power Distribution Units (PDUs) is required for each rack. Each feed should connect to one of the redundant Power Supply Units (PSUs) in each server, allowing continued operation if a single feed or distribution path is lost.

What UPS runtime keeps a 4MW load running during utility transfer to backup generators?

A UPS runtime of approximately 5–10 minutes at full load is typically provisioned. This interval provides sufficient buffer time for backup generators to start, stabilize, and assume the load, while accommodating potential start-up or synchronization delays during the transfer process.
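
For a rough sense of what that runtime implies in stored energy, the sketch below simply multiplies load by runtime; it deliberately ignores inverter efficiency, battery aging, and end-of-life derating, all of which add real-world margin on top of this figure.

    # Usable battery energy needed to hold a given load for a target runtime.
    # Losses and derating are ignored here for clarity.

    def ups_energy_kwh(load_kw: float, runtime_min: float) -> float:
        return load_kw * runtime_min / 60.0

    # Holding a 4 MW load for 10 minutes requires on the order of 667 kWh of usable energy.
    print(round(ups_energy_kwh(4000.0, 10.0)))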

How is power quality verified before connecting sensitive AI servers?

Power quality is verified by conducting a power quality audit using a power quality analyzer prior to energizing the servers. The audit measures voltage, current, harmonic distortion (THD), and transient conditions over a defined period to confirm that the supplied power is clean and stable.

What electrical testing is used to verify the safety of 480V distribution before rack energization?

Safety verification is performed through on-site testing that includes insulation resistance (megger) testing, continuity checks, and torque verification of all electrical connections prior to energizing the panel. These tests confirm the absence of faults and ensure that all connections are properly secured.

How much electrical panel space is reserved to support emergency circuit additions?

A minimum of 20% spare breaker capacity should be reserved within electrical panels. This margin allows for future expansion or emergency circuit additions without requiring major shutdowns or panel replacement.

What level of power monitoring granularity enables detection of failing servers before circuit breaker trips occur?

Outlet-level power monitoring on the rack PDUs is required. In Zettabyte deployments, zWARE ingests this real-time power data and correlates it with adjacent telemetry across compute, networking, cooling, and environmental signals. zWARE’s fine-grained visibility allows emerging faults to be identified early, enabling operators to intervene before issues escalate. Overall, zWARE and Zettabyte are designed for sovereign-grade deployments, where data isolation, auditability, and operational control are mandatory, not optional.
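
A minimal sketch of the kind of outlet-level check this granularity enables; the thresholds, readings, and field names are illustrative assumptions, and zWARE's actual correlation logic spans many more signals than a single power reading.

    # Detect a server drifting toward its breaker limit or deviating from its own baseline.
    # Values are illustrative; real policies tune thresholds per server SKU and workload profile.

    def outlet_check(watts_now: float, baseline_w: float, breaker_limit_w: float) -> list:
        alerts = []
        if watts_now > 0.9 * breaker_limit_w:
            alerts.append("approaching breaker limit")               # act before a trip occurs
        if baseline_w and abs(watts_now - baseline_w) / baseline_w > 0.20:
            alerts.append("power draw deviates >20% from baseline")  # possible PSU, fan, or VR fault
        return alerts

    print(outlet_check(watts_now=6350.0, baseline_w=5100.0, breaker_limit_w=6900.0))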

How is a 4 MW load balanced across multiple utility feeds during peak training periods?

For large loads during peak training periods, it is important to balance the power draw through the use of automated transfer switches (ATS) and intelligent PDUs that monitor power conditions across phases and utility feeds. Zettabyte’s zWARE improves utilization with smart scheduling, dynamic load balancing, and power/cooling integration, acting like the electrical system of a high-performance race car and giving owners complete control and ownership.

What voltage sag levels should power systems be designed to tolerate during large compute job startups?

Power systems are typically designed to limit voltage sag to no more than approximately 8% of nominal voltage during large GPU job startups. This ensures that sudden current inrush from high-density GPU clusters does not disrupt system stability or impact ongoing operations.

Is a dual-feed A/B power design implemented to ensure continuous operation during a single-path failure at a 4 MW load?

A fully redundant 2N power architecture is required. This means two independent utility feeds, generators, UPS systems, and distribution paths all the way to the rack, ensuring that the failure of any single component does not cause an outage.

What is the backup power strategy during a utility failure, including UPS support duration and generator engagement timing?

The backup power strategy follows a UPS-to-generator handoff model. The UPS provides immediate, uninterruptible power upon utility loss, while generators are signaled to start. The generators are required to accept the full 4 MW load within the available UPS runtime, which is typically designed to be 5–15 minutes.

How is spare inventory for critical components (such as pumps, PDUs, and breakers) managed to meet sub-4-hour MTTR targets?

An on-site "hot spare" inventory of critical components is maintained. The inventory level is based on component failure rates (MTBF) and vendor lead times to ensure that any critical part can be replaced within the 4-hour Mean Time To Repair (MTTR) target. In Zettabyte deployments, sovereigns and enterprises benefit from a truly supply-chain-neutral AI data center model and unmatched procurement speed, eliminating costly delays and lowering customers' total cost of ownership.
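
A simplified sketch of how spare counts can be estimated from failure rates and lead times; the formula is a basic heuristic rather than a full Poisson sparing model, and the fleet size, MTBF, and lead time below are illustrative.

    # Estimate on-site spares as expected failures during the vendor restock lead time,
    # plus a safety unit. Simplified heuristic; confidence-level sparing models go further.
    import math

    def spares_needed(installed_units: int, mtbf_hours: float, lead_time_days: float,
                      safety_units: int = 1) -> int:
        expected_failures = installed_units * (lead_time_days * 24.0) / mtbf_hours
        return math.ceil(expected_failures) + safety_units

    # 400 PDUs with a 200,000 h MTBF and a 30-day restock lead time:
    # ~1.44 expected failures during the window -> keep 3 spares on site.
    print(spares_needed(installed_units=400, mtbf_hours=200_000, lead_time_days=30))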

System Integration & Interdependencies

What testing and contingency planning should be in place to manage lead time risks and verify performance at full load?

Integrated Systems Testing (IST) at full load should be performed before go-live. Contingency planning involves securing on-site spares for critical long-lead-time components (e.g., pumps, chillers, high-capacity breakers) and pre-qualifying alternative vendors. Zettabyte embeds these practices directly into data hall designs, commissioning plans, and SLAs, allowing operators to identify issues early, shorten deployment timelines, and bring AI capacity online with greater certainty and lower operational risk.

Should there be real-time coordination between GPU workload management and facility infrastructure for adaptive thermal/power control?

Yes, a Data Center Infrastructure Management (DCIM) system should be integrated with the workload scheduler. This allows the scheduler to place jobs based on real-time power and cooling availability in specific racks, preventing overloads. Zettabyte’s AI DCIM and command center gives customers precise operational control of their GPU estate across power, cooling, networking, and workload health, so capacity stays predictable and issues are isolated before they cascade. Zettabyte’s platform turns build specifications, rack topology, and thermal limits into clear, actionable runbooks, and keeps mixed GPU vendors and generations running as one governed fleet without forced upgrades.  
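
A highly simplified sketch of the placement rule described above; the rack headroom figures and the selection logic are illustrative, and an integrated DCIM/scheduler evaluates many more constraints (phase balance, thermal zones, network locality) before placing a job.

    # Choose a rack for a new job only if both power and cooling headroom cover the request.
    # The rack data is a placeholder for live DCIM telemetry.
    from typing import Optional

    RACKS = {
        "R01": {"power_headroom_kw": 18.0, "cooling_headroom_kw": 22.0},
        "R02": {"power_headroom_kw": 35.0, "cooling_headroom_kw": 30.0},
        "R03": {"power_headroom_kw": 9.0,  "cooling_headroom_kw": 40.0},
    }

    def place_job(required_kw: float) -> Optional[str]:
        candidates = [
            name for name, rack in RACKS.items()
            if rack["power_headroom_kw"] >= required_kw and rack["cooling_headroom_kw"] >= required_kw
        ]
        # Prefer the rack with the most remaining power headroom to avoid creating hot spots.
        return max(candidates, key=lambda n: RACKS[n]["power_headroom_kw"], default=None)

    print(place_job(25.0))  # -> 'R02'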

How should workload scheduling and checkpointing be prioritized when spare liquid-cooled servers are limited and a node fails during training?

A resilient scheduling policy should be applied. In Zettabyte environments, zWARE, a full AI Digital Command Center integrating orchestration, observability, and operational control across the entire AI stack, is deployed. zWARE incorporates ultra-fine-grained monitoring, supports multi-cluster federation across heterogeneous GPU domains, and performs intelligent workload-to-hardware matching based on real-time system state, delivering continuous operator alerting and feedback loops. This allows critical jobs to be configured with frequent checkpointing to preserve progress. If a node fails, the scheduler attempts to restart the job from the most recent checkpoint on an available spare node. When no spare capacity is available, the job is queued and rescheduled based on its assigned priority.
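
A minimal sketch of the restart-from-checkpoint policy described above; the job and node structures are illustrative placeholders, and zWARE's actual scheduler also weighs topology, priority classes, and live telemetry when making these decisions.

    # On node failure: resume from the latest checkpoint on a spare node if one exists,
    # otherwise requeue the job by priority. Data structures are simplified placeholders.

    def handle_node_failure(job: dict, spare_nodes: list, queue: list) -> str:
        checkpoint = job.get("last_checkpoint", "none")
        if spare_nodes:
            target = spare_nodes.pop(0)
            return f"restarting job {job['id']} from checkpoint {checkpoint} on {target}"
        queue.append(job)
        queue.sort(key=lambda j: j["priority"], reverse=True)   # highest priority runs first
        return f"no spare capacity: job {job['id']} requeued at priority {job['priority']}"

    queue = []
    print(handle_node_failure({"id": "train-042", "priority": 8, "last_checkpoint": "step-12000"},
                              spare_nodes=[], queue=queue))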