Software (GPU / AI Software Stack)
Autoscaling
Autoscaling automatically adjusts computing resources based on current workload demand.
BF16
A 16-bit floating-point format optimized for AI training with improved numerical stability versus FP16.
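A minimal sketch, assuming PyTorch is available, of why BF16 is more numerically robust than FP16: it keeps FP32's exponent range while giving up mantissa precision.

```python
import torch

# BF16 keeps FP32's 8-bit exponent, so its representable range matches FP32;
# FP16 has a 5-bit exponent and overflows beyond ~65504.
print(torch.finfo(torch.float32).max)   # ~3.4e38
print(torch.finfo(torch.bfloat16).max)  # ~3.4e38
print(torch.finfo(torch.float16).max)   # 65504.0

x = torch.tensor([70000.0])
print(x.to(torch.float16))   # inf  (overflow)
print(x.to(torch.bfloat16))  # ~70144 (coarser precision, but still in range)
```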
Block
A group of threads (executed in warps) that runs together on a single streaming multiprocessor and can share memory and synchronize.
Checkpointing
Saving model and optimizer state during training so work can resume after interruption.
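A minimal checkpointing sketch using PyTorch; the model, optimizer, step counter, and file path are illustrative placeholders.

```python
import torch

# Save model and optimizer state periodically so training can resume after an interruption.
def save_checkpoint(model, optimizer, step, path="ckpt.pt"):
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)

def load_checkpoint(model, optimizer, path="ckpt.pt"):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]   # resume from the saved step
```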
Compute-Bound vs Memory-Bound
Compute-bound vs memory-bound describes whether performance is limited by GPU arithmetic capacity or by how fast data can be fed to the GPU.
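A back-of-the-envelope way to tell the two apart is arithmetic intensity (FLOPs per byte moved); the hardware numbers below are illustrative, not the spec of any particular GPU.

```python
# Roofline-style check: compare a kernel's FLOPs-per-byte to the machine balance point.
peak_flops = 300e12      # sustainable FLOP/s of the GPU (illustrative)
peak_bw    = 2e12        # memory bandwidth in bytes/s (illustrative)

flops_per_byte = 8.0     # the kernel's arithmetic intensity

machine_balance = peak_flops / peak_bw   # FLOPs/byte the hardware can keep fed
print("memory-bound" if flops_per_byte < machine_balance else "compute-bound")
```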
Container
A container is a lightweight, portable software package that bundles an application with its dependencies so it runs consistently across different machines.
CUDA
NVIDIA’s proprietary programming platform that allows software to run computations directly on NVIDIA GPUs.
Data Parallelism
Data parallelism splits training data across multiple GPUs so each GPU trains on a portion of the dataset.
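A minimal sketch of data parallelism with PyTorch DistributedDataParallel; it assumes a launcher such as `torchrun` starts one process per GPU, and the model is a stand-in.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; each rank trains on its own shard of the data.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # stand-in model
model = DDP(model, device_ids=[local_rank])            # gradients are all-reduced across GPUs
```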
DCGM
NVIDIA Data Center GPU Manager, used for GPU telemetry, health checks, and monitoring.
Deep Learning Framework
High-level AI software such as PyTorch, TensorFlow, or JAX that abstracts GPU programming.
Distributed Training
Distributed training uses multiple GPUs or servers working together to train a single model faster or train models too large for one machine.
FP16
A 16-bit floating-point numeric format that, relative to FP32, improves throughput and reduces memory and power usage.
FP32
A 32-bit floating-point numeric format used for higher-precision computation.
GPU Driver
Low-level software that allows the operating system and applications to communicate with GPU hardware.
GPU Memory Hierarchy
The layered memory system (registers, shared memory, caches, HBM, and host memory) that determines access speed and bandwidth.
Graph Compilation
Optimizing and compiling compute graphs for faster, more efficient execution on target hardware.
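A short sketch of graph compilation in PyTorch 2.x, where torch.compile captures the model's compute graph and emits optimized, fused kernels; the model here is illustrative.

```python
import torch

# torch.compile traces the model's compute graph and compiles it for the target hardware.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
compiled = torch.compile(model)          # first call triggers compilation
out = compiled(torch.randn(8, 512))      # subsequent calls reuse the compiled graph
```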
High Bandwidth Memory (HBM)
High Bandwidth Memory is stacked, ultra-fast memory located next to the GPU die to support large-model training and high data throughput.
Inference
Running a trained model to produce outputs (predictions, generations, classifications).
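A minimal inference sketch in PyTorch; the model and input are placeholders.

```python
import torch

# Inference: run a trained model forward, with no gradients and no weight updates.
model = torch.nn.Linear(10, 1)
model.eval()                          # disable training-only behavior (e.g., dropout)
with torch.no_grad():                 # skip gradient bookkeeping
    prediction = model(torch.randn(1, 10))
```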
INT8
An 8-bit integer format commonly used to accelerate and compress AI inference.
Kernel
A function that runs in parallel across many GPU threads.
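One way to see the kernel/block/thread structure from Python is Numba's CUDA JIT; this sketch assumes Numba and a CUDA-capable GPU are available.

```python
from numba import cuda
import numpy as np

# A kernel is one function body executed by many GPU threads in parallel.
@cuda.jit
def add_one(x):
    i = cuda.grid(1)          # this thread's global index
    if i < x.size:
        x[i] += 1.0

data = cuda.to_device(np.zeros(1_000_000, dtype=np.float32))
threads_per_block = 256
blocks = (data.size + threads_per_block - 1) // threads_per_block
add_one[blocks, threads_per_block](data)   # launch a grid of blocks of threads
```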
Kubernetes (K8s)
Kubernetes is an orchestration platform that automates deployment, scaling, and management of containerized AI workloads.
Large Language Model (LLM)
A Large Language Model is an AI model trained on massive datasets to understand and generate human-like text.
MFU (Model FLOPs Utilization)
A metric estimating how much of a GPU’s theoretical compute is actually achieved by a workload.
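A common rule-of-thumb MFU estimate for dense Transformers assumes roughly 6 FLOPs per parameter per trained token; all numbers below are illustrative.

```python
# MFU = achieved training FLOP/s divided by the cluster's peak FLOP/s.
params          = 7e9       # model size (illustrative)
tokens_per_sec  = 50_000    # measured training throughput across all GPUs (illustrative)
num_gpus        = 8
peak_flops_gpu  = 1e15      # peak FLOP/s per GPU at the training precision (illustrative)

achieved = 6 * params * tokens_per_sec
mfu = achieved / (num_gpus * peak_flops_gpu)
print(f"MFU ~ {mfu:.1%}")    # ~26% with these numbers
```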
MIG (Multi-Instance GPU)
NVIDIA hardware feature that partitions one GPU into multiple isolated GPU instances.
Mixed Precision
Using multiple numeric precisions together to improve throughput and efficiency while preserving model quality.
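A minimal mixed-precision training sketch in PyTorch: matrix multiplies run in FP16 under autocast while master weights stay in FP32 and the loss is scaled to avoid gradient underflow. The model, data, and hyperparameters are illustrative.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).square().mean()      # forward pass runs in reduced precision
scaler.scale(loss).backward()            # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)
scaler.update()
```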
MPI
The Message Passing Interface, a standard for message passing between processes in distributed computing.
Multi-Tenancy
Operating a cluster so multiple users or organizations can securely share the same infrastructure.
NCCL
NVIDIA’s library for fast collective communication (e.g., all-reduce) across GPUs and nodes.
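A sketch of an all-reduce over the NCCL backend via torch.distributed; it assumes a launcher such as `torchrun` starts one process per GPU.

```python
import torch
import torch.distributed as dist

# Every rank contributes a tensor and receives the element-wise sum.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

t = torch.ones(4, device="cuda") * rank
dist.all_reduce(t, op=dist.ReduceOp.SUM)   # NCCL routes this over NVLink / the network
print(rank, t)                              # every rank now holds the same summed tensor
```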
NVLink
NVIDIA’s high-speed interconnect for GPU-to-GPU communication inside a server.
Occupancy
The ratio of active warps on an SM to the maximum number of warps it can hold; a measure of how fully the SM is occupied.
Orchestrator
Software that manages job lifecycle, placement, scaling, failures, and recovery across a cluster.
Pod
A Pod is the smallest deployable unit in Kubernetes, typically containing one or more tightly coupled containers.
ROCm
AMD’s open software platform for programming and running workloads on AMD GPUs.
Runtime
The execution layer that manages GPU kernels, memory allocation, and scheduling during program execution.
Scheduler
Software that decides which workloads get which GPUs and when.
Streaming Multiprocessor (SM)
The core compute unit inside an NVIDIA GPU that executes threads and warps.
Tensor Core
Specialized GPU hardware designed to accelerate matrix operations used in AI workloads.
Thread
The smallest unit of execution on a GPU.
Training
Optimizing model weights using data and backpropagation so the model learns to perform its task.
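A minimal training-loop sketch in PyTorch; the model, data, and hyperparameters are illustrative placeholders.

```python
import torch

# One training step: forward pass, loss, backpropagation, weight update.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(64, 10), torch.randn(64, 1)
for _ in range(100):
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()          # backpropagation computes gradients
    optimizer.step()         # gradients update the weights
```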
Transformer
The Transformer is a neural network architecture built around attention mechanisms and used in most state-of-the-art language and vision models.
Unified Memory
A memory model that lets CPU and GPU share a single memory address space.
vGPU
A virtualized GPU that allows multiple users or workloads to share a single physical GPU.
Warp
A group of 32 threads on NVIDIA GPUs that execute the same instruction in lockstep.