Software (GPU / AI Software Stack)
Autoscaling
Autoscaling automatically adjusts computing resources based on current workload demand.
BF16
A 16-bit floating-point format optimized for AI training with improved numerical stability versus FP16.
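A minimal sketch, assuming PyTorch is available, of why BF16 is more numerically robust than FP16: it keeps FP32's exponent range while giving up mantissa precision.

```python
import torch

# BF16 keeps FP32's 8-bit exponent, so its representable range matches FP32;
# FP16 has a 5-bit exponent and overflows beyond ~65504.
print(torch.finfo(torch.float32).max)   # ~3.4e38
print(torch.finfo(torch.bfloat16).max)  # ~3.4e38
print(torch.finfo(torch.float16).max)   # 65504.0

x = torch.tensor([70000.0])
print(x.to(torch.float16))   # inf  (overflow)
print(x.to(torch.bfloat16))  # ~70144 (coarser precision, but still in range)
```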
Block
A group of threads (executed in warps) that runs together on a single streaming multiprocessor and can share memory and synchronize.
Checkpointing
Saving model and optimizer state during training so work can resume after interruption.
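A minimal checkpointing sketch using PyTorch; the model, optimizer, step counter, and file path are illustrative placeholders.

```python
import torch

# Save model and optimizer state periodically so training can resume after an interruption.
def save_checkpoint(model, optimizer, step, path="ckpt.pt"):
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)

def load_checkpoint(model, optimizer, path="ckpt.pt"):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]   # resume from the saved step
```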
Compute-Bound vs Memory-Bound
Compute-bound vs memory-bound describes whether performance is limited by GPU arithmetic capacity or by how fast data can be fed to the GPU.
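A back-of-the-envelope way to tell the two apart is arithmetic intensity (FLOPs per byte moved); the hardware numbers below are illustrative, not the spec of any particular GPU.

```python
# Roofline-style check: compare a kernel's FLOPs-per-byte to the machine balance point.
peak_flops = 300e12      # sustainable FLOP/s of the GPU (illustrative)
peak_bw    = 2e12        # memory bandwidth in bytes/s (illustrative)

flops_per_byte = 8.0     # the kernel's arithmetic intensity

machine_balance = peak_flops / peak_bw   # FLOPs/byte the hardware can keep fed
print("memory-bound" if flops_per_byte < machine_balance else "compute-bound")
```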
Container
A container is a lightweight, portable software package that bundles an application with its dependencies so it runs consistently across different machines.
CUDA
NVIDIA’s proprietary programming platform that allows software to run computations directly on NVIDIA GPUs.
Data Parallelism
Data parallelism splits training data across multiple GPUs so each GPU trains on a portion of the dataset.
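A minimal sketch of data parallelism with PyTorch DistributedDataParallel; it assumes a launcher such as `torchrun` starts one process per GPU, and the model is a stand-in.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; each rank trains on its own shard of the data.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # stand-in model
model = DDP(model, device_ids=[local_rank])            # gradients are all-reduced across GPUs
```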
DCGM
NVIDIA Data Center GPU Manager, used for GPU telemetry, health checks, and monitoring.
Deep Learning Framework
High-level AI software such as PyTorch, TensorFlow, or JAX that abstracts GPU programming.
Distributed Training
Distributed training uses multiple GPUs or servers working together to train a single model faster or train models too large for one machine.
FP16
A 16-bit floating-point numeric format that, relative to FP32, improves throughput and reduces memory and power usage.
FP32
A 32-bit floating-point numeric format used for higher-precision computation.
GPU Driver
Low-level software that allows the operating system and applications to communicate with GPU hardware.
GPU Memory Hierarchy
The layered memory system (registers, shared memory, caches, HBM, and host memory) that determines access speed and bandwidth.
Graph Compilation
Optimizing and compiling compute graphs for faster, more efficient execution on target hardware.
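A short sketch of graph compilation in PyTorch 2.x, where torch.compile captures the model's compute graph and emits optimized, fused kernels; the model here is illustrative.

```python
import torch

# torch.compile traces the model's compute graph and compiles it for the target hardware.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
compiled = torch.compile(model)          # first call triggers compilation
out = compiled(torch.randn(8, 512))      # subsequent calls reuse the compiled graph
```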
High Bandwidth Memory (HBM)
High Bandwidth Memory is stacked, ultra-fast memory located next to the GPU die to support large-model training and high data throughput.
Inference
Running a trained model to produce outputs (predictions, generations, classifications).
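A minimal inference sketch in PyTorch; the model and input are placeholders.

```python
import torch

# Inference: run a trained model forward, with no gradients and no weight updates.
model = torch.nn.Linear(10, 1)
model.eval()                          # disable training-only behavior (e.g., dropout)
with torch.no_grad():                 # skip gradient bookkeeping
    prediction = model(torch.randn(1, 10))
```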
INT8
An 8-bit integer format commonly used to accelerate and compress AI inference.
Kernel
A function that runs in parallel across many GPU threads.
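One way to see the kernel/block/thread structure from Python is Numba's CUDA JIT; this sketch assumes Numba and a CUDA-capable GPU are available.

```python
from numba import cuda
import numpy as np

# A kernel is one function body executed by many GPU threads in parallel.
@cuda.jit
def add_one(x):
    i = cuda.grid(1)          # this thread's global index
    if i < x.size:
        x[i] += 1.0

data = cuda.to_device(np.zeros(1_000_000, dtype=np.float32))
threads_per_block = 256
blocks = (data.size + threads_per_block - 1) // threads_per_block
add_one[blocks, threads_per_block](data)   # launch a grid of blocks of threads
```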
Kubernetes (K8s)
Kubernetes is an orchestration platform that automates deployment, scaling, and management of containerized AI workloads.
Large Language Model (LLM)
A Large Language Model is an AI model trained on massive datasets to understand and generate human-like text.
MFU (Model FLOPs Utilization)
A metric estimating how much of a GPU’s theoretical compute is actually achieved by a workload.
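A common rule-of-thumb MFU estimate for dense Transformers assumes roughly 6 FLOPs per parameter per trained token; all numbers below are illustrative.

```python
# MFU = achieved training FLOP/s divided by the cluster's peak FLOP/s.
params          = 7e9       # model size (illustrative)
tokens_per_sec  = 50_000    # measured training throughput across all GPUs (illustrative)
num_gpus        = 8
peak_flops_gpu  = 1e15      # peak FLOP/s per GPU at the training precision (illustrative)

achieved = 6 * params * tokens_per_sec
mfu = achieved / (num_gpus * peak_flops_gpu)
print(f"MFU ~ {mfu:.1%}")    # ~26% with these numbers
```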
MIG (Multi-Instance GPU)
NVIDIA hardware feature that partitions one GPU into multiple isolated GPU instances.
Mixed Precision
Using multiple numeric precisions together to improve throughput and efficiency while preserving model quality.
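A minimal mixed-precision training sketch in PyTorch: matrix multiplies run in FP16 under autocast while master weights stay in FP32 and the loss is scaled to avoid gradient underflow. The model, data, and hyperparameters are illustrative.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).square().mean()      # forward pass runs in reduced precision
scaler.scale(loss).backward()            # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)
scaler.update()
```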
MPI
The Message Passing Interface, a standard for message passing between processes in distributed computing.
Multi-Tenancy
Operating a cluster so multiple users or organizations can securely share the same infrastructure.
NCCL
NVIDIA’s library for fast collective communication (e.g., all-reduce) across GPUs and nodes.
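A sketch of an all-reduce over the NCCL backend via torch.distributed; it assumes a launcher such as `torchrun` starts one process per GPU.

```python
import torch
import torch.distributed as dist

# Every rank contributes a tensor and receives the element-wise sum.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

t = torch.ones(4, device="cuda") * rank
dist.all_reduce(t, op=dist.ReduceOp.SUM)   # NCCL routes this over NVLink / the network
print(rank, t)                              # every rank now holds the same summed tensor
```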
NVLink
NVIDIA’s high-speed interconnect for GPU-to-GPU communication inside a server.
Occupancy
The ratio of active warps on an SM to the maximum number of warps it can hold; a measure of how fully the SM is occupied.
Orchestrator
Software that manages job lifecycle, placement, scaling, failures, and recovery across a cluster.
Pod
A Pod is the smallest deployable unit in Kubernetes, typically containing one or more tightly coupled containers.
ROCm
AMD’s open software platform for programming and running workloads on AMD GPUs.
Runtime
The execution layer that manages GPU kernels, memory allocation, and scheduling during program execution.
Scheduler
Software that decides which workloads get which GPUs and when.
Streaming Multiprocessor (SM)
The core compute unit inside an NVIDIA GPU that executes threads and warps.
Tensor Core
Specialized GPU hardware designed to accelerate matrix operations used in AI workloads.
Thread
The smallest unit of execution on a GPU.
Training
Optimizing model weights using data and backpropagation so the model learns to perform its task.
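A minimal training-loop sketch in PyTorch; the model, data, and hyperparameters are illustrative placeholders.

```python
import torch

# One training step: forward pass, loss, backpropagation, weight update.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(64, 10), torch.randn(64, 1)
for _ in range(100):
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()          # backpropagation computes gradients
    optimizer.step()         # gradients update the weights
```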
Transformer
The Transformer is a neural network architecture built around attention mechanisms and used in most state-of-the-art language and vision models.
Unified Memory
A memory model that lets CPU and GPU share a single memory address space.
vGPU
A virtualized GPU that allows multiple users or workloads to share a single physical GPU.
Warp
A group of 32 threads on NVIDIA GPUs that execute the same instruction in lockstep.