
Measuring useful GPU work with model FLOPs utilization and model bandwidth utilization
H100 one-year rental contracts rose roughly 40% between October 2025 and March 2026, and on-demand capacity has been hard to find, so the cost question has shifted from getting GPUs to getting useful work out of the ones already paid for. This is a look at what GPU utilization actually measures, why a reading near 100% can still hide wasted spend, and the two metrics used to separate a busy GPU from a productive one.
The shift from GPU access to GPU efficiency
H100 one-year rental contracts rose from about $1.70 to $2.35 per GPU-hour between October 2025 and March 2026, an increase of roughly 40%, and on-demand capacity has been largely sold out across GPU types, per SemiAnalysis's rental index.¹ When capacity is hard to get and the price has risen, the practical question for a team moves from whether it can get GPUs to whether it is getting its money's worth from the ones it has.
Buyer surveys point the same way. In a Q1 2026 market tracker reported by VentureBeat, GPU access and availability fell as a reason teams pick a provider, from about 21% to 15% of respondents in a single quarter, while cost per inference and total cost of ownership rose as a stated priority. The sample was small and the tracker is directional rather than definitive, but the pattern held across both survey waves.²
The workload mix is part of this. Inference now accounts for the majority of AI compute, and analyst estimates put it at 70-80% of AI compute spend by around 2026-2027 (Morgan Stanley and others).³ A model is trained in concentrated bursts and then served continuously, so over its life most of the GPU-hours go to inference, which is where small efficiency differences add up.
The gap between utilization and useful work
The most common efficiency number is GPU utilization as reported by tools like NVIDIA's Data Center GPU Manager (DCGM). It mostly answers whether the GPU was busy, not whether it was doing the work your model needs. A GPU can show a utilization figure near 100% while much of that time goes to memory stalls, communication waits, or kernels that are not advancing the model.
The gap can be large. Engineers at Trainy described a training run that read 100% GPU utilization but only about 20% Model FLOPs Utilization, the measure of useful compute; after they applied fused kernels and tuned model parallelism, utilization stayed high while the useful-work figure rose to about 38%, per a write-up in MarkTechPost.⁴ The utilization graph looked much the same before and after, while the work done per GPU-hour roughly doubled.
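The effect on spend can be put in rough numbers. The sketch below is an illustrative framing rather than a standard metric: it divides the hourly price by MFU to get the price of one GPU-hour's worth of peak-rate compute. The function name is hypothetical, and the $2.35 figure is the March 2026 contract price quoted earlier, used here only as an example input.

```python
def cost_per_useful_gpu_hour(price_per_gpu_hour: float, mfu: float) -> float:
    """Price paid for one GPU-hour of peak-rate useful compute.

    Illustrative assumption: useful work scales linearly with MFU.
    """
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    return price_per_gpu_hour / mfu


PRICE = 2.35  # USD per GPU-hour, example input taken from the article

before = cost_per_useful_gpu_hour(PRICE, 0.20)  # run at about 20% MFU
after = cost_per_useful_gpu_hour(PRICE, 0.38)   # same run at about 38% MFU
work_gain = 0.38 / 0.20                         # useful work per GPU-hour, after vs before

print(f"before: ${before:.2f}  after: ${after:.2f}  gain: {work_gain:.1f}x")
```

The utilization graph would not register any of this change, which is the point of tracking the useful-work figure separately.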
So "busy" and "productive" are separate questions, and the metric that fits depends on whether the workload is training or inference.
Model FLOPs utilization for training
Model FLOPs Utilization (MFU) is the ratio of the throughput a GPU actually achieves to the throughput it would reach running at its theoretical peak FLOPs. Google's PaLM team proposed it as a hardware-agnostic way to compare training efficiency across different setups.
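As a minimal sketch of that ratio, the function below estimates MFU for a dense transformer from observed token throughput. It assumes the common approximation of about 6 FLOPs per parameter per training token (forward plus backward pass) and ignores attention FLOPs, so it is a rough estimate rather than an exact count. The function name, the model size, the throughput, and the peak figure are all illustrative assumptions; the peak should come from the vendor datasheet for the precision actually in use.

```python
def training_mfu(n_params: float, tokens_per_second: float,
                 n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Estimate Model FLOPs Utilization for dense transformer training."""
    # Assumption: ~6 FLOPs per parameter per token (forward + backward),
    # attention FLOPs ignored.
    achieved_flops_per_s = 6.0 * n_params * tokens_per_second
    peak_flops_per_s = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical run: 70B parameters on 8 GPUs at 7,000 tokens/s overall.
# 989e12 is an assumed dense BF16 peak per GPU (H100 SXM class).
mfu = training_mfu(n_params=70e9, tokens_per_second=7_000,
                   n_gpus=8, peak_flops_per_gpu=989e12)
print(f"MFU ~ {mfu:.1%}")
```

Because the numerator counts only the FLOPs the model needs, recomputation or padding that keeps the GPU busy does not raise the result.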
As a rough guide, the industry treats 35-45% MFU as good for large-model training, with 50%+ considered strong. Reported figures sit in or just above that band: Meta's Llama 3 405B reported 38-43% MFU during training,⁶ and NVIDIA's Megatron-LM reports up to 47% on H100 clusters.⁷ Mixture-of-experts models trained in fp8 tend to come in lower, nearer 20%. These are snapshots as of mid-2026 and move with hardware and framework changes.
When MFU is low, the time is usually going somewhere specific: communication between GPUs, pipeline bubbles, memory stalls, or kernels that leave the tensor cores idle. There is one further adjustment on long runs. Effective MFU, as clockwork.io describes it, discounts compute that ran but did not advance the model, most of all the work replayed after a failure forces a restart from the last checkpoint.⁵ A run with a healthy instantaneous MFU can still have a lower effective figure once restarts are counted.
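One way to picture the restart discount is to scale instantaneous MFU by the share of compute time that actually advanced the model. The sketch below is an assumed simplification for illustration, not clockwork.io's formula; the function and parameter names are hypothetical, and it counts only replayed work, not the idle time spent recovering.

```python
def effective_mfu(instantaneous_mfu: float, running_hours: float,
                  replayed_hours: float) -> float:
    """Discount MFU for compute that ran but was replayed after restarts.

    Assumed simplification: replayed work ran at the same MFU as the
    rest of the run but contributed nothing to training progress.
    """
    if running_hours <= 0 or not 0 <= replayed_hours <= running_hours:
        raise ValueError("need 0 <= replayed_hours <= running_hours, running_hours > 0")
    productive_share = (running_hours - replayed_hours) / running_hours
    return instantaneous_mfu * productive_share


# Hypothetical 30-day run (720 h) at 40% MFU where restarts from the
# last checkpoint forced 60 h of work to be redone.
eff = effective_mfu(instantaneous_mfu=0.40, running_hours=720, replayed_hours=60)
print(f"effective MFU ~ {eff:.1%}")
```

Under this framing, checkpointing more often shrinks the replayed hours and lifts the effective figure without touching the kernels at all.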
Model bandwidth utilization for inference
Inference decode is usually limited by memory bandwidth rather than compute, which makes MFU the wrong lens for it. Generating each token requires moving the model's active weights from memory to the compute units, and at small batch sizes the GPU spends most of its time waiting on that data movement rather than doing math.
Model Bandwidth Utilization (MBU) measures this directly: the memory bandwidth a workload actually achieves divided by the hardware's peak. As a reference point, published benchmarks show around 60% MBU on a pair of H100-80GB GPUs at batch size 1,⁹ and MBU tends to fall as batch size grows. For single-request decode speed, MBU is the number to watch rather than MFU, because the bottleneck is bandwidth (per a recent write-up from kog.ai).⁸
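A minimal sketch of the calculation, assuming each decode step reads every model weight plus the KV cache once: achieved bandwidth is bytes moved divided by time per output token, and MBU is that figure over the aggregate peak. The function name and all inputs are hypothetical, chosen only to show the arithmetic and not taken from any benchmark. The 3.35e12 bytes/s per-GPU peak is an assumed H100 SXM datasheet figure.

```python
def decode_mbu(n_params: float, bytes_per_param: float,
               kv_cache_bytes: float, seconds_per_token: float,
               peak_bandwidth_bytes_per_s: float) -> float:
    """Estimate Model Bandwidth Utilization for autoregressive decode."""
    # Assumption: one full read of the weights and the KV cache per token.
    bytes_moved_per_token = n_params * bytes_per_param + kv_cache_bytes
    achieved_bandwidth = bytes_moved_per_token / seconds_per_token
    return achieved_bandwidth / peak_bandwidth_bytes_per_s


# Hypothetical: 70B parameters in 16-bit weights, 2 GB of KV cache,
# 35 ms per output token, sharded across two GPUs.
mbu = decode_mbu(n_params=70e9, bytes_per_param=2, kv_cache_bytes=2e9,
                 seconds_per_token=0.035,
                 peak_bandwidth_bytes_per_s=2 * 3.35e12)
print(f"MBU ~ {mbu:.1%}")
```

Read the other way, the same formula gives a floor on time per token: bytes moved divided by peak bandwidth is the fastest a single request can decode on that hardware.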
The levers for raising it differ from training. Batching trades single-request latency for higher overall throughput, KV-cache handling affects how much memory moves per token, and lower-precision formats such as fp8 or fp4 change the ratio of math to bytes moved. Serving stacks like vLLM exist largely to manage these trade-offs.
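The batching and precision levers can be sketched with a simple roofline-style comparison. Per decode step each weight is read once and used for roughly 2 FLOPs per sequence in the batch, so arithmetic intensity grows with batch size and with fewer bytes per parameter. The sketch ignores KV-cache traffic and attention FLOPs, the function names are hypothetical, and the peak figures are assumed H100-class values for 16-bit math. Lower-precision formats also raise the peak FLOPs, which this example does not model.

```python
def decode_arithmetic_intensity(batch_size: int, bytes_per_param: float) -> float:
    """Approximate FLOPs per byte of weights read in one decode step."""
    # ~2 FLOPs per parameter per sequence; weights are read once per step.
    return 2.0 * batch_size / bytes_per_param


def is_bandwidth_bound(batch_size: int, bytes_per_param: float,
                       peak_flops: float, peak_bandwidth: float) -> bool:
    """True when intensity sits below the hardware ridge point."""
    ridge_point = peak_flops / peak_bandwidth  # FLOPs per byte
    return decode_arithmetic_intensity(batch_size, bytes_per_param) < ridge_point


PEAK_FLOPS = 989e12      # assumed dense 16-bit peak, FLOPs/s
PEAK_BANDWIDTH = 3.35e12  # assumed HBM peak, bytes/s

for batch in (1, 32, 256, 512):
    bound = is_bandwidth_bound(batch, 2, PEAK_FLOPS, PEAK_BANDWIDTH)
    label = "bandwidth-bound" if bound else "compute-bound"
    print(f"batch {batch:>3}: {label}")
```

Under these assumptions the crossover sits at a batch size of a few hundred, which is why small-batch decode is read through MBU while large batches start to look more like a compute problem.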
Matching the metric to the workload
The metric that tells you whether spend maps to work depends on the workload. The reference below pairs each metric with what it measures, a rough sense of a good reading, and what a low one usually points to.
At the platform level, the same gap between busy and useful is what topology-aware scheduling, GPU partitioning, and pooling of shared capacity are built to reduce. Those are operator levers more than application ones, but they show up in the same metrics.
Raising the work done per GPU-hour is usually cheaper than buying more GPU-hours, so the metric to watch is the one that fits the workload: MFU for training and MBU for inference decode. The plain utilization graph is best read as a check for idle machines, not as proof that the spend is working.
Sources
1. SemiAnalysis, The Great GPU Shortage: Rental Capacity (H100 one-year rental price index), April 2026. https://newsletter.semianalysis.com/p/the-great-gpu-shortage-rental-capacity
2. VentureBeat, on the Q1 2026 AI Infrastructure and Compute Market Tracker (GPU utilization and TCO), 2026. https://venturebeat.com/infrastructure/5-gpu-utilization-the-401-billion-ai-infrastructure-problem-enterprises-cant-keep-ignoring
3. Morgan Stanley, AI Enters a New Phase of Inference. https://www.morganstanley.com.au/ideas/ai-enters-a-new-phase-of-inference
4. MarkTechPost, on Trainy's analysis of GPU utilization versus MFU (100% utilization, about 20% MFU), September 2024. https://www.marktechpost.com/2024/09/03/why-gpu-utilization-falls-short-understanding-streaming-multiprocessor-sm-efficiency-for-better-llm-performance/
5. clockwork.io, Decoding GPU Efficiency Part 1: The FLOPs Fallacy (effective MFU), March 2026. https://clockwork.io/blog/decoding-gpu-efficiency-part-1-the-flops-fallacy/
6. Glenn K. Lockwood, Model FLOPs Utilization (definition; Llama 3.1 at 38-43%). https://www.glennklockwood.com/garden/MFU
7. NVIDIA, Megatron-LM repository (up to 47% MFU on H100 clusters). https://github.com/NVIDIA/Megatron-LM
8. kog.ai, Real-time LLM Inference on Standard Datacenter GPUs (MBU as the decode metric), 2026. https://blog.kog.ai/real-time-llm-inference-on-standard-gpus-3-000-tokens-s-per-request/
9. Latitude blog, Best Practices for LLM Hardware Benchmarking (about 60% MBU on 2xH100 at batch size 1), January 2026. https://latitude-blog.ghost.io/blog/llm-hardware-benchmarking-best-practices/
