Nvidia Corporation

09/09/2025 | News release | Distributed by Public on 09/09/2025 11:40

NVIDIA Blackwell Ultra Sets New Inference Records in MLPerf Debut

As large language models (LLMs) grow larger, they get smarter, with open models from leading developers now featuring hundreds of billions of parameters. At the same time, today's leading models are also capable of reasoning, which means that they generate many intermediate reasoning tokens before delivering a final response to the user. The combination of these two trends (larger models that think using more tokens) drives the need for significantly higher compute performance.

Delivering the highest performance on production workloads takes a state-of-the-art technology stack spanning chips, systems, and software, along with an expansive developer ecosystem that is constantly building on that stack.

MLPerf Inference v5.1 is the latest version of the MLPerf Inference industry standard benchmark. With benchmark rounds held twice per year, the benchmark features many tests of AI inference performance and is regularly updated with new models and scenarios. This round features:

DeepSeek-R1 - a popular 671-billion parameter mixture-of-experts (MoE) reasoning model, developed by DeepSeek. In the server scenario, the time-to-first-token (TTFT) threshold is 2 seconds with a 12.5 tokens/second/user (TPS/user) target. All TPS/user targets are 99th percentile, meaning that 99% of tokens meet or exceed that TPS/user speed.
Llama 3.1 405B - MLPerf Inference v5.1 adds a new interactive scenario for the largest of the Llama 3.1 series of models, providing a faster 12.5 TPS/user threshold with a shorter 4.5 second TTFT requirement compared to the existing server scenario.
Llama 3.1 8B - an 8-billion parameter member of the Llama 3.1 series of models with offline, server (2 second TTFT, 10 TPS/user), and interactive (0.5 second TTFT, 33 TPS/user) scenarios. This replaces the GPT-J benchmark used in prior rounds.
Whisper - a popular speech recognition model that recently saw nearly 5 million downloads in a month on HuggingFace. This replaces RNN-T, which was featured in prior editions of the MLPerf Inference benchmark suite.
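
The latency constraints listed above can be made concrete with a short sketch. The Python below is a minimal illustration and not the MLPerf LoadGen implementation: the function name `meets_constraints`, the timestamp layout, and the nearest-rank percentile rule are all assumptions made for this example. It treats a TPS/user target as a ceiling on the gap between consecutive tokens, so 12.5 TPS/user corresponds to 1/12.5 = 0.08 seconds per token.

```python
import math

def meets_constraints(request_time, token_times, ttft_limit_s,
                      tps_user_target, percentile=99):
    """Check one request against a TTFT limit and a per-user token-rate target.

    request_time: when the query was issued (seconds).
    token_times: arrival time of each output token (seconds, ascending).
    Hypothetical helper for illustration; not part of MLPerf LoadGen.
    """
    ttft = token_times[0] - request_time
    # Gaps between consecutive tokens, sorted from fastest to slowest.
    gaps = sorted(b - a for a, b in zip(token_times, token_times[1:]))
    if not gaps:
        return ttft <= ttft_limit_s
    # Nearest-rank percentile (an assumption): the gap that
    # `percentile` percent of all gaps do not exceed.
    rank = max(1, math.ceil(percentile / 100 * len(gaps)))
    p_gap = gaps[rank - 1]
    # A target of N tokens/second/user allows at most 1/N seconds per token.
    return ttft <= ttft_limit_s and p_gap <= 1.0 / tps_user_target

# Example: first token after 1.5 s, then one token every 0.07 s
# (about 14.3 TPS/user), checked against a 2 s TTFT and 12.5 TPS/user.
times = [1.5 + 0.07 * i for i in range(200)]
ok = meets_constraints(0.0, times, ttft_limit_s=2.0, tps_user_target=12.5)
```

In this sketch a request fails if either the first token arrives late or more than 1% of its token gaps are slower than the target allows.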

This round, NVIDIA submitted the first results using the new Blackwell Ultra architecture, announced in March. The submission came just six months after Blackwell made its debut in the available category in MLPerf Inference v5.0, setting new inference performance records. Additionally, the NVIDIA platform set new performance records on all newly added benchmarks this round (DeepSeek-R1, Llama 3.1 405B, Llama 3.1 8B, and Whisper) and continues to hold per-GPU performance records on all other MLPerf Inference benchmarks.

MLPerf Inference Per-Accelerator Records
Benchmark	Offline	Server	Interactive
DeepSeek-R1	5,842 tokens/second/GPU	2,907 tokens/second/GPU	**
Llama 3.1 405B	224 tokens/second/GPU	170 tokens/second/GPU	138 tokens/second/GPU
Llama 2 70B 99.9%	12,934 tokens/second/GPU	12,701 tokens/second/GPU	7,856 tokens/second/GPU
Llama 2 70B 99%	13,015 tokens/second/GPU	12,701 tokens/second/GPU	7,856 tokens/second/GPU
Llama 3.1 8B	18,370 tokens/second/GPU	16,099 tokens/second/GPU	15,284 tokens/second/GPU
Stable Diffusion XL	4.07 samples/second/GPU	3.59 queries/second/GPU	**
Mixtral 8x7B	16,099 tokens/second/GPU	16,131 tokens/second/GPU	**
DLRMv2 99%	87,228 samples/second/GPU	80,515 samples/second/GPU	**
DLRMv2 99.9%	48,666 samples/second/GPU	46,259 queries/second/GPU	**
Whisper	5,667 tokens/second/GPU	**	**
R-GAT	81,404 samples/second/GPU	**	**
Retinanet	1,875 samples/second/GPU	1,801 queries/second/GPU	**

Table 1. Performance records per GPU based on submissions powered by the NVIDIA platform.

MLPerf Inference v5.0 and v5.1, Closed Division. Results retrieved from www.mlcommons.org on September 9, 2025. NVIDIA platform results from the following entries: 5.0-0072, 5.1-0007, 5.1-0053, 5.1-0079, 5.1-0028, 5.1-0062, 5.1-0086, 5.1-0073, 5.1-0008, 5.1-0070, 5.1-0046, 5.1-0009, 5.1-0060, 5.1-0072, 5.1-0071, 5.1-0069. Per-chip performance derived by dividing total throughput by number of reported chips. Per-chip performance is not a primary metric of MLPerf Inference v5.0 or v5.1. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

NVIDIA also made extensive use of NVFP4 acceleration across all DeepSeek-R1 and Llama model submissions using the Blackwell and Blackwell Ultra architectures.

In this post, we take a closer look at these performance results and the full-stack technologies that enabled them.

Blackwell Ultra sets reasoning records in MLPerf debut

This round, NVIDIA submitted results in the available category using the GB300 NVL72 rack-scale system, the first-ever MLPerf submissions using the Blackwell Ultra architecture. Blackwell Ultra builds upon the many advances in the NVIDIA Blackwell architecture, with several key enhancements:

1.5x higher peak NVFP4 AI compute
2x higher attention-layer compute
1.5x higher HBM3e capacity

Compared to the GB200 NVL72 submission, GB300 NVL72 delivered up to 1.4x higher performance per GPU, setting the standard on the new DeepSeek-R1 benchmark. Compared to unverified results collected on a Hopper-based system, Blackwell Ultra delivered about 5x higher throughput per GPU, translating into significantly higher AI factory throughput and much lower cost per token.

DeepSeek-R1 Performance
Architecture	Offline	Server
Hopper	1,253 tokens/second/GPU	556 tokens/second/GPU
Blackwell Ultra	5,842 tokens/second/GPU	2,907 tokens/second/GPU
Blackwell Ultra Advantage	4.7x	5.2x

Table 2. Per-GPU performance on DeepSeek-R1.

MLPerf Inference v5.1, Closed. Blackwell Ultra results based on results in entry 5.1-0072. Hopper results not verified by MLCommons Association. Per-GPU performance is not a primary metric of MLPerf Inference v5.1 and is calculated by dividing reported throughput by the number of reported accelerators. Verified results retrieved from https://www.mlcommons.org on September 9, 2025. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See https://www.mlcommons.org for more information.
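
The advantage row in Table 2 follows directly from the per-GPU figures. As a quick check, the short Python sketch below recomputes the ratios from the values in the table; the variable names are ours, and rounding to one decimal place is an assumption about how the table was prepared.

```python
# Per-GPU throughput (tokens/second/GPU) copied from Table 2.
hopper = {"offline": 1253, "server": 556}
blackwell_ultra = {"offline": 5842, "server": 2907}

# Ratio of Blackwell Ultra to Hopper throughput for each scenario.
advantage = {
    scenario: round(blackwell_ultra[scenario] / hopper[scenario], 1)
    for scenario in hopper
}
print(advantage)  # {'offline': 4.7, 'server': 5.2}
```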

These results were enabled by the world-class architectural capabilities of Blackwell Ultra and the highly optimized and versatile NVIDIA inference stack. Here are some of the key technologies powering the NVIDIA Blackwell Ultra submissions on DeepSeek-R1:

Extensive use of NVFP4

The base DeepSeek-R1 model incorporates weights stored in FP8 precision. Using a quantization recipe developed by NVIDIA and included as part of the NVIDIA TensorRT Model Optimizer library, the majority of the DeepSeek-R1 weights were successfully quantized to NVFP4, a four-bit floating point format developed by NVIDIA and accelerated by Blackwell and Blackwell Ultra Tensor Cores. This optimization led to reduced model size and the ability to use the higher-throughput NVFP4 compute built into Blackwell and the even higher throughput in Blackwell Ultra, all while meeting the strict target accuracy of the benchmark.
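
To give a feel for what four-bit floating point quantization involves, here is a minimal pure-Python sketch of block-scaled quantization. It is an illustration only and not the TensorRT Model Optimizer recipe used for the submissions. The block size of 16, the E2M1 value grid, and the use of ordinary Python floats for the per-block scale are simplifying assumptions of this sketch, and the function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # elements sharing one scale factor (assumed block size)

def quantize_block(values):
    """Return (scale, codes); each code is a signed value from E2M1_GRID."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    codes = []
    for v in values:
        # Round the scaled magnitude to the nearest representable value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-mag if v < 0 else mag)
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate values from a block's scale and codes."""
    return [scale * c for c in codes]

def fake_quantize(weights):
    """Quantize and immediately dequantize, block by block."""
    out = []
    for i in range(0, len(weights), BLOCK):
        scale, codes = quantize_block(weights[i:i + BLOCK])
        out.extend(dequantize_block(scale, codes))
    return out

# Hypothetical weights, used only to exercise the functions.
weights = [0.01 * (i - 16) for i in range(32)]
restored = fake_quantize(weights)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Because each block carries its own scale, a block of small weights keeps fine resolution even when other blocks contain much larger values, which is one reason block scaling helps preserve accuracy at four bits.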

FP8 key-value cache

In the base DeepSeek-R1 model, the key-value (KV) cache is stored in the BF16 data format. Once again, using both TensorRT Model Optimizer and TensorRT-LLM inference libraries, the KV-cache was quantized to FP8 precision, significantly reducing its memory footprint and enabling higher performance.
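
As a rough sizing sketch, moving from BF16 (2 bytes per element) to FP8 (1 byte per element) halves the bytes needed for the cached values themselves. The token count and per-token element count below are hypothetical placeholders rather than DeepSeek-R1 figures, and the calculation ignores any extra storage for scaling factors.

```python
def kv_cache_bytes(num_tokens, elements_per_token, bytes_per_element):
    """Bytes needed to hold cached key-value data for num_tokens tokens."""
    return num_tokens * elements_per_token * bytes_per_element

TOKENS = 8192                # hypothetical number of cached tokens
ELEMENTS_PER_TOKEN = 40_000  # hypothetical; depends on the model architecture

bf16_bytes = kv_cache_bytes(TOKENS, ELEMENTS_PER_TOKEN, 2)  # BF16: 2 bytes
fp8_bytes = kv_cache_bytes(TOKENS, ELEMENTS_PER_TOKEN, 1)   # FP8: 1 byte
reduction = bf16_bytes / fp8_bytes
```

The memory saved can be spent on larger batches or longer sequences on the same GPU.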

New parallelism techniques

The unique architecture of the DeepSeek-R1 model means that traditional tensor parallel and pipeline parallel techniques used for multi-GPU execution were insufficient for maximum performance. For the NVIDIA DeepSeek-R1 submissions, expert parallelism was used for the MoE portion of model execution, and data parallelism was used for the attention mechanism. This required redesigned MoE and attention kernels, as well as new communication kernels to perform gather and scatter operations.

With this new parallelism technique, balancing the context query workload across all GPUs is critical. The challenge is to maintain both high overall throughput and low first-token latency. We developed Attention Data Parallelism Balance (ADP Balance), a technique that intelligently distributes the context query to optimize for both of these metrics. This ensures every GPU remains productive, preventing bottlenecks and delivering a responsive, high-speed experience for all users. For a detailed technical explanation, please refer to our TensorRT-LLM GitHub page.
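
To illustrate the kind of balancing problem involved, the sketch below assigns context requests to data-parallel attention ranks with a simple greedy rule: the longest prompts go first, each to the GPU with the fewest queued tokens. This is a generic stand-in chosen for illustration and is not the ADP Balance algorithm. The function name and the prompt lengths are hypothetical.

```python
import heapq

def balance_context_requests(prompt_lengths, num_gpus):
    """Greedily assign each request to the GPU with the fewest queued tokens."""
    heap = [(0, gpu) for gpu in range(num_gpus)]  # (queued tokens, gpu id)
    heapq.heapify(heap)
    assignment = {gpu: [] for gpu in range(num_gpus)}
    # Placing the longest prompts first tends to give a tighter balance.
    by_length = sorted(enumerate(prompt_lengths), key=lambda p: -p[1])
    for req_id, length in by_length:
        load, gpu = heapq.heappop(heap)
        assignment[gpu].append(req_id)
        heapq.heappush(heap, (load + length, gpu))
    loads = {
        gpu: sum(prompt_lengths[r] for r in reqs)
        for gpu, reqs in assignment.items()
    }
    return assignment, loads

# Hypothetical prompt lengths (in tokens) for six incoming requests.
assignment, loads = balance_context_requests(
    [9000, 1200, 4000, 3800, 700, 5000], num_gpus=3
)
print(loads)  # {0: 9000, 1: 6900, 2: 7800}
```

A production scheduler also has to account for requests already in flight and for first-token latency, which this sketch leaves out.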

CUDA Graphs

During decode-only iterations of the inference process, NVIDIA submissions use CUDA Graphs to record GPU operations once and replay them with a single CPU operation. This reduces CPU overhead, leading to higher performance.

Disaggregated serving boosts Blackwell performance on Llama 3.1 405B Interactive

The newly added interactive scenario for the Llama 3.1 405B benchmark introduces more stringent TTFT and TPS/user constraints compared to the server scenario, at more than 2x the output token rate and 1.3x faster TTFT. Delivering strong performance on this challenging new benchmark scenario required the application of many state-of-the-art technologies in the NVIDIA Blackwell platform and NVIDIA inference software stack.

For serving very large models like Llama 3.1 405B at interactive token rates, sharding the model across many GPUs brings more aggregate compute to bear, which raises throughput while meeting latency requirements. To support the immense communication needs of large-model, multi-GPU inference, both the NVIDIA Blackwell and Blackwell Ultra platforms support all-to-all communication via NVLink fabric at 1,800 GB/s between 72 GPUs, for a total aggregate bandwidth of 130 TB/s.
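
The aggregate figure follows from the per-GPU NVLink bandwidth and the GPU count. A quick check in Python, assuming decimal units (1 TB/s = 1,000 GB/s):

```python
GPUS = 72            # GPUs in one NVL72 rack
PER_GPU_GB_S = 1800  # NVLink bandwidth per GPU, in GB/s

# 72 x 1,800 GB/s = 129,600 GB/s, which the text rounds to 130 TB/s.
aggregate_tb_s = GPUS * PER_GPU_GB_S / 1000
```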

To meet these requirements while delivering maximum throughput, NVIDIA submissions using the GB200 NVL72 rack-scale system on this benchmark also employed disaggregated serving. This implementation contributed significantly to the nearly 1.5x increase in throughput per GPU compared to traditional aggregated serving using in-flight batching on a DGX B200 system. That's a greater than 5x cumulative improvement compared to in-flight batching results collected on a DGX H200 system.

Figure 1. Blackwell with disaggregated serving delivers more than 5x Hopper performance on Llama 3.1 405B interactive.

Hopper results from 8-GPU HGX H200 submission in entry 5.1-0075. Blackwell baseline from result at entry 5.1-0069 using DGX B200 with 8 GPUs. Blackwell with disaggregated serving using GB200 NVL72 with 72 GPUs from entry 5.1-0071. Performance is per GPU, calculated by dividing total reported throughput by accelerator count. Performance per GPU is not a primary metric of MLPerf Inference. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

Traditional LLM deployments typically co-locate the two main stages of inference, context and generation, on the same GPU or node. However, these phases have fundamentally different characteristics: context is token-parallel and compute-intensive, while generation is autoregressive and latency-sensitive. They also operate under distinct service level agreements (TTFT for context, and intertoken latency (ITL) for generation), which call for different model parallelism strategies. Co-locating them often results in inefficient resource use, particularly for long input sequences.

Disaggregated serving decouples context and generation across separate GPUs or nodes, enabling independent optimization for each phase. This approach allows different parallelism techniques and flexible GPU allocation, improving overall system efficiency.
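
One way to see the value of flexible GPU allocation is a toy capacity model. In the sketch below, a fixed pool of GPUs is split between context workers and generation workers, and end-to-end request throughput is limited by whichever pool saturates first. The per-GPU rates are hypothetical, the function name is ours, and the model ignores KV cache transfer cost and differences in parallelism between the two pools.

```python
def best_split(total_gpus, ctx_rate, gen_rate):
    """Find the context/generation GPU split with the highest throughput.

    ctx_rate, gen_rate: requests/second one GPU sustains in each phase
    (hypothetical values supplied by the caller).
    Returns (context GPUs, generation GPUs, requests/second).
    """
    best = None
    for n_ctx in range(1, total_gpus):
        n_gen = total_gpus - n_ctx
        # The slower pool bounds the end-to-end request rate.
        throughput = min(n_ctx * ctx_rate, n_gen * gen_rate)
        if best is None or throughput > best[2]:
            best = (n_ctx, n_gen, throughput)
    return best

# Hypothetical rates: context is 8x faster per GPU than generation.
n_ctx, n_gen, rps = best_split(total_gpus=72, ctx_rate=4.0, gen_rate=0.5)
print(n_ctx, n_gen, rps)  # 8 64 32.0
```

With co-located serving, every GPU must handle both phases, so the ratio between them cannot be tuned to the workload in this way.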

The NVIDIA Dynamo inference framework also provides support for disaggregated serving. Beyond disaggregated serving, the latest release of Dynamo features many additional capabilities for inference deployments, including SLA-based autoscaling, real-time LLM observability metrics, and fault tolerance. Learn more here.

Key takeaways

NVIDIA continues to demonstrate leading inference performance across a breadth of AI models and scenarios, with outstanding results on both newly added and existing benchmarks. The debut submission of the GB300 NVL72 rack-scale system based on the Blackwell Ultra GPU architecture delivered a large boost for reasoning inference just six months after the first available-category submission of the Blackwell-based GB200 NVL72.

Additionally, the Llama 3.1 405B interactive submission using disaggregated serving demonstrated how state-of-the-art serving techniques can yield significant increases in inference throughput.

To reproduce the great results from this blog, check out the MLPerf Inference v5.1 GitHub repository here.

And to further accelerate inference performance, NVIDIA also unveiled Rubin CPX, a processor purpose-built to accelerate long-context processing. To learn more about Rubin CPX, see this technical blog.

Nvidia Corporation published this content on September 09, 2025, and is solely responsible for the information contained herein. Distributed via Public Technologies (PUBT), unedited and unaltered, on September 09, 2025 at 17:40 UTC. If you believe the information included in the content is inaccurate or outdated and requires editing or removal, please contact us at [email protected]
