
Deploy Distributed LLM Inference with GPUDirect RDMA over InfiniBand in Private AI

At the VMware Explore 2025 keynote, Chris Wolf announced DirectPath enablement for GPUs with VMware Private AI, marking a major step forward in simplifying and scaling enterprise AI infrastructure. By granting VMs exclusive, high-performance access to NVIDIA GPUs, DirectPath allows organizations to fully harness GPU capabilities without added licensing complexity. This makes it easier to experiment, prototype, and move AI projects into production with confidence. In addition, VMware Private AI brings models closer to enterprise data, delivering secure, efficient, and cost-effective deployments. Jointly engineered by Broadcom and NVIDIA, the solution empowers organizations to accelerate innovation while reducing total cost of ownership (TCO).

These advancements come at a critical time. Serving state-of-the-art large language models (LLMs) like DeepSeek-R1, Meta Llama-3.1-405B-Instruct, and Qwen3-235B-A22B-thinking at full context length often exceeds the capacity of a single 8x H100 GPU server, making distributed inference essential. Aggregating resources from multiple GPU-enabled nodes allows these models to run efficiently, though it introduces new challenges in infrastructure management, interconnect optimization, and workload scheduling.

This is where VMware Cloud Foundation (VCF) plays a vital role. VCF is the industry's first private cloud platform to deliver public cloud scale and agility while providing on-premises security, resilience, and performance, all with lower TCO. Leveraging technologies such as NVIDIA NVLink, NVSwitch, and GPUDirect® RDMA, VCF enables high-bandwidth, low-latency communication across nodes. It also ensures that network interconnects like InfiniBand (IB) and RoCEv2 (RDMA over Converged Ethernet) are used effectively, reducing communication overhead that can limit distributed inference performance. With VCF, enterprises can deploy production-grade distributed inference, ensuring even the largest reasoning models run reliably with predictable performance characteristics.

This blog post summarizes our white paper, "Deploy Distributed LLM Inference with GPUDirect RDMA over InfiniBand in VMware Private AI", which provides architectural guidance, detailed deployment steps, and technical best practices for distributed LLM inference across multiple GPU nodes on VCF and NVIDIA HGX servers with GPUDirect RDMA over IB.

Key Highlights and Technical Deep Dives

Leveraging HGX Servers for Maximum Performance

NVIDIA HGX servers play a central role, and their internal topology (PCIe switches, NVIDIA H100/H200 GPUs, and ConnectX-7 IB HCAs) is described in detail. A 1:1 GPU-to-NIC ratio is emphasized as critical for optimal GPUDirect RDMA performance, ensuring each accelerator has a dedicated, high-bandwidth path.
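
As a quick illustration (ours, not a step from the white paper), the sketch below counts the GPUs and RDMA NICs visible inside a Linux guest to confirm that 1:1 pairing. It assumes the NVIDIA and mlx5 (ConnectX) drivers are loaded and relies on standard /proc and /sys paths.

```python
#!/usr/bin/env python3
"""Sanity-check sketch: confirm the guest sees as many RDMA NICs as GPUs.
Assumes a Linux guest with the NVIDIA driver and the mlx5 (ConnectX) driver loaded."""
from pathlib import Path


def count_gpus() -> int:
    # Each GPU registered by the NVIDIA kernel driver appears as a directory here.
    gpus = Path("/proc/driver/nvidia/gpus")
    return len(list(gpus.iterdir())) if gpus.exists() else 0


def count_rdma_nics() -> int:
    # RDMA devices (e.g., mlx5_0 ... mlx5_7 for ConnectX-7) are listed here.
    nics = Path("/sys/class/infiniband")
    return len(list(nics.iterdir())) if nics.exists() else 0


if __name__ == "__main__":
    gpus, nics = count_gpus(), count_rdma_nics()
    print(f"GPUs: {gpus}, RDMA NICs: {nics}")
    if gpus and gpus == nics:
        print("1:1 GPU-to-NIC ratio confirmed")
    else:
        print("WARNING: GPU and NIC counts differ; review the passthrough configuration")
```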

Intra-Node and Inter-Node Communication

NVLink and NVSwitch enable ultra-fast communication within a single HGX node (up to 8 GPUs), while InfiniBand or RoCEv2 provide the high-bandwidth, low-latency interconnects required to scale inference across multiple HGX servers.

GPUDirect RDMA in VCF

Enabling GPUDirect RDMA within VCF requires specific configurations, such as enabling Access Control Services (ACS) in ESXi and Address Translation Services (ATS) on ConnectX-7 NICs. ATS allows direct DMA transactions between PCIe devices, bypassing the Root Complex and restoring near bare-metal performance in virtualized environments.
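
As a rough, optional check (our sketch, not part of the white paper's procedure), the snippet below prints the PCIe ATS capability lines that lspci reports for ConnectX devices inside a guest after ATS has been enabled on the NICs. The exact wording varies across lspci versions, and reading extended capabilities typically requires root.

```python
#!/usr/bin/env python3
"""Sketch: surface the ATS capability lines for Mellanox/NVIDIA ConnectX devices.
Run as root inside the guest; output format depends on the lspci version."""
import subprocess


def connectx_bdfs() -> list[str]:
    # 0x15b3 is the Mellanox/NVIDIA networking vendor ID.
    out = subprocess.run(["lspci", "-D", "-d", "15b3:"],
                         capture_output=True, text=True, check=True)
    return [line.split()[0] for line in out.stdout.splitlines() if line.strip()]


for bdf in connectx_bdfs():
    verbose = subprocess.run(["lspci", "-vvv", "-s", bdf],
                             capture_output=True, text=True, check=True).stdout
    ats = [l.strip() for l in verbose.splitlines()
           if "ATS" in l or "Address Translation Service" in l]
    print(bdf, "->", ats or "no ATS lines found (run as root?)")
```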

Determining Server Requirements

A practical framework is included for calculating the minimum number of HGX servers required for LLM inference. Factors such as num_attention_heads and context length are taken into account, with a reference table showing hardware requirements for popular LLMs (e.g., Llama-3.1-405B, DeepSeek-R1, Llama-4-Series, Kimi-K2, etc.). For instance, both DeepSeek-R1 and Llama-3.1-405B require at least two H100 HGX servers when served at full context length.
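
To make the sizing logic concrete, here is a back-of-the-envelope sketch of the memory-driven part of such a calculation. It is illustrative only: the constants below (layer count, KV heads, head dimension) are assumptions for Llama-3.1-405B, the KV-cache term scales with the grouped-query key/value heads rather than num_attention_heads, and the white paper's framework also covers constraints (such as tensor-parallel divisibility of attention heads) that this sketch omits.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope sizing sketch (illustrative, not the white paper's exact
formula): estimate the minimum number of 8-GPU HGX servers from the model's
weight footprint plus its KV cache at full context length."""
import math


def min_hgx_servers(params_b: float, bytes_per_param: float, num_layers: int,
                    num_kv_heads: int, head_dim: int, context_len: int,
                    concurrent_seqs: int = 1, gpu_mem_gib: int = 80,
                    gpus_per_server: int = 8, mem_utilization: float = 0.9) -> int:
    weights = params_b * 1e9 * bytes_per_param
    # KV cache: 2 tensors (K and V) x layers x KV heads x head_dim x tokens x 2 bytes (FP16)
    kv_cache = 2 * num_layers * num_kv_heads * head_dim * context_len * 2 * concurrent_seqs
    usable_per_gpu = gpu_mem_gib * (1024 ** 3) * mem_utilization  # headroom for activations/runtime
    gpus_needed = math.ceil((weights + kv_cache) / usable_per_gpu)
    return math.ceil(gpus_needed / gpus_per_server)


# Illustrative numbers for Llama-3.1-405B in BF16 at 128K context on H100 (80 GB) nodes:
print(min_hgx_servers(params_b=405, bytes_per_param=2, num_layers=126,
                      num_kv_heads=8, head_dim=128, context_len=131072))  # -> 2 servers
```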

Architecture Overview

The solution architecture is broken down into the VKS Cluster, Supervisor Cluster, and critical Service VMs running the NVIDIA Fabric Manager. It highlights the use of Dynamic DirectPath I/O to ensure GPUs and NICs are directly accessible to workload VKS nodes, while NVSwitches are passed through to Service VMs.

Deployment Workflow and Best Practices

An 8-step deployment workflow is presented, covering:

  1. Hardware and firmware preparation (including BIOS and firmware updates)
  2. ESXi configurations for GPUDirect RDMA enablement
  3. Service VM deployment
  4. VKS cluster setup
  5. Operator installation (NVIDIA Network and GPU Operators)
  6. Storage and model download procedures
  7. LLM deployment using SGLang with LeaderWorkerSet (LWS)
  8. Post-deployment validation (see the smoke-test sketch after this list)
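
As a minimal example of that final validation step (our sketch, not part of the white paper's workflow), the snippet below sends one request to the OpenAI-compatible API that SGLang exposes. The service hostname, port, and model ID are placeholders for your own deployment.

```python
#!/usr/bin/env python3
"""Post-deployment smoke test sketch: send one request to the OpenAI-compatible
API served by SGLang. Hostname, port, and model ID are placeholders."""
import requests

BASE_URL = "http://<sglang-leader-service>:30000"          # placeholder endpoint
payload = {
    "model": "deepseek-ai/DeepSeek-R1-0528",               # model ID used at launch time
    "messages": [{"role": "user", "content": "Reply with one short sentence."}],
    "max_tokens": 64,
}

resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```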

Practical Examples and Configurations

Concrete examples are included, such as:

  • YAML manifests for deploying a VKS cluster with GPU-enabled worker nodes
  • LeaderWorkerSet configuration for running DeepSeek-R1-0528, Llama-3.1-405B-Instruct, and Qwen3-235B-A22B-thinking on two HGX nodes (a readiness-check sketch follows this list)
  • Customized NCCL topology files for maximizing performance in virtualized environments
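
For the LeaderWorkerSet deployments above, a quick readiness check can be scripted with the Kubernetes Python client, as sketched below. The API group/version and status field names follow the upstream LWS API (leaderworkerset.x-k8s.io/v1), and the namespace is a placeholder.

```python
#!/usr/bin/env python3
"""Readiness-check sketch for LeaderWorkerSet deployments via the Kubernetes
Python client. Group/version and status fields assume the upstream LWS API."""
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() inside the cluster
api = client.CustomObjectsApi()

lws_list = api.list_namespaced_custom_object(
    group="leaderworkerset.x-k8s.io", version="v1",
    namespace="default", plural="leaderworkersets",
)
for item in lws_list.get("items", []):
    name = item["metadata"]["name"]
    status = item.get("status", {})
    ready = status.get("readyReplicas", 0)
    desired = status.get("replicas", "?")
    print(f"{name}: {ready}/{desired} replicas ready")
```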

Performance Verification

Steps are provided for verifying RDMA, GPUDirect RDMA, and NCCL performance across multiple nodes. Benchmarking results are included for models such as DeepSeek-R1-0528 and Llama-3.1-405B-Instruct on two HGX nodes, using the GenAI-Perf stress test tool.
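
As a lightweight complement to those checks (a sketch under our own assumptions, not the white paper's benchmark), the script below times a large NCCL all-reduce across both nodes with torch.distributed. Running it with NCCL_DEBUG=INFO shows which transport NCCL selects, helping confirm that inter-node traffic goes over InfiniBand with GPUDirect RDMA.

```python
#!/usr/bin/env python3
"""NCCL all-reduce probe sketch: times a large all-reduce across all GPUs on
both HGX nodes via torch.distributed. Launch one process per GPU, e.g.:
  torchrun --nnodes=2 --nproc_per_node=8 \
           --rdzv_backend=c10d --rdzv_endpoint=<leader-ip>:29500 nccl_probe.py
Set NCCL_DEBUG=INFO to see which transport NCCL selects between nodes."""
import os
import time

import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group(backend="nccl")            # rank/world size come from torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    x = torch.ones(256 * 1024 * 1024, dtype=torch.float16, device="cuda")  # 512 MiB payload
    for _ in range(5):                                  # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    per_iter = (time.time() - start) / iters

    if dist.get_rank() == 0:
        gb = x.numel() * x.element_size() / 1e9
        print(f"all_reduce of {gb:.2f} GB: {per_iter * 1e3:.1f} ms per iteration "
              f"(~{gb / per_iter:.1f} GB/s algorithm bandwidth)")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```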

For a deeper dive into the technical specifics and deployment procedures, we encourage you to read the full white paper: https://www.vmware.com/docs/vcf-distributed-infer

Ready to get started on your AI and ML journey? Check out these helpful resources:

  • Complete this form to contact us!
  • Read the VMware Private AI solution brief.
  • Learn more about VMware Private AI.
  • Connect with us on Twitter at @VMwareVCF and on LinkedIn at VMware VCF.
