May 2, 2025 | 3 minute read
Ejaz Akram
Senior Cloud Solutions Architect
Ricardo Anda
Cloud Solution Architect - Multicloud & Networking
Oracle AI infrastructure provides best-in-class services for any AI workload or application. Compute, network, and storage services work hand in hand as solid building blocks for running the most advanced AI applications. Oracle Cloud Infrastructure Kubernetes Engine (OKE) integrates tightly with this infrastructure, adding the scalability and containerization needed for productivity and manageability, and orchestrating containers seamlessly on top of the AI infrastructure.
In this blog, we cover how Oracle Cloud Infrastructure (OCI) services are stitched together with OCI networking, built from the ground up without compromising on performance or security. We also discuss how OCI provides an optimized AI networking fabric for customers to run large language models (LLMs), generative artificial intelligence (GenAI) applications, physics simulations, and more.
Oracle AI Infrastructure provides one of the highest-performance, lowest-cost graphics processing unit (GPU) cluster technologies in the world: remote direct memory access (RDMA) as part of a lossless, nonblocking network architecture; local non-volatile memory express (NVMe) storage for containerized applications; high-performance, scalable file system storage for model training and inferencing; and powerful bare metal compute underpinned by Peripheral Component Interconnect Express (PCIe) interfaces that drive all components together at scale.
AI Infra Networking
OKE provides full container orchestration for scalability and manageability, connected over the cloud fabric for seamless integration with other cloud services and with the GPU/Kubernetes cluster.
OCI File Storage with Lustre integrates deeply with OKE over the cloud fabric network, giving the file system direct access to bare metal GPU nodes and to the thousands of GPUs in a supercluster.
A Kubernetes node connects directly to the GPUs over the PCIe/NVMe interfaces within the bare metal compute node. At the same time, NVIDIA NVLink provides seamless communication between every GPU within each bare metal node of a cluster at up to 900 GB/sec per GPU.
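To put those intra-node link speeds in perspective, here is a back-of-the-envelope transfer-time comparison. The NVLink figure is from the text; the PCIe Gen5 x16 rate (~64 GB/sec) and the 140 GB payload (roughly a 70B-parameter model at FP16) are illustrative assumptions, not measured OCI numbers.

```python
# Rough transfer-time comparison for moving model weights between GPUs.
# Assumed figures: ~900 GB/s NVLink per GPU (from the text) versus
# ~64 GB/s for a PCIe Gen5 x16 link (illustrative assumption).
NVLINK_GBPS = 900       # GB/s, NVLink aggregate per GPU
PCIE5_X16_GBPS = 64     # GB/s, PCIe Gen5 x16 (assumed)

payload_gb = 140        # ~70B parameters at FP16 (2 bytes each), hypothetical

nvlink_s = payload_gb / NVLINK_GBPS
pcie_s = payload_gb / PCIE5_X16_GBPS

print(f"NVLink: {nvlink_s:.2f} s, PCIe Gen5 x16: {pcie_s:.2f} s")
```

Even as a rough sketch, this shows why GPU-to-GPU traffic inside a node rides NVLink rather than the PCIe bus.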
GPU cluster nodes are knitted together with a high-throughput, low-latency RDMA over Converged Ethernet version 2 (RoCE v2) network. They deliver staggering performance and can scale to meet large AI application demands, from training models to inferencing. OCI AI Infrastructure includes:
Ultrafast and scalable networking
Custom-designed RDMA over Converged Ethernet (RoCE v2) protocol
2.5 to 9.1 microseconds of latency for cluster networking
Up to 3,200 Gb/sec of cluster network bandwidth
Up to 200 Gb/sec of front-end network bandwidth
3-tier Clos topology
Lossless network
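A quick unit sanity check on the cluster-network figure above. The 3,200 Gb/sec per node is taken from the list; the assumption that it is spread across eight GPUs (one 400 Gb/sec RDMA link each) reflects a typical 8-GPU bare metal node layout and is not stated in the text.

```python
# Sanity-check the quoted cluster bandwidth per node.
# Assumption (not stated above): 8 GPUs per bare metal node,
# each with its own RDMA link.
CLUSTER_GBITS = 3200            # Gb/s per node, from the list above
GPUS_PER_NODE = 8               # assumed node layout

per_gpu_gbits = CLUSTER_GBITS / GPUS_PER_NODE   # Gb/s per GPU
per_gpu_gbytes = per_gpu_gbits / 8              # GB/s per GPU

# Time to push a hypothetical 10 GB gradient shard off-node at line rate:
shard_gb = 10
print(f"{per_gpu_gbits:.0f} Gb/s ({per_gpu_gbytes:.0f} GB/s) per GPU; "
      f"{shard_gb / per_gpu_gbytes:.2f} s per {shard_gb} GB shard")
```

Under these assumptions, each GPU gets a 400 Gb/sec (50 GB/sec) path into the RDMA fabric.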
Supercharged compute
Bare metal instances without any hypervisor overhead
Accelerated by NVIDIA Blackwell (GB200 NVL72, HGX B200), Hopper (H200, H100), and previous-generation GPUs
Option to use AMD MI300X GPUs
Data processing unit (DPU) for built-in hardware acceleration
Massive capacity and high-throughput storage
Local storage: up to 61.44 TB of NVMe SSD capacity
File storage: OCI File Storage with Lustre scales up to 20 petabytes (PB)
High sustained performance per terabyte (TB) of provisioned capacity, in four tiers:
125 MBps per provisioned TB
250 MBps per provisioned TB
500 MBps per provisioned TB
1,000 MBps per provisioned TB
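Because Lustre throughput scales with provisioned capacity, aggregate bandwidth at each tier is a simple multiplication. The per-TB rates below come from the tiers above; the 500 TB capacity is a hypothetical example, and 1 GB/sec is treated as 1,000 MBps for simplicity.

```python
# Aggregate Lustre throughput per performance tier for an assumed
# provisioned capacity. Per-TB rates are from the tiers listed above;
# the capacity value is hypothetical.
TIERS_MBPS_PER_TB = [125, 250, 500, 1000]
capacity_tb = 500   # hypothetical provisioned capacity

totals_gbps = {rate: rate * capacity_tb / 1000 for rate in TIERS_MBPS_PER_TB}
for rate, total in totals_gbps.items():
    print(f"{rate} MBps/TB tier: {total:,.1f} GB/s aggregate at {capacity_tb} TB")
```

For example, 500 TB provisioned at the top tier sustains roughly 500 GB/sec in aggregate under this arithmetic.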
Oracle AI infrastructure gives customers superior performance and highly scalable networking. Customers can access large GPU superclusters without compromising on network throughput, latency, or security while building containerized application layers with OKE for scale and simple manageability.
For more information on how to build better architectures with OCI AI infrastructure, see the following links:
OCI AI Infrastructure
Blog: First Principles: Inside Zettascale OCI Superclusters for Next-gen AI
Blog: Generally Available: Fully Managed Lustre File Storage in the Cloud
Blog: Deploying an HPC cluster with RDMA network on OCI OKE and File Storage service mount
Blog: Announcing the General Availability of NVIDIA GPU Device Plugin Add-On for OKE