Training large AI models comes with trade-offs, and one of the most critical is striking the right balance between performance and resilience. Checkpointing is essential for fault tolerance, but the traditional synchronous approach forces training to pause while the model state is saved. For billion-parameter models and up, those pauses can stretch into minutes, slowing down developer iteration and leaving expensive GPUs sitting idle when they should be training.
Asynchronous checkpointing offers a smarter alternative. By decoupling the checkpoint process from the critical training path, it lets checkpointing happen in the background, keeping expensive GPUs busy and training workflows uninterrupted. When paired with the high-throughput, scale-out architecture of Pure Storage® FlashBlade®, checkpoint overhead drops significantly (often by 90% or more) without compromising reliability. It's a practical way to maintain training momentum at scale.
PyTorch Asynchronous Checkpointing
PyTorch's distributed async checkpointing introduces a major shift in how model state is handled. Instead of halting training to write out checkpoints, it enables background saving while computation continues. This not only reduces idle GPU time but also lets each training process write its checkpoint data independently, distributing I/O across nodes and reducing pressure on shared storage systems.
The result is faster training cycles, better resource utilization, and smoother scaling for large workloads. Frequent checkpointing is best practice for fault recovery and experimentation, but traditional methods make it too costly. Async checkpointing changes the equation, letting teams save state as often as they need without breaking training flow.
[Figure: Checkpoint times with asynchronous checkpointing]
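To make that concrete, here is a minimal, self-contained sketch of the save path, assuming PyTorch 2.3 or later where torch.distributed.checkpoint.async_save is available (it is still a maturing API). The toy model, single-process gloo group, and checkpoint path are illustrative placeholders; a real multi-node job would launch under torchrun and typically initialize its process group with both a CPU and a GPU backend (for example, "cpu:gloo,cuda:nccl") so the asynchronous save has a CPU-capable backend to coordinate on.

```python
"""Minimal sketch of PyTorch distributed async checkpointing.

Assumes PyTorch >= 2.3 (torch.distributed.checkpoint.async_save).
The model, optimizer, and path below are illustrative placeholders.
"""
import os

import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp


def main() -> None:
    # Single-process group for illustration; real jobs run under torchrun
    # across many ranks. A CPU-capable backend (gloo) lets the async save
    # coordinate without touching the GPUs.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(1024, 1024)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    state_dict = {"model": model.state_dict(), "optim": optimizer.state_dict()}

    # async_save returns as soon as the state has been staged off the
    # accelerator; background threads persist it to storage afterwards.
    future = dcp.async_save(state_dict, checkpoint_id="/tmp/async_ckpt_step_0")

    # ... training steps would continue here ...

    # Block only when the checkpoint must be durable, e.g. right before
    # issuing the next one or exiting the job.
    future.result()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Waiting on future.result() before issuing the next save keeps at most one checkpoint in flight, which bounds the extra host memory used for staging.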
Key Mechanisms
Asynchronous checkpointing breaks up the traditional, all-at-once save process into two coordinated steps:
GPU-to-CPU transfer: Model state is quickly moved from GPU memory to CPU memory, allowing training to continue without delay.
Asynchronous persistence: Once the data is on the CPU, dedicated threads handle saving it to disk, keeping the GPUs free to focus on training.
Under the hood, PyTorch uses separate process groups to manage checkpointing, so it doesn't interfere with ongoing distributed training tasks.
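As a rough illustration of those two steps (and explicitly not PyTorch's internal implementation), the hand-rolled sketch below stages a CPU copy of the state and then hands the write to a background thread; the helper names and path are invented for this example.

```python
"""Hand-rolled sketch of the two-step pattern described above.

This is not PyTorch's internals; DCP does the same thing with more care
(staging buffers, sharded per-rank writes, a dedicated process group).
"""
import threading

import torch


def stage_to_cpu(state_dict: dict) -> dict:
    # Step 1: the only pause the training loop sees -- copy tensors from
    # GPU memory into CPU memory.
    return {
        key: value.detach().to("cpu", non_blocking=True) if torch.is_tensor(value) else value
        for key, value in state_dict.items()
    }


def persist_in_background(cpu_state: dict, path: str) -> threading.Thread:
    # Step 2: a background thread writes the CPU copy to storage while the
    # GPU goes straight back to training.
    writer = threading.Thread(target=torch.save, args=(cpu_state, path), daemon=True)
    writer.start()
    return writer


if __name__ == "__main__":
    model = torch.nn.Linear(8, 8)  # placeholder model
    cpu_state = stage_to_cpu(model.state_dict())
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure any in-flight device-to-host copies have landed
    writer_thread = persist_in_background(cpu_state, "/tmp/checkpoint.pt")
    # ... training would continue here ...
    writer_thread.join()  # wait only when the checkpoint must be durable
```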
Think of it like a Formula 1 pit stop: Your expensive GPU is the race car, optimized for speed, while the CPU is the pit crew, built for handling quick maintenance. You don't want a $40,000 GPU idling while data is written to disk. This design keeps the car on the track while the crew takes care of business.
In practice, it means AI teams no longer need to choose between performance and resilience. Just like in racing, where speed and maintenance can coexist with the right pit strategy, asynchronous checkpointing lets model training continue while state-saving happens in the background.
Implementation Benefits
Minimal Training Disruption
Training only pauses briefly to transfer model state from GPU to CPU memory. This means AI practitioners can maintain momentum during long training runs without losing valuable GPU cycles, which are especially important for time-sensitive model development or iterative experimentation.
Increased Checkpoint Frequency
Since checkpointing no longer stalls the entire training pipeline, teams can save model state more often. For practitioners, this opens the door to faster iteration, easier experimentation, and better protection against rare but costly training failures, like node crashes or out-of-memory errors.
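For instance, a training loop might checkpoint every few hundred steps while keeping at most one save in flight so staging memory stays bounded. The sketch below makes the same PyTorch 2.3+ assumption as earlier; the interval, directory, loss function, and helper signature are placeholders to adapt.

```python
import torch
import torch.distributed.checkpoint as dcp


def train_with_frequent_checkpoints(model, optimizer, data_iter, num_steps,
                                    ckpt_every=250, ckpt_dir="/mnt/checkpoints/run"):
    """Checkpoint every `ckpt_every` steps without stalling the accelerator."""
    pending = None
    for step in range(1, num_steps + 1):
        inputs, targets = next(data_iter)
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % ckpt_every == 0:
            # Keep at most one checkpoint in flight: finish the previous
            # background write before staging the next one.
            if pending is not None:
                pending.result()
            state = {"model": model.state_dict(), "optim": optimizer.state_dict()}
            pending = dcp.async_save(state, checkpoint_id=f"{ckpt_dir}/step_{step}")

    if pending is not None:
        pending.result()  # make the final checkpoint durable before returning
```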
Improved Fault Tolerance
More frequent checkpoints reduce recovery time if a job fails. For infrastructure leaders, this translates to faster job restarts, fewer lost compute hours, and better service-level predictability across shared clusters. It also reduces the need for overly conservative job scheduling, freeing up capacity for more active workloads.
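Recovery is the flip side: on restart, the job reloads the most recent completed checkpoint and picks up from there. Below is a minimal sketch using DCP's load API to restore model weights; the path is a placeholder, and optimizer state is typically restored through DCP's state-dict helpers (for example, get_state_dict/set_state_dict) rather than shown here.

```python
import torch.distributed.checkpoint as dcp


def resume_model(model, ckpt_path):
    """Restore model weights saved with dcp.save / dcp.async_save.

    dcp.load reads in place: the state_dict passed in defines what to load,
    and its tensors are filled with the values read from storage.
    """
    state = {"model": model.state_dict()}
    dcp.load(state, checkpoint_id=ckpt_path)
    model.load_state_dict(state["model"])
    return model
```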
Better Resource Utilization
GPUs keep working while CPU threads handle disk writes. This ensures maximum return on GPU investment by keeping compute utilization high and avoiding unnecessary I/O contention on shared storage systems. For storage admins and infrastructure VPs, it means less pressure on IOPS, more predictable I/O behavior, and fewer bottlenecks that can affect other users on the system.
Pure Storage FlashBlade: Amplifying Performance
While PyTorch's asynchronous checkpointing significantly reduces training interruptions, storage infrastructure determines how far those gains can go. In high-throughput, multi-node AI environments, Pure Storage FlashBlade is uniquely suited to maximize the value of async checkpointing.
Designed for Fast Metadata and High Throughput
While async checkpointing can reduce training disruption on its own, FlashBlade unlocks its full potential. Its architecture handles the metadata-heavy operations of large-scale training with consistently low latency, even during intense write bursts.
This translates into:
Faster checkpoint completion: Background threads can write model state to disk quickly, often achieving 10 times higher write throughput compared to traditional checkpointing setups.
No backlogs or delays: With low-latency I/O, checkpoints don't pile up or compete with other training operations, keeping the system responsive and training on schedule.
Reliable scheduling: Predictable I/O performance allows teams to plan checkpoint strategies with confidence, without worrying about unexpected slowdowns or stalled training loops.
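One way to lean into that throughput is to give DCP's file-system writer more parallelism per rank, as in the sketch below. The mount path and thread count are assumptions for illustration, and the FileSystemWriter options shown (thread_count, single_file_per_rank) are available in recent PyTorch releases; the right values depend on checkpoint size, rank count, and the network between compute and storage.

```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter

model = torch.nn.Linear(1024, 1024)  # placeholder model
state_dict = {"model": model.state_dict()}

writer = FileSystemWriter(
    "/mnt/flashblade/checkpoints/run_42",  # hypothetical NFS mount served by FlashBlade
    single_file_per_rank=True,             # each rank writes its own file, spreading I/O
    thread_count=8,                        # more concurrent writer threads per rank
)

# The background save streams through the writer; with low-latency, scale-out
# storage the write finishes quickly, so checkpoints never queue up behind
# each other.
future = dcp.async_save(state_dict, storage_writer=writer)
future.result()
```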
Built for Parallelism at Scale
The distributed, scale-out architecture of FlashBlade spreads data across multiple blades, which allows:
Parallel writes without bottlenecks: Multiple nodes can write checkpoints at the same time, avoiding I/O contention.
Consistent performance as you grow: Adding training nodes doesn't overload the storage layer because FlashBlade scales with your compute footprint, maintaining performance under increased demand.
Fast metadata coordination: Quick metadata access supports efficient checkpoint orchestration across large distributed training jobs.
Performance That Scales with Your Needs
Pairing PyTorch's asynchronous checkpointing with Pure Storage FlashBlade removes storage as a bottleneck in the AI training pipeline. Instead of designing around I/O limitations or enduring long pauses to persist model states, teams can now train at full speed with checkpoints happening quietly in the background.
This integration delivers:
Near-continuous GPU utilization, even during frequent checkpoints
Flexible checkpointing strategies, tailored to workload requirements
Infrastructure scaling driven by compute needs, not storage constraints
It's not just about faster I/O; it's about keeping your most valuable assets, like GPUs, working as efficiently as possible. Just like you wouldn't park a race car to rotate its tires mid-race, async checkpointing ensures training stays on track while lightweight systems handle the save.
The combination of PyTorch async checkpointing and FlashBlade represents a shift in how large-scale training infrastructure is designed. By cutting checkpoint overhead by a factor of 10 or more and delivering consistent, low-latency performance at scale, this solution helps teams get more from their GPUs and speed up model development cycles.
For storage administrators and infrastructure leaders, it brings predictable I/O behavior, simplified management, and the confidence to scale training workloads without compromising performance. For AI engineers, it means smoother training runs, faster iteration, and the ability to push larger models into production faster and more reliably.
As AI workloads continue to scale, the partnership between smart software design and high-performance storage becomes essential. With async checkpointing and Pure Storage FlashBlade, storage is no longer a limiting factor; it's a competitive advantage.