AWS What's New · 4h ago

SageMaker Inference speeds generative AI scale-out with container caching

ai aws ga engineer

feature patch

Amazon SageMaker Inference now supports container image caching, which cuts generative AI scale-out time by up to half. The feature pre-caches container images for new instances, so they no longer wait on large image pulls from Amazon ECR during scale-out events. It is applied automatically to existing SageMaker Inference configurations and is available in all AWS commercial Regions.

→Automatic container image caching for faster inference scaling
→Seamless integration and broad support
→Comprehensive scaling optimization suite for generative AI

Features (1) ›

Automatic container image caching for faster inference scaling

SageMaker Inference now pre-caches container images for scale-out events, enabling up to 2x faster scaling for generative AI models. Because new instances already have the container image available locally, the image-pull portion of cold-start latency is removed.
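To see where an "up to 2x" figure can come from, here is a rough back-of-the-envelope sketch. It assumes scale-out time is a simple sum of three sequential phases (instance provisioning, image pull, model load) and that caching reduces the pull phase to zero. The function name and all numbers are hypothetical and illustrative, not measurements from the announcement.

```python
def scale_out_seconds(provision_s, image_pull_s, model_load_s, image_cached):
    """Time for a new instance to start serving, modeled as sequential phases.

    Assumption: a cached image makes the pull phase take no time at all.
    """
    pull = 0.0 if image_cached else image_pull_s
    return provision_s + pull + model_load_s


# Illustrative numbers only: the image pull is half of the total here.
before = scale_out_seconds(60, 180, 120, image_cached=False)
after = scale_out_seconds(60, 180, 120, image_cached=True)
speedup = before / after

print(f"before={before}s after={after}s speedup={speedup}x")
```

Under this simplified model, the speedup reaches 2x only when the image pull accounts for half of the total scale-out time; a smaller image or a longer model load would yield a smaller gain, which is consistent with the "up to" wording.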

Enhancements (1) ›

Seamless integration and broad support

Customers do not need to change anything to benefit: SageMaker Inference automatically caches the container image at the specified image URI. The feature supports accelerator instance types, single-model endpoints, and inference component-based endpoints.
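For orientation, the sketch below shows one place a container image URI is typically specified today: the `PrimaryContainer.Image` field of the SageMaker `CreateModel` request. That this is the URI the feature caches is an assumption; the announcement does not name the field. The helper `build_create_model_request`, the account ID, the repository, and the role are all hypothetical.

```python
def build_create_model_request(model_name: str, image_uri: str, role_arn: str) -> dict:
    """Build keyword arguments for the SageMaker CreateModel API.

    Only builds the request; it does not call AWS.
    """
    return {
        "ModelName": model_name,
        "PrimaryContainer": {"Image": image_uri},
        "ExecutionRoleArn": role_arn,
    }


# Hypothetical account ID, repository, and role, for illustration only.
request = build_create_model_request(
    model_name="example-llm",
    image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/example-llm-serving:latest",
    role_arn="arn:aws:iam::111122223333:role/ExampleSageMakerRole",
)

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("sagemaker").create_model(**request)
print(request["PrimaryContainer"]["Image"])
```

Nothing in this request opts in to caching, which matches the announcement's statement that no customer changes are required.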

Notes (1) ›

Comprehensive scaling optimization suite for generative AI

This launch complements SageMaker Inference's existing scaling optimizations, including sub-minute concurrency metrics and instance-store container caching. Together, these provide a suite of tools for faster and more efficient generative AI inference.

Read the original announcement →

https://aws.amazon.com/about-aws/whats-new/2026/06/sagemakerai-inf-scale-out-time