SageMaker Inference speeds generative AI scale-out with container caching
Amazon SageMaker Inference now supports container image caching, reducing generative AI scale-out time by up to half. This feature pre-caches container images for new instances, eliminating the latency associated with pulling large images from ECR during scale-out events. The enhancement is automatically applied to existing SageMaker Inference configurations and is available in all AWS commercial regions.
Features (1)
- Automatic container image caching for faster inference scaling
SageMaker Inference now pre-caches container images for scale-out events, enabling up to 2x faster scaling for generative AI models. New instances already have the container image available locally, which removes the image-pull portion of cold-start latency that would otherwise be spent downloading a large image from ECR.
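To see why the gain is stated as "up to" 2x, it can help to treat scale-out time as a sum of phases. The sketch below is a rough model only: the phase names, the assumption that a cached image costs zero pull time, and all of the numbers are illustrative and are not taken from AWS measurements.

```python
def scale_out_seconds(provision_s, image_pull_s, model_load_s, image_cached):
    """Estimate seconds until a new instance can serve traffic.

    Simplifying assumption: when the image is cached, the pull phase
    contributes no time at all.
    """
    pull = 0.0 if image_cached else image_pull_s
    return provision_s + pull + model_load_s


def speedup(provision_s, image_pull_s, model_load_s):
    """Ratio of uncached to cached scale-out time."""
    before = scale_out_seconds(provision_s, image_pull_s, model_load_s, False)
    after = scale_out_seconds(provision_s, image_pull_s, model_load_s, True)
    return before / after


# Hypothetical phase durations in seconds.
large_image = speedup(provision_s=60, image_pull_s=150, model_load_s=90)
small_image = speedup(provision_s=60, image_pull_s=30, model_load_s=90)
```

Under these made-up numbers the image pull accounts for half of the uncached scale-out time, so removing it doubles scaling speed; with a smaller image the same model gives a smaller gain. The benefit therefore depends on how much of the total time the image pull represents.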
Enhancements (1)
- Seamless integration and broad support
Customers do not need to change anything to benefit from container image caching: the service automatically caches the image at the URI already specified for the model. The feature supports accelerator instance types, single-model endpoints, and inference component-based endpoints.
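For orientation, the image URI in question is the one already supplied when a model is defined. The sketch below builds the arguments for the boto3 `create_model` call; the helper name `build_create_model_request`, the account ID, repository, role, and bucket are all hypothetical placeholders, and nothing here enables or configures caching, since the announcement says no change is required.

```python
def build_create_model_request(model_name, image_uri, role_arn, model_data_url):
    """Return keyword arguments for boto3's SageMaker create_model call.

    Helper name and all example values are hypothetical. The Image field
    is the "specified image URI" that the announcement says is cached.
    """
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": image_uri,
            "ModelDataUrl": model_data_url,
        },
    }


request = build_create_model_request(
    model_name="my-llm",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-llm-server:latest",
    role_arn="arn:aws:iam::123456789012:role/MySageMakerRole",
    model_data_url="s3://my-bucket/my-llm/model.tar.gz",
)

# With credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("sagemaker").create_model(**request)
```

An existing deployment that was created this way would keep the same definition; the sketch only shows where the URI lives.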
Notes (1)
- Comprehensive scaling optimization suite for generative AI
This launch complements SageMaker Inference's existing scaling optimizations, including sub-minute concurrency metrics and instance-store container caching. Together, these provide a suite of tools for faster and more efficient generative AI inference.
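As an illustration of how the sub-minute concurrency metrics might be paired with faster scale-out, the sketch below builds arguments for the Application Auto Scaling `put_scaling_policy` call. The helper name, endpoint name, variant name, target value, and cooldown are hypothetical, and the predefined metric type string is an assumption that should be checked against the current Application Auto Scaling documentation before use.

```python
def build_scaling_policy(endpoint_name, variant_name, target_concurrency):
    """Return keyword arguments for a target-tracking scaling policy.

    Helper name and example values are hypothetical. The metric type
    below is assumed to be the high-resolution concurrency metric.
    """
    return {
        "PolicyName": f"{endpoint_name}-concurrency-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_concurrency),
            "PredefinedMetricSpecification": {
                # Assumed metric type name; verify before relying on it.
                "PredefinedMetricType": (
                    "SageMakerVariantConcurrentRequestsPerModelHighResolution"
                ),
            },
            "ScaleOutCooldown": 60,  # seconds, illustrative value
        },
    }


policy = build_scaling_policy("my-endpoint", "AllTraffic", target_concurrency=5)

# With credentials configured, the policy could be applied like this:
#   import boto3
#   boto3.client("application-autoscaling").put_scaling_policy(**policy)
```

The idea is that a quicker scaling signal and a quicker instance start address different parts of the same delay, so the two optimizations are complementary rather than alternatives.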
https://aws.amazon.com/about-aws/whats-new/2026/06/sagemakerai-inf-scale-out-time