
SageMaker Inference speeds generative AI scale-out with container caching


Amazon SageMaker Inference now supports container image caching, which cuts scale-out time for generative AI endpoints by up to half. The feature pre-caches container images for new instances, so they no longer wait on large image pulls from Amazon Elastic Container Registry (Amazon ECR) during scale-out events. It is applied automatically to existing SageMaker Inference configurations and is available in all AWS commercial Regions.

Features (1)
  • Automatic container image caching for faster inference scaling

    SageMaker Inference now pre-caches container images so that instances added during scale-out events already have the image available locally, enabling up to 2x faster scaling for generative AI models. This removes the image-pull step from the cold-start path of each new instance.

Enhancements (1)
  • Seamless integration and broad support

    No configuration changes are needed: SageMaker automatically caches the container image at the image URI you already specify. The feature supports accelerator instance types, single-model endpoints, and inference component-based endpoints.
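
    For orientation, the image URI in question is the one already passed when a model is created. The sketch below is a minimal, hedged illustration using the boto3 `create_model` call; the model name, account ID, repository, and role ARN are hypothetical placeholders, and nothing in it is specific to caching, since the announcement says no new parameter is required.

```python
def build_create_model_request(model_name, image_uri, role_arn, model_data_url=None):
    """Build keyword arguments for the SageMaker create_model call.

    The "Image" field is the ECR image URI that, per the announcement,
    is cached automatically; no caching-specific option is set here.
    """
    container = {"Image": image_uri}
    if model_data_url is not None:
        container["ModelDataUrl"] = model_data_url
    return {
        "ModelName": model_name,
        "PrimaryContainer": container,
        "ExecutionRoleArn": role_arn,
    }


def create_model(request):
    """Send the request to SageMaker (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("sagemaker")
    return client.create_model(**request)


# Hypothetical placeholder values, for illustration only.
request = build_create_model_request(
    model_name="example-llm",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-inference:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)
```

    Calling `create_model(request)` would submit the request; an existing model defined this way should need no edits to pick up the new behavior.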

Notes (1)
  • Comprehensive scaling optimization suite for generative AI

    This launch complements SageMaker Inference's existing scaling optimizations, including sub-minute concurrency metrics and instance-store container caching. Together, these provide a suite of tools for faster and more efficient generative AI inference.
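
    As a rough illustration of how the sub-minute concurrency metrics mentioned above are typically used, the sketch below builds a target-tracking policy for Application Auto Scaling. It assumes the predefined metric type `SageMakerVariantConcurrentRequestsPerModelHighResolution` and a variant already registered as a scalable target; the endpoint name, variant name, and target value are hypothetical.

```python
def build_concurrency_scaling_policy(endpoint_name, variant_name, target_concurrency):
    """Build keyword arguments for Application Auto Scaling put_scaling_policy.

    Assumption: the high-resolution concurrency metric named below is the
    predefined metric for single-model endpoint variants.
    """
    return {
        "PolicyName": f"{endpoint_name}-{variant_name}-concurrency",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_concurrency),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": (
                    "SageMakerVariantConcurrentRequestsPerModelHighResolution"
                ),
            },
        },
    }


def apply_policy(policy):
    """Attach the policy (requires boto3, AWS credentials, and a scalable
    target previously registered with register_scalable_target)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("application-autoscaling")
    return client.put_scaling_policy(**policy)


# Hypothetical names and target value, for illustration only.
policy = build_concurrency_scaling_policy("example-endpoint", "AllTraffic", 5)
```

    With a policy like this, scale-out is triggered sooner by the concurrency metric, and image caching then shortens the time each new instance takes to become ready.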

Read the original announcement →

https://aws.amazon.com/about-aws/whats-new/2026/06/sagemakerai-inf-scale-out-time