Introducing Amazon SageMaker HyperPod, a purpose-built infrastructure for distributed training at scale

via aws.amazon.com => original post link

Today, we are introducing Amazon SageMaker HyperPod, which helps reducing time to train foundation models (FMs) by providing a purpose-built infrastructure for distributed training at scale. You can now use SageMaker HyperPod to train FMs for weeks or even months while SageMaker actively monitors the cluster health and provides automated node and job resiliency by replacing faulty nodes and resuming model training from a checkpoint.

The clusters come preconfigured with SageMaker’s distributed training libraries that help you split your training data and model across all the nodes to process them in parallel and fully utilize the cluster’s compute and network infrastructure. You can further customize your training environment by installing additional frameworks, debugging tools, and optimization libraries.