SageMaker HyperPod command-line interface
The Amazon SageMaker HyperPod command-line interface (HyperPod CLI) is a tool that helps manage clusters, training jobs, and inference endpoints on SageMaker HyperPod clusters orchestrated by Amazon EKS.
This documentation serves as a reference for the available HyperPod CLI commands. For a comprehensive user guide, see Orchestrating SageMaker HyperPod clusters with Amazon EKS in the Amazon SageMaker Developer Guide.
Note: The old HyperPod CLI V2 has been moved to the release_v2 branch. Refer to the release_v2 branch for usage.
Overview
The SageMaker HyperPod CLI is a tool that helps create training jobs and inference endpoint deployments on Amazon SageMaker HyperPod clusters orchestrated by Amazon EKS. It provides a set of commands for managing the full lifecycle of jobs, including create, describe, list, and delete operations, as well as accessing pod and operator logs where applicable. The CLI is designed to abstract away the complexity of working directly with Kubernetes for these core actions of managing jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS.
Prerequisites
Region Configuration
Important: For commands that accept the --region option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.
Prerequisites for Training
- The HyperPod CLI currently supports starting PyTorchJobs. To start a job, you need to install the Training Operator first; follow the PyTorch training operator documentation to install it.
Prerequisites for Inference
- The HyperPod CLI supports creating inference endpoints from SageMaker JumpStart models and from custom endpoint configurations. Install the inference operator first; follow the inference operator documentation to install it.
Platform Support
The SageMaker HyperPod CLI currently supports the Linux and macOS platforms. Windows is not supported at this time.
ML Framework Support
The SageMaker HyperPod CLI currently supports starting training jobs with:
- The PyTorch ML framework. Version requirements: PyTorch >= 1.10
Installation
- Make sure that your local Python version is 3.8, 3.9, 3.10, or 3.11.

- Install the sagemaker-hyperpod-cli package:

```
pip install sagemaker-hyperpod
```

- Verify that the installation succeeded by running the following command.
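```
hyp --help
```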
Usage
The HyperPod CLI provides the following commands:
Getting Started
Getting Cluster information
This command lists the available SageMaker HyperPod clusters and their capacity information.
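```
hyp list-cluster
```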
| Option | Type | Description |
|---|---|---|
| `--region <region>` | Optional | The region where the SageMaker HyperPod and EKS clusters are located. If not specified, it is set to the region from the current AWS account credentials. |
| `--namespace <namespace>` | Optional | The namespace to check the quota against. Only SageMaker-managed namespaces are supported. |
| `--output <json\|table>` | Optional | The output format. Available values are `table` and `json`. The default value is `json`. |
| `--debug` | Optional | Enable debug mode for detailed logging. |
Connecting to a Cluster
This command configures the local Kubectl environment to interact with the specified SageMaker HyperPod cluster and namespace.
```
hyp set-cluster-context --cluster-name <cluster-name>
```
| Option | Type | Description |
|---|---|---|
| `--cluster-name <cluster-name>` | Required | The SageMaker HyperPod cluster name to configure with. |
| `--namespace <namespace>` | Optional | The namespace that you want to connect to. If not specified, HyperPod CLI commands will auto-discover the accessible namespace. |
| `--region <region>` | Optional | The AWS region where the HyperPod cluster resides. |
| `--debug` | Optional | Enable debug mode for detailed logging. |
Getting Cluster Context
This command gets all the context related to the currently set cluster.
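```
hyp get-cluster-context
```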
| Option | Type | Description |
|---|---|---|
| `--debug` | Optional | Enable debug mode for detailed logging. |
Cluster Management
Important: For commands that accept the --region option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.
Cluster stack names must be unique within each AWS region. If you attempt to create a cluster stack with a name that already exists in the same region, the deployment will fail.
Initialize Cluster Configuration
Initialize a new cluster configuration in the current directory:
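```
hyp init cluster-stack
```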
Important: The `resource_name_prefix` parameter in the generated `config.yaml` file serves as the primary identifier for all AWS resources created during deployment. Each deployment must use a unique resource name prefix to avoid conflicts. This prefix is automatically appended with a unique identifier during cluster creation to ensure resource uniqueness.
Configure Cluster Parameters
Configure cluster parameters interactively or via the command line:
```
hyp configure --resource-name-prefix my-cluster --stage prod
```
Validate Configuration
Validate the configuration file syntax:
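```
hyp validate
```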
Create Cluster Stack
Create the cluster stack using the configured parameters:
```
hyp create --region <region>
```
Note: The region flag is optional. If not provided, the command will use the default region from your AWS credentials configuration.
List Cluster Stacks
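```
hyp list cluster-stack
```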
| Option | Type | Description |
|---|---|---|
| `--region <region>` | Optional | The AWS region to list stacks from. |
| `--status "['CREATE_COMPLETE', 'UPDATE_COMPLETE']"` | Optional | Filter by stack status. |
| `--debug` | Optional | Enable debug mode for detailed logging. |
Describe Cluster Stack
```
hyp describe cluster-stack <stack-name>
```
| Option | Type | Description |
|---|---|---|
| `--region <region>` | Optional | The AWS region where the stack exists. |
| `--debug` | Optional | Enable debug mode for detailed logging. |
Delete Cluster Stack
Delete a HyperPod cluster stack. Removes the specified CloudFormation stack and all associated AWS resources. This operation cannot be undone.
```
hyp delete cluster-stack <stack-name>
```
| Option | Type | Description |
|---|---|---|
| `--region <region>` | Required | The AWS region where the stack exists. |
| `--retain-resources S3Bucket-TrainingData,EFSFileSystem-Models` | Optional | Comma-separated list of logical resource IDs to retain during deletion (only works on DELETE_FAILED stacks). Resource names are shown in failed deletion output, or use the AWS CLI: `aws cloudformation list-stack-resources STACK_NAME --region REGION`. |
| `--debug` | Optional | Enable debug mode for detailed logging. |
Update Existing Cluster
```
hyp update cluster --cluster-name my-cluster \
    --instance-groups '[{"InstanceCount":2,"InstanceGroupName":"worker-nodes","InstanceType":"ml.m5.large"}]' \
    --node-recovery Automatic
```
Reset Configuration
Reset configuration to default values:
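```
hyp reset
```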
Training
Option 1: Create a PyTorch job through the init experience
Initialize PyTorch Job Configuration
Initialize a new PyTorch job configuration in the current directory:
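```
hyp init hyp-pytorch-job
```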
Configure PyTorch Job Parameters
Configure PyTorch job parameters interactively or via the command line:
```
hyp configure --job-name my-pytorch-job
```
Validate Configuration
Validate the configuration file syntax:
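```
hyp validate
```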
Create PyTorch Job
Create the PyTorch job using the configured parameters:
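```
hyp create
```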
Option 2: Create a PyTorch job through the create command
```
hyp create hyp-pytorch-job \
    --version 1.0 \
    --job-name test-pytorch-job \
    --image pytorch/pytorch:latest \
    --command '[python, train.py]' \
    --args '[--epochs=10, --batch-size=32]' \
    --environment '{"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}' \
    --pull-policy "IfNotPresent" \
    --instance-type ml.p4d.24xlarge \
    --tasks-per-node 8 \
    --label-selector '{"accelerator": "nvidia", "network": "efa"}' \
    --deep-health-check-passed-nodes-only true \
    --scheduler-type "kueue" \
    --queue-name "training-queue" \
    --priority "high" \
    --max-retry 3 \
    --accelerators 8 \
    --vcpu 96.0 \
    --memory 1152.0 \
    --accelerators-limit 8 \
    --vcpu-limit 96.0 \
    --memory-limit 1152.0 \
    --preferred-topology "topology.kubernetes.io/zone=us-west-2a" \
    --volume name=model-data,type=hostPath,mount_path=/data,path=/data \
    --volume name=training-output,type=pvc,mount_path=/data2,claim_name=my-pvc,read_only=false
```
Example with accelerator partitions:
```
hyp create hyp-pytorch-job \
    --version 1.1 \
    --job-name test-pytorch-job \
    --image pytorch/pytorch:latest \
    --command '[python, train.py]' \
    --args '[--epochs=10, --batch-size=32]' \
    --environment '{"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}' \
    --pull-policy "IfNotPresent" \
    --instance-type ml.p4d.24xlarge \
    --tasks-per-node 8 \
    --label-selector '{"accelerator": "nvidia", "network": "efa"}' \
    --deep-health-check-passed-nodes-only true \
    --scheduler-type "kueue" \
    --queue-name "training-queue" \
    --priority "high" \
    --max-retry 3 \
    --accelerator-partition-type "mig-1g.5gb" \
    --accelerator-partition-count 2 \
    --accelerator-partition-limit 4 \
    --vcpu 96.0 \
    --memory 1152.0 \
    --vcpu-limit 96.0 \
    --memory-limit 1152.0 \
    --preferred-topology "topology.kubernetes.io/zone=us-west-2a" \
    --volume name=model-data,type=hostPath,mount_path=/data,path=/data \
    --volume name=training-output,type=pvc,mount_path=/data2,claim_name=my-pvc,read_only=false
```
| Parameter | Type | Required | Description |
|---|---|---|---|
| `--job-name` | TEXT | Yes | Unique name for the training job (1-63 characters, alphanumeric with hyphens) |
| `--image` | TEXT | Yes | Docker image URI containing your training code |
| `--namespace` | TEXT | No | Kubernetes namespace |
| `--command` | ARRAY | No | Command to run in the container (array of strings) |
| `--args` | ARRAY | No | Arguments for the entry script (array of strings) |
| `--environment` | OBJECT | No | Environment variables as key-value pairs |
| `--pull-policy` | TEXT | No | Image pull policy (Always, Never, IfNotPresent) |
| `--instance-type` | TEXT | No | Instance type for training |
| `--node-count` | INTEGER | No | Number of nodes (minimum: 1) |
| `--tasks-per-node` | INTEGER | No | Number of tasks per node (minimum: 1) |
| `--label-selector` | OBJECT | No | Node label selector as key-value pairs |
| `--deep-health-check-passed-nodes-only` | BOOLEAN | No | Schedule pods only on nodes that passed deep health checks (default: false) |
| `--scheduler-type` | TEXT | No | Scheduler type |
| `--queue-name` | TEXT | No | Queue name for job scheduling (1-63 characters, alphanumeric with hyphens) |
| `--priority` | TEXT | No | Priority class for job scheduling |
| `--max-retry` | INTEGER | No | Maximum number of job retries (minimum: 0) |
| `--volume` | ARRAY | No | List of volume configurations (see Volume Configuration for detailed parameter info) |
| `--service-account-name` | TEXT | No | Service account name |
| `--accelerators` | INTEGER | No | Number of accelerators (GPUs or Trainium chips) |
| `--vcpu` | FLOAT | No | Number of vCPUs |
| `--memory` | FLOAT | No | Amount of memory in GiB |
| `--accelerators-limit` | INTEGER | No | Limit on the number of accelerators (GPUs or Trainium chips) |
| `--vcpu-limit` | FLOAT | No | Limit on the number of vCPUs |
| `--memory-limit` | FLOAT | No | Limit on the amount of memory in GiB |
| `--accelerator-partition-type` | TEXT | No | Type of accelerator partition (e.g., mig-1g.5gb, mig-2g.10gb, mig-3g.20gb, mig-4g.20gb, mig-7g.40gb) |
| `--accelerator-partition-count` | INTEGER | No | Number of accelerator partitions to request (minimum: 1) |
| `--accelerator-partition-limit` | INTEGER | No | Limit on the number of accelerator partitions (minimum: 1) |
| `--preferred-topology` | TEXT | No | Preferred topology annotation for scheduling |
| `--required-topology` | TEXT | No | Required topology annotation for scheduling |
| `--max-node-count` | INTEGER | No | Maximum number of nodes |
| `--elastic-replica-increment-step` | INTEGER | No | Scaling step size for elastic training. Provide either this or `--elastic-replica-discrete-values` |
| `--elastic-graceful-shutdown-timeout-in-seconds` | INTEGER | No | Graceful shutdown timeout in seconds for elastic scaling operations |
| `--elastic-scaling-timeout-in-seconds` | INTEGER | No | Scaling timeout for elastic training |
| `--elastic-scale-up-snooze-time-in-seconds` | INTEGER | No | Period after a job restart during which no scale-up/workload admission is allowed |
| `--elastic-replica-discrete-values` | ARRAY | No | Alternative to `--elastic-replica-increment-step`; provides exact values for the total replica count (array of integers) |
| `--debug` | FLAG | No | Enable debug mode (default: false) |
List Available Accelerator Partition Types
This command lists the available accelerator partition types on the cluster for a specific instance type.
```
hyp list-accelerator-partition-type --instance-type <instance-type>
```
List Training Jobs
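```
hyp list hyp-pytorch-job
```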
Describe a Training Job
```
hyp describe hyp-pytorch-job --job-name <job-name>
```
Listing Pods
This command lists all the pods associated with a specific training job.
```
hyp list-pods hyp-pytorch-job --job-name <job-name>
```
`--job-name` (string) - Required. The name of the job to list pods for.
Accessing Logs
This command retrieves the logs for a specific pod within a training job.
```
hyp get-logs hyp-pytorch-job --pod-name <pod-name> --job-name <job-name>
```
| Parameter | Required | Description |
|---|---|---|
| `--job-name` | Yes | The name of the job to get the log for. |
| `--pod-name` | Yes | The name of the pod to get the log from. |
| `--namespace` | No | The namespace of the job. Defaults to `default`. |
| `--container` | No | The container name to get logs from. |
Get Operator Logs
```
hyp get-operator-logs hyp-pytorch-job --since-hours 0.5
```
Delete a Training Job
```
hyp delete hyp-pytorch-job --job-name <job-name>
```
Inference
JumpStart Endpoint Creation
Option 1: Create a JumpStart endpoint through the init experience
Initialize JumpStart Endpoint Configuration
Initialize a new JumpStart endpoint configuration in the current directory:
```
hyp init hyp-jumpstart-endpoint
```
Configure JumpStart Endpoint Parameters
Configure JumpStart endpoint parameters interactively or via the command line:
```
hyp configure --endpoint-name my-jumpstart-endpoint
```
Validate Configuration
Validate the configuration file syntax:
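```
hyp validate
```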
Create JumpStart Endpoint
Create the JumpStart endpoint using the configured parameters:
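```
hyp create
```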
Option 2: Create a JumpStart endpoint through the create command
Pre-trained JumpStart models are listed at https://sagemaker.readthedocs.io/en/v2.82.0/doc_utils/jumpstart.html; a model ID from that list can be passed to the endpoint creation call.
```
hyp create hyp-jumpstart-endpoint \
    --version 1.0 \
    --model-id jumpstart-model-id \
    --instance-type ml.g5.8xlarge \
    --endpoint-name endpoint-jumpstart
```
| Parameter | Type | Required | Description |
|---|---|---|---|
| `--model-id` | TEXT | Yes | JumpStart model identifier (1-63 characters, alphanumeric with hyphens) |
| `--instance-type` | TEXT | Yes | EC2 instance type for inference (must start with "ml.") |
| `--namespace` | TEXT | No | Kubernetes namespace |
| `--metadata-name` | TEXT | No | Name of the JumpStart endpoint object |
| `--accept-eula` | BOOLEAN | No | Whether model terms of use have been accepted (default: false) |
| `--model-version` | TEXT | No | Semantic version of the model (e.g., "1.0.0", 5-14 characters) |
| `--endpoint-name` | TEXT | No | Name of the SageMaker endpoint (1-63 characters, alphanumeric with hyphens) |
| `--tls-certificate-output-s3-uri` | TEXT | No | S3 URI to write the TLS certificate to (optional) |
| `--debug` | FLAG | No | Enable debug mode (default: false) |
Invoke a JumpStart Model Endpoint
```
hyp invoke hyp-jumpstart-endpoint \
    --endpoint-name endpoint-jumpstart \
    --body '{"inputs":"What is the capital of USA?"}'
```
Managing an Endpoint
```
hyp list hyp-jumpstart-endpoint
hyp describe hyp-jumpstart-endpoint --name endpoint-jumpstart
```
List Pods
```
hyp list-pods hyp-jumpstart-endpoint
```
Get Logs
```
hyp get-logs hyp-jumpstart-endpoint --pod-name <pod-name>
```
Get Operator Logs
```
hyp get-operator-logs hyp-jumpstart-endpoint --since-hours 0.5
```
Deleting an Endpoint
```
hyp delete hyp-jumpstart-endpoint --name endpoint-jumpstart
```
Custom Endpoint Creation
Option 1: Create a custom endpoint through the init experience
Initialize Custom Endpoint Configuration
Initialize a new custom endpoint configuration in the current directory:
```
hyp init hyp-custom-endpoint
```
Configure Custom Endpoint Parameters
Configure custom endpoint parameters interactively or via the command line:
```
hyp configure --endpoint-name my-custom-endpoint
```
Validate Configuration
Validate the configuration file syntax:
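```
hyp validate
```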
Create Custom Endpoint
Create the custom endpoint using the configured parameters:
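```
hyp create
```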
Option 2: Create a custom endpoint through the create command
```
hyp create hyp-custom-endpoint \
    --version 1.0 \
    --endpoint-name endpoint-custom \
    --model-name my-pytorch-model \
    --model-source-type s3 \
    --model-location my-pytorch-training \
    --model-volume-mount-name test-volume \
    --s3-bucket-name your-bucket \
    --s3-region us-east-1 \
    --instance-type ml.g5.8xlarge \
    --image-uri 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:latest \
    --container-port 8080
```
| Parameter | Type | Required | Description |
|---|---|---|---|
| `--instance-type` | TEXT | Yes | EC2 instance type for inference (must start with "ml.") |
| `--model-name` | TEXT | Yes | Name of the model to create on SageMaker (1-63 characters, alphanumeric with hyphens) |
| `--model-source-type` | TEXT | Yes | Model source type ("s3" or "fsx") |
| `--image-uri` | TEXT | Yes | Docker image URI for inference |
| `--container-port` | INTEGER | Yes | Port on which the model server listens (1-65535) |
| `--model-volume-mount-name` | TEXT | Yes | Name of the model volume mount |
| `--namespace` | TEXT | No | Kubernetes namespace |
| `--metadata-name` | TEXT | No | Name of the custom endpoint object |
| `--endpoint-name` | TEXT | No | Name of the SageMaker endpoint (1-63 characters, alphanumeric with hyphens) |
| `--env` | OBJECT | No | Environment variables as key-value pairs |
| `--metrics-enabled` | BOOLEAN | No | Enable metrics collection (default: false) |
| `--model-version` | TEXT | No | Version of the model (semantic version format) |
| `--model-location` | TEXT | No | Specific model data location |
| `--prefetch-enabled` | BOOLEAN | No | Whether to pre-fetch model data (default: false) |
| `--tls-certificate-output-s3-uri` | TEXT | No | S3 URI for TLS certificate output |
| `--fsx-dns-name` | TEXT | No | FSx file system DNS name |
| `--fsx-file-system-id` | TEXT | No | FSx file system ID |
| `--fsx-mount-name` | TEXT | No | FSx file system mount name |
| `--s3-bucket-name` | TEXT | No | S3 bucket location |
| `--s3-region` | TEXT | No | S3 bucket region |
| `--model-volume-mount-path` | TEXT | No | Path inside the container for the model volume (default: "/opt/ml/model") |
| `--resources-limits` | OBJECT | No | Resource limits for the worker |
| `--resources-requests` | OBJECT | No | Resource requests for the worker |
| `--dimensions` | OBJECT | No | CloudWatch metric dimensions as key-value pairs |
| `--metric-collection-period` | INTEGER | No | Period for the CloudWatch query (default: 300) |
| `--metric-collection-start-time` | INTEGER | No | StartTime for the CloudWatch query (default: 300) |
| `--metric-name` | TEXT | No | Metric name to query for the CloudWatch trigger |
| `--metric-stat` | TEXT | No | Statistic for the CloudWatch metric (default: "Average") |
| `--metric-type` | TEXT | No | Type of metric for HPA ("Value" or "Average", default: "Average") |
| `--min-value` | NUMBER | No | Minimum metric value used for an empty CloudWatch response (default: 0) |
| `--cloud-watch-trigger-name` | TEXT | No | Name for the CloudWatch trigger |
| `--cloud-watch-trigger-namespace` | TEXT | No | AWS CloudWatch namespace for the metric |
| `--target-value` | NUMBER | No | Target value for the CloudWatch metric |
| `--use-cached-metrics` | BOOLEAN | No | Enable caching of metric values (default: true) |
| `--invocation-endpoint` | TEXT | No | Invocation endpoint path (default: "invocations") |
| `--debug` | FLAG | No | Enable debug mode (default: false) |
Invoke a Custom Inference Endpoint
```
hyp invoke hyp-custom-endpoint \
    --endpoint-name endpoint-custom-pytorch \
    --body '{"inputs":"What is the capital of USA?"}'
```
Managing an Endpoint
```
hyp list hyp-custom-endpoint
hyp describe hyp-custom-endpoint --name endpoint-custom
```
List Pods
```
hyp list-pods hyp-custom-endpoint
```
Get Logs
```
hyp get-logs hyp-custom-endpoint --pod-name <pod-name>
```
Get Operator Logs
```
hyp get-operator-logs hyp-custom-endpoint --since-hours 0.5
```
Deleting an Endpoint
```
hyp delete hyp-custom-endpoint --name endpoint-custom
```
Space
Create a Space
```
hyp create hyp-space \
    --name myspace \
    --namespace default \
    --display-name "My Space"
```
| Parameter | Type | Required | Description |
|---|---|---|---|
| `--name` | TEXT | Yes | Space name |
| `--display-name` | TEXT | Yes | Display name of the space |
| `--namespace` | TEXT | No | Kubernetes namespace |
| `--image` | TEXT | No | Container image to use |
| `--desired-status` | TEXT | No | Desired operational status |
| `--ownership-type` | TEXT | No | Who can modify the space: 'Public' means anyone with RBAC permissions can update/delete the space; 'OwnerOnly' means only the creator can update/delete the space. |
| `--node-selector` | TEXT | No | Node selection constraints for the space pod (JSON string) |
| `--affinity` | TEXT | No | Node affinity and anti-affinity rules for the space pod (JSON string) |
| `--tolerations` | TEXT | No | Tolerations for the space pod to schedule on nodes with matching taints (JSON string) |
| `--lifecycle` | TEXT | No | Actions that the management system should take in response to container lifecycle events (JSON string) |
| `--app-type` | TEXT | No | Application type for this workspace |
| `--service-account-name` | TEXT | No | Name of the ServiceAccount to use for the workspace pod |
| `--idle-shutdown` | TEXT | No | Idle shutdown configuration. Format: `--idle-shutdown enabled=<...>,idleTimeoutInMinutes=<...>,detection=<...>` |
| `--template-ref` | TEXT | No | References a WorkspaceTemplate to use as base configuration. Format: `--template-ref name=<...>,namespace=<...>` |
| `--container-config` | TEXT | No | Container configuration. Format: `--container-config command=<...>,args=<arg1;arg2>` |
| `--storage` | TEXT | No | Storage configuration. Format: `--storage storageClassName=<...>,size=<...>,mountPath=<...>` |
| `--volume` | TEXT | No | Volume configuration. Format: `--volume name=<...>,mountPath=<...>,persistentVolumeClaimName=<pvc_name>`. Use multiple `--volume` flags for multiple volumes. |
| `--accelerator-partition-count` | TEXT | No | Fractional GPU partition count, e.g. '1' |
| `--accelerator-partition-type` | TEXT | No | Fractional GPU partition type, e.g. 'mig-3g.20gb' |
| `--gpu-limit` | TEXT | No | GPU resource limit, e.g. '1' |
| `--gpu` | TEXT | No | GPU resource request, e.g. '1' |
| `--memory-limit` | TEXT | No | Memory resource limit, e.g. '2Gi' |
| `--memory` | TEXT | No | Memory resource request, e.g. '2Gi' |
| `--cpu-limit` | TEXT | No | CPU resource limit, e.g. '500m' |
| `--cpu` | TEXT | No | CPU resource request, e.g. '500m' |
List Spaces
```
# List spaces in the default namespace
hyp list hyp-space

# List spaces in a specific namespace
hyp list hyp-space --namespace my-namespace

# List spaces across all namespaces
hyp list hyp-space --all-namespaces

# List spaces with JSON output
hyp list hyp-space --output json
```
Describe a Space
```
hyp describe hyp-space --name myspace
```
Update a Space
```
hyp update hyp-space \
    --name myspace \
    --display-name "Updated Space Name"
```
Start/Stop a Space
```
hyp start hyp-space --name myspace
hyp stop hyp-space --name myspace
```
Get Logs
```
hyp get-logs hyp-space --name myspace
```
Delete a Space
```
hyp delete hyp-space --name myspace
```
Port Forward to a Space
Port forward to access a space from your local machine:
```
# Port forward with the default port (8888)
hyp portforward hyp-space --name myspace

# Port forward with a custom local port
hyp portforward hyp-space --name myspace --local-port 8080
```
Access the space via `http://localhost:<local-port>` after port forwarding is established. Press `Ctrl+C` to stop port forwarding.
Space Template Management
Create reusable space templates:
```
hyp create hyp-space-template --file template.yaml
hyp list hyp-space-template --all-namespaces
hyp describe hyp-space-template --name <template-name>
hyp update hyp-space-template --name <template-name> --file updated-template.yaml
hyp delete hyp-space-template --name <template-name>
```
Space Access
Create remote access to spaces:
```
hyp create hyp-space-access --name myspace --connection-type vscode-remote
hyp create hyp-space-access --name myspace --connection-type web-ui
```
SDK
Along with the CLI, SDKs are also available that provide the same cluster management, training, and inference functionality as the CLI.
Cluster Management SDK
Creating a Cluster Stack
```python
from sagemaker.hyperpod.cluster_management.hp_cluster_stack import HpClusterStack

# Initialize cluster stack configuration
cluster_stack = HpClusterStack(
    stage="prod",
    resource_name_prefix="my-hyperpod",
    hyperpod_cluster_name="my-hyperpod-cluster",
    eks_cluster_name="my-hyperpod-eks",

    # Infrastructure components
    create_vpc_stack=True,
    create_eks_cluster_stack=True,
    create_hyperpod_cluster_stack=True,

    # Network configuration
    vpc_cidr="10.192.0.0/16",
    availability_zone_ids=["use2-az1", "use2-az2"],

    # Instance group configuration
    instance_group_settings=[
        {
            "InstanceCount": 1,
            "InstanceGroupName": "controller-group",
            "InstanceType": "ml.t3.medium",
            "TargetAvailabilityZoneId": "use2-az2"
        }
    ]
)

# Create the cluster stack
response = cluster_stack.create(region="us-east-2")
```
Listing Cluster Stacks
```python
# List all cluster stacks
stacks = HpClusterStack.list(region="us-east-2")
print(f"Found {len(stacks['StackSummaries'])} stacks")
```
Describing a Cluster Stack
```python
# Describe a specific cluster stack
stack_info = HpClusterStack.describe("my-stack-name", region="us-east-2")
print(f"Stack status: {stack_info['Stacks'][0]['StackStatus']}")
```
Monitoring Cluster Status
```python
from sagemaker.hyperpod.cluster_management.hp_cluster_stack import HpClusterStack

stack = HpClusterStack()
response = stack.create(region="us-west-2")

status = stack.get_status(region="us-west-2")
print(status)
```
Deleting a Cluster Stack
```python
# Delete with a custom logger
import logging

logger = logging.getLogger(__name__)
HpClusterStack.delete("my-stack-name", region="us-west-2", logger=logger)

# Delete with retained resources (only works on DELETE_FAILED stacks)
HpClusterStack.delete("my-stack-name", retain_resources=["S3Bucket", "EFSFileSystem"])
```
Training SDK
Creating a Training Job
```python
from sagemaker.hyperpod.training.hyperpod_pytorch_job import HyperPodPytorchJob
from sagemaker.hyperpod.training.config.hyperpod_pytorch_job_unified_config import (
    ReplicaSpec, Template, Spec, Containers, Resources, RunPolicy
)
from sagemaker.hyperpod.common.config.metadata import Metadata

# Define job specifications
nproc_per_node = "1"  # Number of processes per node

replica_specs = [
    ReplicaSpec(
        name="pod",  # Replica name
        template=Template(
            spec=Spec(
                containers=[
                    Containers(
                        # Container name
                        name="container-name",
                        # Training image
                        image="123456789012.dkr.ecr.us-west-2.amazonaws.com/my-training-image:latest",
                        # Always pull image
                        image_pull_policy="Always",
                        resources=Resources(
                            # No GPUs requested
                            requests={"nvidia.com/gpu": "0"},
                            # No GPU limit
                            limits={"nvidia.com/gpu": "0"},
                        ),
                        # Command to run
                        command=["python", "train.py"],
                        # Script arguments
                        args=["--epochs", "10", "--batch-size", "32"],
                    )
                ]
            )
        ),
    )
]

# Keep pods after completion
run_policy = RunPolicy(clean_pod_policy="None")

# Create and start the PyTorch job
pytorch_job = HyperPodPytorchJob(
    # Job name
    metadata=Metadata(name="demo"),
    # Processes per node
    nproc_per_node=nproc_per_node,
    # Replica specifications
    replica_specs=replica_specs,
    # Run policy
    run_policy=run_policy,
)

# Launch the job
pytorch_job.create()
```
List Training Jobs
```python
from sagemaker.hyperpod.training import HyperPodPytorchJob
import yaml

# List all PyTorch jobs
jobs = HyperPodPytorchJob.list()
print(yaml.dump(jobs))
```
Describe a Training Job
```python
from sagemaker.hyperpod.training import HyperPodPytorchJob

# Get an existing job
job = HyperPodPytorchJob.get(name="my-pytorch-job")
print(job)
```
List Pods for a Training Job
```python
from sagemaker.hyperpod.training import HyperPodPytorchJob

# List pods for an existing job
job = HyperPodPytorchJob.get(name="my-pytorch-job")
print(job.list_pods())
```
Get Logs from a Pod
```python
from sagemaker.hyperpod.training import HyperPodPytorchJob

# Get pod logs for a job
job = HyperPodPytorchJob.get(name="my-pytorch-job")
print(job.get_logs_from_pod("pod-name"))
```
Get Training Operator Logs
```python
from sagemaker.hyperpod.training import HyperPodPytorchJob

# Get training operator logs
job = HyperPodPytorchJob.get(name="my-pytorch-job")
print(job.get_operator_logs(since_hours=0.1))
```
Delete a Training Job
```python
from sagemaker.hyperpod.training import HyperPodPytorchJob

# Get an existing job
job = HyperPodPytorchJob.get(name="my-pytorch-job")

# Delete the job
job.delete()
```
Inference SDK
Creating a JumpstartModel Endpoint
Pre-trained JumpStart models are listed at https://sagemaker.readthedocs.io/en/v2.82.0/doc_utils/jumpstart.html; a model ID from that list can be passed to the endpoint creation call.
```python
from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint

model = Model(
    model_id='deepseek-llm-r1-distill-qwen-1-5b'
)

server = Server(
    instance_type='ml.g5.8xlarge',
)

endpoint_name = SageMakerEndpoint(name='<my-endpoint-name>')

js_endpoint = HPJumpStartEndpoint(
    model=model,
    server=server,
    sage_maker_endpoint=endpoint_name
)

js_endpoint.create()
```
Creating a Custom Inference Endpoint (with S3)
```python
from sagemaker.hyperpod.inference.config.hp_endpoint_config import CloudWatchTrigger, Dimensions, AutoScalingSpec, Metrics, S3Storage, ModelSourceConfig, TlsConfig, EnvironmentVariables, ModelInvocationPort, ModelVolumeMount, Resources, Worker
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

model_source_config = ModelSourceConfig(
    model_source_type='s3',
    model_location="<my-model-folder-in-s3>",
    s3_storage=S3Storage(
        bucket_name='<my-model-artifacts-bucket>',
        region='us-east-2',
    ),
)

environment_variables = [
    EnvironmentVariables(name="HF_MODEL_ID", value="/opt/ml/model"),
    EnvironmentVariables(name="SAGEMAKER_PROGRAM", value="inference.py"),
    EnvironmentVariables(name="SAGEMAKER_SUBMIT_DIRECTORY", value="/opt/ml/model/code"),
    EnvironmentVariables(name="MODEL_CACHE_ROOT", value="/opt/ml/model"),
    EnvironmentVariables(name="SAGEMAKER_ENV", value="1"),
]

worker = Worker(
    image='763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0',
    model_volume_mount=ModelVolumeMount(
        name='model-weights',
    ),
    model_invocation_port=ModelInvocationPort(container_port=8080),
    resources=Resources(
        requests={"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"},
        limits={"nvidia.com/gpu": 1}
    ),
    environment_variables=environment_variables,
)

tls_config = TlsConfig(tls_certificate_output_s3_uri='s3://<my-tls-bucket-name>')

custom_endpoint = HPEndpoint(
    endpoint_name='<my-endpoint-name>',
    instance_type='ml.g5.8xlarge',
    model_name='deepseek15b-test-model-name',
    tls_config=tls_config,
    model_source_config=model_source_config,
    worker=worker,
)

custom_endpoint.create()
```
List Endpoints
```python
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# List JumpStart endpoints
jumpstart_endpoints = HPJumpStartEndpoint.list()
print(jumpstart_endpoints)

# List custom endpoints
custom_endpoints = HPEndpoint.list()
print(custom_endpoints)
```
Describe an Endpoint
```python
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# Get JumpStart endpoint details
jumpstart_endpoint = HPJumpStartEndpoint.get(name="js-endpoint-name", namespace="test")
print(jumpstart_endpoint)

# Get custom endpoint details
custom_endpoint = HPEndpoint.get(name="endpoint-custom")
print(custom_endpoint)
```
Invoke an Endpoint
```python
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

data = '{"inputs":"What is the capital of USA?"}'

jumpstart_endpoint = HPJumpStartEndpoint.get(name="endpoint-jumpstart")
response = jumpstart_endpoint.invoke(body=data).body.read()
print(response)

custom_endpoint = HPEndpoint.get(name="endpoint-custom")
response = custom_endpoint.invoke(body=data).body.read()
print(response)
```
List Pods
```python
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# List pods
js_pods = HPJumpStartEndpoint.list_pods()
print(js_pods)

c_pods = HPEndpoint.list_pods()
print(c_pods)
```
Get Logs
```python
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# Get logs from a pod
js_logs = HPJumpStartEndpoint.get_logs(pod="<pod-name>")
print(js_logs)

c_logs = HPEndpoint.get_logs(pod="<pod-name>")
print(c_logs)
```
Get Operator Logs
```python
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# Get JumpStart operator logs
print(HPJumpStartEndpoint.get_operator_logs(since_hours=0.1))

# Get custom endpoint operator logs
print(HPEndpoint.get_operator_logs(since_hours=0.1))
```
Delete an Endpoint
```python
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# Delete the JumpStart endpoint
jumpstart_endpoint = HPJumpStartEndpoint.get(name="endpoint-jumpstart")
jumpstart_endpoint.delete()

# Delete the custom endpoint
custom_endpoint = HPEndpoint.get(name="endpoint-custom")
custom_endpoint.delete()
```
Observability - Getting Monitoring Information
```python
from sagemaker.hyperpod.observability.utils import get_monitoring_config

monitor_config = get_monitoring_config()
```
Space SDK
Creating a Space
```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace
from hyperpod_space_template.v1_0.model import SpaceConfig

# Create space configuration
space_config = SpaceConfig(
    name="myspace",
    namespace="default",
    display_name="My Space",
)

# Create and start the space
space = HPSpace(config=space_config)
space.create()
```
List Spaces
```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# List all spaces in the default namespace
spaces = HPSpace.list()
for space in spaces:
    print(f"Space: {space.config.name}, Status: {space.status}")

# List spaces in a specific namespace
spaces = HPSpace.list(namespace="your-namespace")
```
Get a Space
```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get a specific space
space = HPSpace.get(name="myspace", namespace="default")
print(f"Space name: {space.config.name}")
print(f"Display name: {space.config.display_name}")
```
Update a Space
```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get the existing space
space = HPSpace.get(name="myspace")

# Update the space configuration
space.update(
    display_name="Updated Space Name",
)
```
Start/Stop a Space
```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get the existing space
space = HPSpace.get(name="myspace")

# Start the space
space.start()

# Stop the space
space.stop()
```
Get Space Logs
```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get the space and retrieve logs
space = HPSpace.get(name="myspace")

# Get logs from the default pod and container
logs = space.get_logs()
print(logs)
```
List Space Pods
```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get the space and list associated pods
space = HPSpace.get(name="myspace")
pods = space.list_pods()
for pod in pods:
    print(f"Pod: {pod}")
```
Create Space Access
```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get the existing space
space = HPSpace.get(name="myspace")

# Create VS Code remote access
vscode_access = space.create_space_access(connection_type="vscode-remote")
print(f"VS Code URL: {vscode_access['SpaceConnectionUrl']}")

# Create web UI access
web_access = space.create_space_access(connection_type="web-ui")
print(f"Web UI URL: {web_access['SpaceConnectionUrl']}")
```
Delete a Space
```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get the existing space
space = HPSpace.get(name="myspace")

# Delete the space
space.delete()
```
Port Forward to a Space
```python
from sagemaker.hyperpod.space.hyperpod_space import HPSpace

# Get the existing space
space = HPSpace.get(name="myspace")

# Port forward with the default remote port (8888)
space.portforward_space(local_port="8080")

# Port forward with a custom remote port
space.portforward_space(local_port="8080", remote_port="8888")
```
Access the space via `http://localhost:<local-port>` after port forwarding is established. Press `Ctrl+C` to stop port forwarding.
Space Template Management
```python
from sagemaker.hyperpod.space.hyperpod_space_template import HPSpaceTemplate

# Create a space template from a YAML file
template = HPSpaceTemplate(file_path="template.yaml")
template.create()

# List all space templates
templates = HPSpaceTemplate.list()
for template in templates:
    print(f"Template: {template.name}")

# Get a specific space template
template = HPSpaceTemplate.get(name="my-template")
print(template.to_yaml())

# Update the space template
template.update(file_path="updated-template.yaml")

# Delete the space template
template.delete()
```
Examples
This repository provides both a full end-to-end example walkthrough of using the CLI for real-world training and inference workloads as well as standalone example notebooks for individual features.
End-to-End Walkthrough
End-to-End Walkthrough Example
Standalone Examples
Cluster Management Example Notebooks
CLI Cluster Management Example
SDK Cluster Management Example
Training Example Notebooks
CLI Training Init Experience Example
Inference Example Notebooks
CLI
CLI Inference Jumpstart Model Init Experience Example
CLI Inference JumpStart Model Example
CLI Inference FSX Model Example
CLI Inference S3 Model Init Experience Example
CLI Inference S3 Model Example
SDK
SDK Inference JumpStart Model Example
SDK Inference FSX Model Example
SDK Inference S3 Model Example
Disclaimer
- This CLI and SDK require access to the user's file system to set and get context and to function properly. They need to read configuration files, such as kubeconfig, to establish the necessary environment settings.
Working behind a proxy server?
- Follow the steps here to set up HTTP proxy connections.