SFT example notebook references inaccessible S3 dataset URI

PySDK Version

  • PySDK V2 (2.x)
  • PySDK V3 (3.x)

Describe the bug
The SFT fine-tuning example notebook hardcodes an S3 URI (s3://mc-flows-sdk-testing/...) that external users cannot access. Anyone who runs the notebook as-is gets a 403 Forbidden error from the S3 HeadObject call as soon as DataSet.create tries to register the dataset.

The notebook should either use a publicly accessible dataset or clearly instruct users to substitute their own, with a link to documentation of the required dataset format.

To reproduce
Run the following cell from sft_finetuning_example_notebook_pysdk_prod_v3.ipynb as-is:

from sagemaker.ai_registry.dataset import DataSet

dataset = DataSet.create(
    name="demo-1",
    source="s3://mc-flows-sdk-testing/input_data/sft/sample_data_256_final.jsonl"
)

Expected behavior
The example notebook should work out of the box, or clearly guide users to supply their own dataset with instructions on the required format.
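
As a rough illustration of the "guide users to supply their own dataset" option, the notebook cell could fail fast with a readable message instead of a raw 403. This is only a sketch: `PLACEHOLDER`, `DATASET_SOURCE`, and `require_user_dataset` are hypothetical names that do not exist in the SDK or the notebook, and the check covers only whether a source was supplied, not the dataset format.

```python
import os

# Hypothetical placeholder the notebook could ship instead of a private S3 URI.
PLACEHOLDER = "<your-dataset-path-or-s3-uri>"


def require_user_dataset(source: str) -> str:
    """Return `source` unchanged if it looks user-supplied, else raise a clear error."""
    if not source or source == PLACEHOLDER:
        raise ValueError(
            "Set DATASET_SOURCE to a local file path or an S3 URI that your "
            "credentials can read before running this cell."
        )
    if source.startswith("s3://"):
        # S3 permissions are not checked here; the SDK does that on registration.
        return source
    if not os.path.isfile(source):
        raise FileNotFoundError(f"Local dataset file not found: {source}")
    return source


# Intended use in the notebook cell (not executed here):
# DATASET_SOURCE = PLACEHOLDER
# dataset = DataSet.create(name="demo-1", source=require_user_dataset(DATASET_SOURCE))
```

With the placeholder left unchanged, the cell would stop with the ValueError message above rather than the botocore traceback shown below.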

Screenshots or logs

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:9                                                                                    │
│                                                                                                  │
│    6 # Register dataset in SageMaker AI Registry                                                 │
│    7 # This creates a versioned dataset that can be referenced by ARN                            │
│    8 # Provide a source (it can be local file path or S3 URL)                                    │
│ ❱  9 dataset = DataSet.create(                                                                   │
│   10 │   name="demo-1",                                                                          │
│   11 │   source="s3://mc-flows-sdk-testing/input_data/sft/sample_data_256_final.jsonl"           │
│   12 )                                                                                           │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/sagemaker/core/telemetry/telemetry_logging.py:172 in wrapper │
│ ❱ 172 │   │   │   │   │   │   raise caught_ex                                                    │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/sagemaker/core/telemetry/telemetry_logging.py:143 in wrapper │
│ ❱ 143 │   │   │   │   │   response = func(*args, **kwargs)                                       │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/sagemaker/ai_registry/dataset.py:283 in create               │
│   280 │   │   │   │   local_path = tmp_file.name                                                 │
│   281 │   │                                                                                      │
│   282 │   │   │   try:                                                                           │
│ ❱ 283 │   │   │   │   AIRHub.download_from_s3(source, local_path)                                │
│   284 │   │   │   │   cls._validate_dataset_format(local_path)                                   │
│   285 │   │   │   finally:                                                                       │
│   286 │   │   │   │   if os.path.exists(local_path):                                             │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/sagemaker/core/telemetry/telemetry_logging.py:180 in wrapper │
│ ❱ 180 │   │   │   │   return func(*args, **kwargs)                                               │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/sagemaker/ai_registry/air_hub.py:290 in download_from_s3     │
│   287 │   │   parsed = urlparse(s3_uri)                                                          │
│   288 │   │   bucket = parsed.netloc                                                             │
│   289 │   │   key = parsed.path.lstrip("/")                                                      │
│ ❱ 290 │   │   AIRHub._s3_client.download_file(bucket, key, local_path)                           │
│   291                                                                                            │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/botocore/context.py:123 in wrapper                           │
│ ❱ 123 │   │   │   │   return func(*args, **kwargs)                                               │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/boto3/s3/inject.py:223 in download_file                      │
│   222 │   with S3Transfer(self, Config) as transfer:                                             │
│ ❱ 223 │   │   return transfer.download_file(                                                     │
│   224 │   │   │   bucket=Bucket,                                                                 │
│   225 │   │   │   key=Key,                                                                       │
│   226 │   │   │   filename=Filename,                                                             │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/boto3/s3/transfer.py:484 in download_file                    │
│ ❱ 484 │   │   │   future.result()                                                                │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/s3transfer/futures.py:111 in result                          │
│ ❱ 111 │   │   │   return self._coordinator.result()                                              │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/s3transfer/futures.py:287 in result                          │
│   286 │   │   if self._exception:                                                                │
│ ❱ 287 │   │   │   raise self._exception                                                          │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/s3transfer/tasks.py:272 in _main                             │
│ ❱ 272 │   │   │   self._submit(transfer_future=transfer_future, **kwargs)                        │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/s3transfer/download.py:359 in _submit                        │
│   356 │   │   │   transfer_future.meta.size is None                                              │
│   357 │   │   │   or transfer_future.meta.etag is None                                           │
│   358 │   │   ):                                                                                 │
│ ❱ 359 │   │   │   response = client.head_object(                                                 │
│   360 │   │   │   │   Bucket=transfer_future.meta.call_args.bucket,                              │
│   361 │   │   │   │   Key=transfer_future.meta.call_args.key,                                    │
│   362 │   │   │   │   **transfer_future.meta.call_args.extra_args,                               │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/botocore/client.py:602 in _api_call                          │
│ ❱ 602 │   │   │   return self._make_api_call(operation_name, kwargs)                             │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/botocore/context.py:123 in wrapper                           │
│ ❱ 123 │   │   │   │   return func(*args, **kwargs)                                               │
│                                                                                                  │
│ .venv/lib/python3.11/site-packages/botocore/client.py:1078 in _make_api_call                    │
│   1075 │   │   │   │   'error_code_override'                                                     │
│   1076 │   │   │   ) or error_info.get("Code")                                                   │
│   1077 │   │   │   error_class = self.exceptions.from_code(error_code)                           │
│ ❱ 1078 │   │   │   raise error_class(parsed_response, operation_name)                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
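
The traceback shows the failure surfacing from the HeadObject request that boto3 issues before downloading. A pre-flight check along the following lines could let the notebook report the access problem in plain terms. It is a minimal sketch, not SDK code: `split_s3_uri` and `s3_object_error_code` are hypothetical helpers, the URI parsing mirrors what the traceback shows in air_hub.py, and the error is detected by its `response` attribute (as botocore's ClientError provides) so the helper itself needs only the standard library.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def s3_object_error_code(s3_client, s3_uri):
    """Return None if HeadObject succeeds, otherwise the service error code as a string.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    bucket, key = split_s3_uri(s3_uri)
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
    except Exception as err:
        # botocore's ClientError carries the parsed service response here.
        response = getattr(err, "response", None)
        if not isinstance(response, dict):
            raise
        return str(response.get("Error", {}).get("Code"))
    return None


# Intended use (requires boto3 and AWS credentials; not executed here):
# import boto3
# code = s3_object_error_code(boto3.client("s3"), source)
# if code is not None:
#     print(f"Cannot read {source} (error {code}); supply your own dataset.")
```

For the URI in this report, the helper would be expected to return "403", matching the error code in the log above.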

System information

  • SageMaker Python SDK version: SageMaker 3.5.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): SFTTrainer
  • Framework version: N/A
  • Python version: 3.11
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N

Additional context
Affected file: v3-examples/model-customization-examples/sft_finetuning_example_notebook_pysdk_prod_v3.ipynb