HyperparameterTuner drops content_type when converting InputData to Channel

PySDK Version

  • PySDK V2 (2.x)
  • PySDK V3 (3.x)

Describe the bug
When HyperparameterTuner.tune() receives InputData objects as inputs, it converts them to Channel objects internally but drops the content_type field during conversion. This causes built-in algorithms (e.g., XGBoost) to fail with validate_data_file_path errors because the container doesn't know the data format.

To reproduce

from sagemaker.train.configs import InputData
from sagemaker.train.tuner import HyperparameterTuner

train_input = InputData(
    channel_name="train",
    data_source="s3://my-bucket/train/train.csv",
    content_type="csv",  # <-- this gets dropped
)

tuner = HyperparameterTuner(
    model_trainer=model_trainer,
    objective_metric_name="validation:auc",
    hyperparameter_ranges=hyperparameter_ranges,
    objective_type="Maximize",
    max_jobs=12,
    max_parallel_jobs=3,
    strategy="Bayesian",
)

tuner.tune(inputs=[train_input])
# All training jobs fail with:
# AlgorithmError: validate_data_file_path(train_path, content_type)

Root Cause
In sagemaker/train/tuner.py, the _create_hyperparameter_tuning_job method converts each InputData to a Channel without passing content_type:


```python
# tuner.py lines 1362-1373
if isinstance(inp, InputData):
    input_data_config.append(Channel(
        channel_name=inp.channel_name,
        data_source=DataSource(
            s3_data_source=S3DataSource(
                s3_data_type="S3Prefix",
                s3_uri=inp.data_source,
                s3_data_distribution_type="FullyReplicated"
            )
        )
        # content_type is missing here!
    ))
```

Suggested Fix

if isinstance(inp, InputData):
    input_data_config.append(Channel(
        channel_name=inp.channel_name,
        content_type=inp.content_type,  # <-- add this
        data_source=DataSource(
            s3_data_source=S3DataSource(
                s3_data_type="S3Prefix",
                s3_uri=inp.data_source,
                s3_data_distribution_type="FullyReplicated"
            )
        )
    ))
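
A regression test for this fix might look roughly like the sketch below. It is only an illustration: the dataclasses are simplified stand-ins for the SDK's InputData, Channel, DataSource, and S3DataSource (the real classes have more fields), and input_data_to_channel is a hypothetical helper that mirrors the conversion shown above, not an existing SDK function.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-in types for illustration only; the real SDK classes have more fields.
@dataclass
class InputData:
    channel_name: str
    data_source: str
    content_type: Optional[str] = None


@dataclass
class S3DataSource:
    s3_data_type: str
    s3_uri: str
    s3_data_distribution_type: str


@dataclass
class DataSource:
    s3_data_source: S3DataSource


@dataclass
class Channel:
    channel_name: str
    data_source: DataSource
    content_type: Optional[str] = None


def input_data_to_channel(inp: InputData) -> Channel:
    """Hypothetical helper mirroring the suggested fix: forward content_type."""
    return Channel(
        channel_name=inp.channel_name,
        content_type=inp.content_type,  # the field the current conversion omits
        data_source=DataSource(
            s3_data_source=S3DataSource(
                s3_data_type="S3Prefix",
                s3_uri=inp.data_source,
                s3_data_distribution_type="FullyReplicated",
            )
        ),
    )


def test_content_type_is_preserved() -> None:
    channel = input_data_to_channel(
        InputData(
            channel_name="train",
            data_source="s3://my-bucket/train/train.csv",
            content_type="csv",
        )
    )
    # The converted channel should carry the same content_type and S3 URI.
    assert channel.content_type == "csv"
    assert channel.data_source.s3_data_source.s3_uri == "s3://my-bucket/train/train.csv"
```

In the real test suite, the same assertion would presumably be made against the input_data_config that the tuner builds, rather than against a standalone helper.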

Workaround
Pass Channel objects directly instead of InputData:

from sagemaker.core.shapes import Channel, DataSource, S3DataSource

train_input = Channel(
    channel_name="train",
    content_type="csv",
    data_source=DataSource(
        s3_data_source=S3DataSource(
            s3_data_type="S3Prefix",
            s3_uri="s3://my-bucket/train/train.csv",
            s3_data_distribution_type="FullyReplicated",
        )
    ),
)

tuner.tune(inputs=[train_input])  # works correctly

Environment
SageMaker Python SDK version: 3.0.1
Python version: 3.12
Built-in algorithm: XGBoost 1.7-1