initialization-actions/python at main · GoogleCloudDataproc/initialization-actions

Python setup and configuration tools

Using this initialization action

⚠️ NOTICE: See best practices of using initialization actions in production.

Install PyPI packages

Install PyPI packages into user Python environment using pip command.

Note: when using this initialization action with automation, pinning package versions (see examples) is strongly encouraged to make cluster environment hermetic.

Options

  • PIP_PACKAGES - a space separated list of packages to install. Packages can contain version selectors.

Examples

Installing one package at head

REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --metadata 'PIP_PACKAGES=pandas' \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh

Installing several packages with version selectors

REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh

Install Conda packages

Install Conda packages into user Python environment using conda command.

Note: when using this initialization action with automation, pinning package versions (see examples) is strongly encouraged to make cluster environment hermetic.

Options

  • CONDA_CHANNELS - a space separated list of new channels to configure in addition to default channels.
  • CONDA_PACKAGES - a space separated list of packages to install. Packages can contain version selectors.

Examples

Installing one package at head

REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --metadata 'CONDA_PACKAGES=scipy' \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/conda-install.sh

Installing several packages with version selectors

REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --metadata 'CONDA_PACKAGES=scipy=0.15.0 curl=7.26.0' \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/conda-install.sh

Installing packages from a new channel

REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --metadata 'CONDA_CHANNELS=bioconda' \
    --metadata 'CONDA_PACKAGES=recentrifuge=1.0.0' \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/conda-install.sh