
GCP Dataproc Custom Image Python Environment

I have an issue when I create a Dataproc custom image and use PySpark. My custom image is based on Dataproc 1.4.1-debian9, and with my initialisation script I install python3 and some p…

Solution 1:

Updated answer (Q2 2021)

The customize_conda.sh script is the recommended way to customize the Conda environment for custom images.

If you need more than what the script does, you can read its code and write your own script. In that case you usually want to use the absolute paths, e.g. /opt/conda/anaconda/bin/conda, /opt/conda/anaconda/bin/pip, /opt/conda/miniconda3/bin/conda and /opt/conda/miniconda3/bin/pip, to install/uninstall packages for the Anaconda/Miniconda environment.
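For example, a custom image script might install packages into the Miniconda3 environment using those absolute paths; a minimal sketch (the package names are just placeholders):

# Install into the Miniconda3 env explicitly, so the OS's system pip/conda
# are not picked up by mistake during the custom image build.
/opt/conda/miniconda3/bin/pip install --upgrade numpy pandas
/opt/conda/miniconda3/bin/conda install -y scikit-learn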

Original answer

I'd recommend you first read Configure the cluster's Python environment, which gives an overview of Dataproc's Python environment on different image versions, as well as instructions on how to install packages and select the Python interpreter for PySpark jobs.
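For instance, assuming the default layout on a 1.4 image, you could point PySpark jobs at a specific interpreter with a Spark property at cluster creation time; a rough sketch (the cluster name and region are placeholders):

# Select the Python used by PySpark via spark.pyspark.python.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --image-version=1.4-debian9 \
    --properties='spark:spark.pyspark.python=/opt/conda/default/bin/python'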

In your case, 1.4 already comes with Miniconda3. Init actions and jobs are executed as root, and /etc/profile.d/effective-python.sh is executed to initialize the Python environment when the cluster is created. However, because the custom image script runs first and the optional component is activated only afterwards, Miniconda3 was not yet initialized at custom image build time. Your script therefore customized the OS's system Python, and then, at cluster creation time, Miniconda3 initialized its own Python, which overrode the OS's system Python.

I found a solution: add this code at the beginning of your custom image script, and it will put you in the same Python environment as your jobs:

# This is /usr/bin/python
which python

# Activate miniconda3 optional component.
cat >>/etc/google-dataproc/dataproc.properties <<EOF
dataproc.components.activate=miniconda3
EOF
bash /usr/local/share/google/dataproc/bdutil/components/activate/miniconda3.sh
source /etc/profile.d/effective-python.sh

# Now this is /opt/conda/default/bin/python
which python

then you could install packages, e.g.:

conda install <package> -y
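
If you build the image with Dataproc's open-source custom-images tool, the script above would typically be passed as the customization script; a rough sketch (the image name, script name, zone and bucket are placeholders):

python generate_custom_image.py \
    --image-name=my-custom-image \
    --dataproc-version=1.4.1-debian9 \
    --customization-script=customize_python.sh \
    --zone=us-central1-a \
    --gcs-bucket=gs://my-bucket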
