
Converting Spark Dataframe To Pandas Dataframe - Importerror: Pandas >= 0.19.2 Must Be Installed

I am trying to convert a Spark DataFrame to a pandas DataFrame in a Jupyter notebook on EMR, and I am getting the following error. Pandas library is installed on master node u

Solution 1:

You need pandas installed on the driver node: when converting to a pandas DataFrame, all the data is collected to the driver and converted there.
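To check this, it can help to see which Python interpreter the driver is running and whether that interpreter can locate pandas at all. The following is a minimal sketch, meant to be run in the same session that calls toPandas(). The helper name describe_driver_python is hypothetical, and the sketch assumes the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables are how the interpreter was selected, which may not hold on every cluster.

```python
import importlib.util
import os
import sys


def describe_driver_python(module_name="pandas"):
    """Report which interpreter is running and whether it can locate a module."""
    # find_spec returns None for a top-level module that is not on sys.path
    spec = importlib.util.find_spec(module_name)
    return {
        "executable": sys.executable,
        "python_version": sys.version.split()[0],
        # These may be unset; None simply means "not set in this process"
        "PYSPARK_PYTHON": os.environ.get("PYSPARK_PYTHON"),
        "PYSPARK_DRIVER_PYTHON": os.environ.get("PYSPARK_DRIVER_PYTHON"),
        "module_found": spec is not None,
        "module_location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    for key, value in describe_driver_python().items():
        print(f"{key}: {value}")
```

If module_found is False, pandas is not visible to that interpreter, even if pip reports it as installed for a different Python on the same node.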

Solution 2:

We also kept getting the following error when we ran an EMR 5.33.0 step that creates and manipulates DataFrames:

  File "/mnt/tmp/spark-49de09b2-5f77-4c46-a562-eed3742852be/test.py", line 131, in <module>
    stores = df.toPandas()['storename'].unique().tolist()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 2086, in toPandas
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 129, in require_minimum_pandas_version
ImportError: Pandas >=0.19.2 must be installed; however, it was not found.

This error is misleading: it is caused by a version conflict between the numpy and pandas packages, not by a missing pandas installation. AWS Support was able to find this one for us.

EMR runs its own bootstrap after the custom bootstrap actions that you specify, and that step installs its own set of libraries. The "numpy" version it installs leads to conflicts. For example, when we install "pandas==1.3.0" using the bootstrap script, pip installs "numpy==1.21.2" along with it. But then, as part of the EMR bootstrap (also called application provisioning), EMR installs "numpy==1.16.5". Because of this, the numpy version that pip3 reports differs from the one that python/pyspark actually loads.
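One way to see the mismatch is to import numpy and pandas directly with the interpreter that PySpark uses and print what actually gets loaded. The sketch below uses a hypothetical helper, report_module. It captures the underlying exception instead of letting it propagate, on the assumption that the real failure raised while importing pandas is more informative than the "it was not found" message in the traceback above.

```python
import importlib


def report_module(module_name):
    """Import a module and return its version and file location, or the real error."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:
        # Keep the original exception text; this is the detail the PySpark message hides
        return {
            "name": module_name,
            "version": None,
            "location": None,
            "error": f"{type(exc).__name__}: {exc}",
        }
    return {
        "name": module_name,
        "version": getattr(module, "__version__", None),
        "location": getattr(module, "__file__", None),
        "error": None,
    }


if __name__ == "__main__":
    for name in ("numpy", "pandas"):
        print(report_module(name))
```

Comparing the printed numpy version and location with what `pip3 show numpy` reports on the same node should indicate whether two different copies are in play.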

To fix it:

Step 1: Create a secondary bootstrap action script and upload it to S3

#!/bin/bash
# Keep checking the status of node provisioning; once it's SUCCESSFUL, run your code.
while true; do
    NODEPROVISIONSTATE=` sed -n '/localInstance [{]/,/[}]/{
                /nodeProvisionCheckinRecord [{]/,/[}]/ {
                /status: / { p }
                /[}]/a
                }
                /[}]/a
                }' /emr/instance-controller/lib/info/job-flow-state.txt | awk ' { print $2 }'`

    if [ "$NODEPROVISIONSTATE" == "SUCCESSFUL" ]; then
        sleep 10;
        echo "Running my post provision bootstrap"

        # Reinstall pandas after EMR's own provisioning has finished
        sudo pip3 install pandas==1.3.0

        # Stop polling once the install has run; without this the loop repeats the install every 10 seconds
        exit 0
    fi
    sleep 10;
done

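Step 2: Modify your existing bootstrap script so that it downloads the secondary script and starts it in the background. The bootstrap action itself has to return (exit 0) so that EMR can continue with application provisioning, which is what the secondary script waits for.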

#!/bin/bash -x
aws s3 cp s3://<BUCKET>/secondary-bootstrap.sh /home/hadoop/secondary-bootstrap.sh && sudo bash /home/hadoop/secondary-bootstrap.sh &
exit 0

Step 3: Relaunch your EMR cluster
