
PySpark Replace NaN With Null

I use Spark to perform data transformations that I then load into Redshift. Redshift does not support NaN values, so I need to replace all occurrences of NaN with NULL. I tried something, but it didn't work.

Solution 1:

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

df = df.replace(float('nan'), None)
df.show()

+----+----+
|   a|   b|
+----+----+
|   1|null|
|null| 1.0|
+----+----+

You can use the .replace function to change NaN values to null in one line of code.
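
Not from the original answer, but worth noting: .replace() also accepts a subset parameter, so if only certain columns can contain NaN you can limit the replacement to them. A minimal sketch, assuming only column "b" needs it:

# Hypothetical variant: restrict the NaN -> null replacement to column "b".
df = df.replace(float('nan'), None, subset=['b'])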

Solution 2:

I finally found the answer after Googling around a bit.

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

import pyspark.sql.functions as F

columns = df.columns
for column in columns:
    df = df.withColumn(column, F.when(F.isnan(F.col(column)), None).otherwise(F.col(column)))

sqlContext.registerDataFrameAsTable(df, "df2")
sqlContext.sql('select * from df2').show()

+----+----+
|   a|   b|
+----+----+
|   1|null|
|null| 1.0|
+----+----+

It doesn't use na.fill(), but it accomplishes the same result, so I'm happy.
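
Not part of the original answer, but an equivalent sketch: calling withColumn in a loop adds one projection per column to the query plan, so a single select is often preferred on wide DataFrames. Since isnan() is only meaningful for floating-point columns, this version restricts the rewrite to float and double columns (assuming df as defined above):

import pyspark.sql.functions as F

# NaN can only occur in float/double columns, so leave other types untouched.
float_cols = {c for c, t in df.dtypes if t in ('float', 'double')}
df = df.select([
    F.when(F.isnan(F.col(c)), None).otherwise(F.col(c)).alias(c)
    if c in float_cols else F.col(c)
    for c in df.columns
])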
