Custom Function Over a PySpark DataFrame
I'm trying to apply a custom function to the rows of a PySpark DataFrame. The function takes the row and two other vectors of the same dimension. It outputs the sum of the values of t…
Solution 1:
If you have non-column data that you want to use inside a function alongside column data to compute a new column, a UDF plus a closure plus withColumn, as described here, is a good place to start.
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

B = [2, 0, 1, 0]
V = [5, 1, 2, 4]

# B and V are plain Python lists captured by the lambda's closure; V_sum is the asker's function
v_sum_udf = F.udf(lambda row: V_sum(row, B, V), FloatType())

# pack every column into an array so the UDF receives the whole row as a Python list
spk_df = spk_df.withColumn("results", v_sum_udf(F.array(*(F.col(x) for x in spk_df.columns))))
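V_sum itself is cut off in the question; only its signature appears above. As a minimal sketch, assuming it simply sums the row's values element-wise with B and V (the body below is an assumption for illustration, not the asker's actual code):

def V_sum(row, B, V):
    # Hypothetical body: the real definition is truncated in the question.
    # row arrives as a Python list because the UDF wraps the columns in F.array.
    return float(sum(r + b + v for r, b, v in zip(row, B, V)))

One design note: F.array requires the packed columns to share a common type. For rows with mixed column types, passing F.struct(*spk_df.columns) to the UDF instead hands it a Row object, and the UDF body can then access fields by name.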