
Cosine Similarity For Two Pyspark Dataframes

I have a PySpark DataFrame, df1, that looks like:

CustomerID CustomerValue CustomerValue2
12         .17           .08

I have a second PySpark DataFrame, df2: CustomerID Custo…

Solution 1:

You can calculate cosine similarity only for two vectors, not for two numbers. That said, if the CustomerValue columns are the different components of a vector that represents the feature you want to compute similarities for between two customers, you can do it by transposing the data frame and then joining on the CustomerValue column names.
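To make the quantity concrete before the Spark version: cosine similarity is the dot product of two vectors divided by the product of their norms. Here is a plain-Python sketch (illustrative only, using the example values from the question; the Spark pipeline below computes the same quantity at scale):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Customer 12 from df1 against customer 15 from df2
print(round(cosine_similarity([0.17, 0.08], [0.17, 0.14]), 3))  # 0.969
```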

The transposition can be done with an explode (more details about transposing a data frame here):

from pyspark.sql import functions as F

kvs = F.explode(F.array([
        F.struct(F.lit(c).alias('key'), F.column(c).alias('value')) for c in ['CustomerValue', 'CustomerValue2']
      ])).alias('kvs')

dft1 = (df1.select(['CustomerID', kvs])
        .select('CustomerID', F.column('kvs.key').alias('column_name'), F.column('kvs.value').alias('column_value'))
        )
dft2 = (df2.select(['CustomerID', kvs])
        .select('CustomerID', F.column('kvs.key').alias('column_name'), F.column('kvs.value').alias('column_value'))
        )

where dft1 and dft2 denote the transposed data frames. Once you transposed them, you can join them on the column names:
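If it helps to visualize the transposition, this plain-Python snippet (no Spark needed) mimics the shape the explode produces: each wide row becomes one (CustomerID, column_name, column_value) row per value column:

```python
# One wide row from df1, as a dict (illustration only, not Spark code)
wide_row = {'CustomerID': 12, 'CustomerValue': 0.17, 'CustomerValue2': 0.08}

# The explode turns it into one long row per value column
long_rows = [
    (wide_row['CustomerID'], c, wide_row[c])
    for c in ['CustomerValue', 'CustomerValue2']
]
print(long_rows)
# [(12, 'CustomerValue', 0.17), (12, 'CustomerValue2', 0.08)]
```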

dft2 = (dft2.withColumnRenamed('CustomerID', 'CustomerID2')
        .withColumnRenamed('column_value', 'column_value2')
       )
cosine = (dft1.join(dft2, dft1.column_name == dft2.column_name)
          .groupBy('CustomerID' , 'CustomerID2')
          .agg(F.sum(F.column('column_value')*F.column('column_value2')).alias('cosine_similarity'))
         )

Now cosine has three columns: the CustomerID from the first data frame, the CustomerID from the second, and the cosine similarity (provided that the values were normalized first). This has the advantage that you only get rows for CustomerID pairs with a nonzero similarity (in case of zero values for some CustomerIDs). For your example:

df1:

CustomerID CustomerValue CustomerValue2
12         .17           .08

df2:

CustomerID CustomerValue CustomerValue2
15         .17           .14
16         .40           .43
18         .86           .09

cosine:

CustomerID CustomerID2 cosine_similarity
12       15        .0401
12       16        .1024
12       18        .1534
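As a plain-Python sanity check (no Spark required), the unnormalized column above is just the dot product of customer 12's vector with each customer in df2, using the example values:

```python
# Example vectors from the question (CustomerValue, CustomerValue2)
df1_vec = {12: (0.17, 0.08)}
df2_vecs = {15: (0.17, 0.14), 16: (0.40, 0.43), 18: (0.86, 0.09)}

# Dot product of customer 12 with each df2 customer
dots = {
    cid: round(sum(a * b for a, b in zip(df1_vec[12], v)), 4)
    for cid, v in df2_vecs.items()
}
print(dots)  # {15: 0.0401, 16: 0.1024, 18: 0.1534}
```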

Of course these are not the real cosine similarities yet; you need to normalize the values first. You can compute each customer's norm with a group by and join it back onto the transposed data frame:

norms = (dft1.groupBy('CustomerID')
         .agg(F.sqrt(F.sum(F.column('column_value')*F.column('column_value'))).alias('norm'))
        )
dft1 = (dft1.join(norms, on='CustomerID')
        .select('CustomerID', 'column_name',
                (F.column('column_value')/F.column('norm')).alias('column_value'))
       )

After normalizing the columns your cosine similarities become the following:

CustomerID CustomerID2 cosine_similarity
12         15         .969
12         16         .928
12         18         .944

The large similarity values are due to the low dimensionality (two components only).
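The normalized table can also be reproduced in plain Python (again using only the example values, no Spark): divide each dot product by the product of the two vector norms:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

u = (0.17, 0.08)                                   # customer 12 from df1
vecs = {15: (0.17, 0.14), 16: (0.40, 0.43), 18: (0.86, 0.09)}  # df2

# Cosine similarity = dot(u, v) / (|u| * |v|)
sims = {
    cid: round(sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v)), 3)
    for cid, v in vecs.items()
}
print(sims)  # {15: 0.969, 16: 0.928, 18: 0.944}
```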
