
Pyspark Crossjoin Between 2 Dataframes With Millions Of Records

I have 2 dataframes, A (35 million records) and B (30000 records).

A

| Text |
|------|
| pqr  |
| xyz  |

B

| Title |
|-------|
| a     |
| b     |
| c     |

Solution 1:

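Try using a broadcast join. Broadcast the smaller dataframe (B here) so that each executor gets its own copy and the 35 million rows of A do not have to be shuffled:

from pyspark.sql.functions import broadcast, col
c = A.crossJoin(broadcast(B))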

If you don't need the extra "Contains" column, you can just filter the result (col comes from pyspark.sql.functions, and display is the Databricks notebook helper):

display(c.filter(col("text").contains(col("Title"))).distinct())
