Pyspark Crossjoin Between 2 Dataframes With Millions Of Records
I have 2 dataframes, A (35 million records) and B (30,000 records).

A
|Text |
-------
| pqr |
-------
| xyz |
-------

B
|Title |
-------
| a |
-------
| b |
-------
| c |
-------
Solution 1:
Try using a broadcast join:
from pyspark.sql.functions import broadcast
c = A.crossJoin(broadcast(B))  # broadcast the smaller DataFrame (B, 30,000 rows), not A
If you don't need an extra "Contains" column, then you can just filter the cross join:
from pyspark.sql.functions import col
display(c.filter(col("Text").contains(col("Title"))).distinct())