
Pyspark Crossjoin Between 2 Dataframes With Millions Of Records

October 26, 2023

I have 2 dataframes, A (35 million records) and B (30000 records).

A

|Text |
-------
| pqr |
-------
| xyz |
-------

B

|Title |
-------
| a |
-------
| b |
-------
| c |
-------

Solution 1:

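Try using a broadcast join. Broadcast the smaller dataframe (B, 30000 records) so that every executor holds a full copy of it while the larger dataframe A stays partitioned:

from pyspark.sql.functions import broadcast, col
c = A.crossJoin(broadcast(B))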

If you don't need an extra "Contains" column, then you can just filter the result:

display(c.filter(col("Text").contains(col("Title"))).distinct())
