
Pyspark Numeric Window Group By

I'd like to be able to have Spark group by a step size, as opposed to just single values. Is there anything in Spark similar to PySpark 2.x's window function, but for numeric (non-date) values?

Solution 1:
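The snippets below assume a small example DataFrame named df with a single integer column foo. A minimal sketch of that setup (the values are inferred from the example output shown later) might look like this:

from pyspark.sql import SparkSession

# Hypothetical setup matching the example output below.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(10,), (11,), (12,), (13,)], ["foo"])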

You can reuse the timestamp-based window function and express the parameters in seconds. Tumbling window:

from pyspark.sql.functions import col, window

# Cast the numeric column to a timestamp so the time-based window function
# can bucket it; a 2-second window corresponds to a step size of 2.
df.withColumn(
    "window",
    window(
        col("foo").cast("timestamp"),
        windowDuration="2 seconds"
    ).cast("struct<start:bigint,end:bigint>")
).show()

# +---+-------+
# |foo| window|
# +---+-------+
# | 10|[10,12]|
# | 11|[10,12]|
# | 12|[12,14]|
# | 13|[12,14]|
# +---+-------+

Rolling (sliding) window:

# Adding a slide duration produces overlapping windows, so each value
# can fall into more than one bucket.
df.withColumn(
    "window",
    window(
        col("foo").cast("timestamp"),
        windowDuration="2 seconds",
        slideDuration="1 seconds"
    ).cast("struct<start:bigint,end:bigint>")
).show()

# +---+-------+
# |foo| window|
# +---+-------+
# | 10| [9,11]|
# | 10|[10,12]|
# | 11|[10,12]|
# | 11|[11,13]|
# | 12|[11,13]|
# | 12|[12,14]|
# | 13|[12,14]|
# | 13|[13,15]|
# +---+-------+

Grouping with groupBy on the window start:

w = window(col("foo").cast("timestamp"), "2 seconds").cast("struct<start:bigint,end:bigint>")
start = w.start.alias("start")

df.groupBy(start).count().show()

+-----+-----+
|start|count|
+-----+-----+
|   10|    2|
|   12|    2|
+-----+-----+
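If you would rather avoid the timestamp cast entirely, the same bucketing can also be expressed with plain integer arithmetic. This is only a sketch of that alternative (it is not part of the original answer), assuming the same df and a step size of 2:

from pyspark.sql.functions import col, floor

# Bucket each value into [start, start + step) using integer arithmetic;
# this reproduces the window-start grouping without the timestamp cast.
step = 2
df.groupBy((floor(col("foo") / step) * step).alias("start")).count().show()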
