PySpark: Inconsistency In Converting Timestamp To Integer In Dataframe
I have a dataframe with a rough structure like the following: +-------------------------+-------------------------+--------+ | timestamp | adj_timestamp | v
Solution 1:
I'm not quite sure yet why the udf is not working. It might be a float-handling problem when the Python function is converted to a UDF. See below how returning an integer works. Alternatively, you can use a Spark function called unix_timestamp, which converts a timestamp to Unix time in seconds. I give an example below. Hope it helps a bit.
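One possible explanation for the udf problem, offered as an assumption since the original function is not shown here: time.mktime returns a Python float, and a PySpark UDF does not cast its return value to the declared return type, so a float returned under IntegerType() typically shows up as null. The plain-Python sketch below (the helper names are hypothetical) only illustrates the type difference that an int() call removes.

```python
import datetime
import time


def to_epoch_float(ts):
    """Parse a 'YYYY-MM-DD HH:MM:SS.fff' string; time.mktime returns a float."""
    parsed = datetime.datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
    return time.mktime(parsed.timetuple())


def to_epoch_int(ts):
    """Same conversion, explicitly cast so the result matches an integer column type."""
    return int(to_epoch_float(ts))


as_float = to_epoch_float("2017-05-31 15:30:48.000")
as_int = to_epoch_int("2017-05-31 15:30:48.000")

print(type(as_float).__name__)  # float
print(type(as_int).__name__)    # int
```

The numeric value itself depends on the machine's local time zone, so no specific number is shown here.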
Here I create a Spark dataframe from the examples you showed:
import pandas as pd
df = pd.DataFrame([
['2017-05-31 15:30:48.000', '2017-05-31 11:30:00.000', 0],
['2017-05-31 15:31:45.000', '2017-05-31 11:30:00.000', 0],
['2017-05-31 15:32:49.000', '2017-05-31 11:30:00.000', 0]],
columns=['timestamp', 'adj_timestamp', 'values'])
df = spark.createDataFrame(df)
Solve by using a Spark function
Apply fn.unix_timestamp to the column timestamp:
import pyspark.sql.functions as fn
from pyspark.sql.types import *
df.select(fn.unix_timestamp(fn.col('timestamp'), format='yyyy-MM-dd HH:mm:ss.000').alias('unix_timestamp')).show()
For the first column, the output looks like this:
+--------------+
|unix_timestamp|
+--------------+
| 1496259048|
| 1496259105|
| 1496259169|
+--------------+
You can convert this back to a timestamp using the datetime library:
import datetime
datetime.datetime.fromtimestamp(1496259048)  # datetime.datetime(2017, 5, 31, 15, 30, 48) in the poster's local time zone
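Because fromtimestamp without a tz argument uses the machine's local time zone, the same epoch value can print differently on another machine. A minimal sketch of making the zone explicit follows; the UTC-4 offset is an assumption inferred from the values in this answer, not something the question states.

```python
from datetime import datetime, timedelta, timezone

epoch = 1496259048

# Interpret the epoch in UTC; this does not depend on the machine's settings.
as_utc = datetime.fromtimestamp(epoch, tz=timezone.utc)

# Assumed offset of the original data (UTC-4); adjust for your own zone.
assumed_zone = timezone(timedelta(hours=-4))
as_local = datetime.fromtimestamp(epoch, tz=assumed_zone)

print(as_utc.isoformat())    # 2017-05-31T19:30:48+00:00
print(as_local.isoformat())  # 2017-05-31T15:30:48-04:00
```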
Solve by converting to integer instead of float
import datetime
import time
def timeConverter(timestamp):
    # Parse the string, then convert the local-time tuple to seconds since the epoch
    time_tuple = datetime.datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.000").timetuple()
    timevalue = int(time.mktime(time_tuple))  # time.mktime returns a float, so convert to int here
    return timevalue
time_udf = fn.udf(timeConverter, IntegerType())  # declared return type is integer
df.select(time_udf(fn.col('timestamp'))).show()
Here, we get the same timestamps, [1496259048, 1496259105, 1496259169], as with unix_timestamp.
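As a Spark-free sanity check of these three values, the sketch below recomputes them in plain Python. It assumes the timestamps are wall-clock times at UTC-4, which is inferred from the numbers above rather than stated in the question; the helper name to_epoch is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the sample timestamps are local times at UTC-4.
ASSUMED_TZ = timezone(timedelta(hours=-4))


def to_epoch(ts, tz=ASSUMED_TZ):
    """Parse 'YYYY-MM-DD HH:MM:SS.fff' as a time in tz and return whole epoch seconds."""
    naive = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
    return int(naive.replace(tzinfo=tz).timestamp())


samples = [
    "2017-05-31 15:30:48.000",
    "2017-05-31 15:31:45.000",
    "2017-05-31 15:32:49.000",
]
epochs = [to_epoch(s) for s in samples]
print(epochs)  # [1496259048, 1496259105, 1496259169]
```

With a different assumed offset the three numbers shift by the same amount, which is a quick way to spot a time zone mismatch between Spark and Python.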