
PySpark: Inconsistency In Converting Timestamp To Integer In Dataframe

I have a dataframe with a rough structure like the following: +-------------------------+-------------------------+--------+ | timestamp | adj_timestamp | v

Solution 1:

For the UDF, I'm not quite sure yet why it's not working. It might be a float-handling problem that appears when the Python function is converted to a UDF. See below how returning an integer works. Alternatively, you can use a Spark function called unix_timestamp, which converts a timestamp to Unix time in seconds. I give an example below. Hope it helps a bit.

Here I create a Spark dataframe from the examples you showed:

import pandas as pd

df = pd.DataFrame([
    ['2017-05-31 15:30:48.000', '2017-05-31 11:30:00.000', 0], 
    ['2017-05-31 15:31:45.000', '2017-05-31 11:30:00.000', 0],
    ['2017-05-31 15:32:49.000', '2017-05-31 11:30:00.000', 0]], 
    columns=['timestamp', 'adj_timestamp', 'values'])
df = spark.createDataFrame(df)

Solve by using a Spark function

Apply fn.unix_timestamp to the column timestamp

import pyspark.sql.functions as fn
from pyspark.sql.types import *
df.select(fn.unix_timestamp(fn.col('timestamp'), format='yyyy-MM-dd HH:mm:ss.000').alias('unix_timestamp')).show()

For the first column, the output looks like this:

+--------------+
|unix_timestamp|
+--------------+
|    1496259048|
|    1496259105|
|    1496259169|
+--------------+

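You can convert this back to a timestamp using the datetime library. Note that fromtimestamp uses the machine's local time zone, so the result below assumes it matches the time zone Spark used to parse the column:

import datetime
datetime.datetime.fromtimestamp(1496259048)  # datetime(2017, 5, 31, 15, 30, 48) when the local time zone is UTC-4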
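
If the local time zone is not known in advance, it may be safer to pass one explicitly. The following is a minimal sketch, not part of the original answer: the names epoch_seconds, as_utc, utc_minus_4 and as_local are illustrative, and the UTC-4 offset is an assumption chosen only because it reproduces the 15:30:48 wall-clock time shown above.

```python
import datetime

epoch_seconds = 1496259048  # first value from the unix_timestamp output above

# Passing tz makes the result independent of the machine's local time zone.
as_utc = datetime.datetime.fromtimestamp(epoch_seconds, tz=datetime.timezone.utc)
print(as_utc.isoformat())  # 2017-05-31T19:30:48+00:00

# Assumed fixed offset (UTC-4), used here only for illustration.
utc_minus_4 = datetime.timezone(datetime.timedelta(hours=-4))
as_local = as_utc.astimezone(utc_minus_4)
print(as_local.isoformat())  # 2017-05-31T15:30:48-04:00
```

Both values describe the same instant. Only the displayed wall-clock time changes with the chosen zone.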

Solve by converting to integer instead of float

import datetime
import time

def timeConverter(timestamp):
    time_tuple = datetime.datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.000").timetuple()
    timevalue = int(time.mktime(time_tuple)) # convert to int here
    return timevalue

time_udf = fn.udf(timeConverter, IntegerType()) # declare an integer return type

df.select(time_udf(fn.col('timestamp'))).show()

This gives the same timestamps [1496259048, 1496259105, 1496259169] as unix_timestamp.
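
One caveat with the UDF approach: time.mktime interprets the parsed time in the machine's local time zone, so the same string can map to different integers on different machines (the values above correspond to a zone four hours behind UTC). As a hedged sketch, not part of the original answer, the plain-Python converter below treats the string as UTC instead and also accepts non-zero fractional seconds. The function name to_epoch_seconds_utc is hypothetical.

```python
import calendar
import datetime


def to_epoch_seconds_utc(timestamp: str) -> int:
    """Parse 'YYYY-MM-DD HH:MM:SS.fff' as UTC and return whole epoch seconds."""
    # %f accepts one to six fractional digits, so '.000' and '.123' both parse.
    parsed = datetime.datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.%f")
    # calendar.timegm treats the time tuple as UTC, unlike time.mktime,
    # which uses the local time zone. Fractional seconds are dropped.
    return calendar.timegm(parsed.timetuple())


print(to_epoch_seconds_utc("2017-05-31 15:30:48.000"))  # 1496244648
```

The result differs from 1496259048 by exactly 14400 seconds (four hours), which is the offset between UTC and the zone used in the outputs above. Whether UTC is the right interpretation depends on how the original timestamps were recorded.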

