Optimizing Airflow Task That Transfers Data From Bigquery Into Mongodb
Solution 1:
The short answer is that asynchronous operations are muddying your profiling.
The docs on bq.query state that the resulting google.cloud.bigquery.job.QueryJob object is an asynchronous query job. This means that, after the query is submitted, the Python interpreter does not block until you try to use the results of the query with one of the synchronous QueryJob methods, such as to_dataframe(). A significant share of the 87 seconds you're seeing is likely just spent waiting for the query to return.
You could wait for the query to complete by calling QueryJob.done iteratively until it returns True, then call your second profiling print statement.
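A minimal sketch of that polling loop, assuming you already have a BigQuery client and a query string (the commented-out usage names are placeholders, not from your code):

```python
import time

def wait_for_job(job, poll_seconds=0.5):
    """Poll job.done() until the QueryJob reports completion.

    Returns the elapsed wait time so the query wait can be profiled
    separately from the to_dataframe() conversion cost.
    """
    start = time.monotonic()
    while not job.done():
        time.sleep(poll_seconds)
    return time.monotonic() - start

# Hypothetical usage (assumes default credentials and a `sql` string):
# from google.cloud import bigquery
# client = bigquery.Client()
# job = client.query(sql)
# waited = wait_for_job(job)
# print(f"query finished after {waited:.1f}s")  # time spent on the query itself
# df = job.to_dataframe()  # now mostly measures download + DataFrame build
```

With the wait isolated, your second print statement will reflect only the cost of materializing results, not the query execution.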
This isn't quite an optimization of your code, but hopefully it helps move in the right direction. Some tuning of the pandas round trip might help, but most of your time is likely being spent waiting on reads and writes to your databases, so writing more efficient queries, or a larger number of smaller ones, is probably your only option for cutting the total time.
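On the write side, one common lever is batching the MongoDB inserts rather than writing row by row. A sketch, assuming a pandas DataFrame and a pymongo collection (the `collection` name and batch size are assumptions):

```python
from itertools import islice

def chunked(iterable, size):
    """Yield lists of at most `size` items; useful for batching bulk writes."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

# Hypothetical usage with pymongo:
# records = df.to_dict("records")  # DataFrame -> list of dicts
# for batch in chunked(records, 1000):
#     collection.insert_many(batch, ordered=False)  # unordered allows faster bulk writes
```

`insert_many` with `ordered=False` lets the server apply writes in parallel and keep going past individual failures, which typically beats a loop of single-document inserts.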