
Optimizing Airflow Task That Transfers Data From BigQuery Into MongoDB

I need to improve the performance of an Airflow task that transfers data from BigQuery to MongoDB. The relevant task in my DAG uses a PythonOperator, and simply calls the following
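As a stand-in for the task body, here is a minimal sketch of what such a PythonOperator callable typically looks like; the function name, table, database, and connection string are placeholders for illustration rather than the asker's actual code, and the pandas round trip via to_dataframe() is assumed.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import bigquery
from pymongo import MongoClient


def transfer_bigquery_to_mongodb():
    # Pull the query results from BigQuery into a pandas DataFrame.
    bq = bigquery.Client()
    df = bq.query("SELECT * FROM `my_project.my_dataset.my_table`").to_dataframe()

    # Insert the rows into MongoDB as documents.
    mongo = MongoClient("mongodb://localhost:27017")
    mongo["my_db"]["my_collection"].insert_many(df.to_dict("records"))


with DAG(
    dag_id="bigquery_to_mongodb",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    transfer_task = PythonOperator(
        task_id="transfer",
        python_callable=transfer_bigquery_to_mongodb,
    )
```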

Solution 1:

The short answer is that asynchronous operations are muddying your profiling.

The docs on bq.query state that the resulting google.cloud.bigquery.job.QueryJob object is an asynchronous query job. This means that, after the query is submitted, the Python interpreter does not block until you try to use the results of the query with one of the synchronous QueryJob methods, such as to_dataframe(). A significant share of the 87 seconds you're seeing is likely just spent waiting for the query to return.
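A rough sketch of how that shows up when profiling, using a hypothetical table name: the query() call returns almost immediately, and the wait lands on the first synchronous call.

```python
import time

from google.cloud import bigquery

bq = bigquery.Client()

start = time.monotonic()
job = bq.query("SELECT * FROM `my_project.my_dataset.my_table`")
# Submitting the job does not block, so this prints a very small number.
print(f"query() returned after {time.monotonic() - start:.2f}s")

start = time.monotonic()
df = job.to_dataframe()
# This call blocks until the job finishes and the results are downloaded,
# so the bulk of the elapsed time shows up here.
print(f"to_dataframe() returned after {time.monotonic() - start:.2f}s")
```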

You could wait for the query to complete by calling QueryJob.done() iteratively until it returns True, and only then make your second profiling print statement.
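A minimal sketch of that polling approach, again with a placeholder query:

```python
import time

from google.cloud import bigquery

bq = bigquery.Client()
job = bq.query("SELECT * FROM `my_project.my_dataset.my_table`")

# Poll until BigQuery reports the job as finished before taking the next timing.
while not job.done():
    time.sleep(1)

print("query finished")  # the second profiling print goes here

# Now to_dataframe() mostly measures the result download and DataFrame build,
# rather than the query execution itself.
df = job.to_dataframe()
```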

This isn't quite an optimization of your code, but it hopefully points you in the right direction. Some tuning of the pandas round trip could help, but I think most of your time is being spent waiting on reads from BigQuery and writes to MongoDB, so writing more efficient queries, or breaking the work into a larger number of smaller queries, is going to be your only real option for cutting down the total time.
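One way to avoid the pandas round trip entirely is to stream the result set in pages and write each page to MongoDB as a batch. This is only a sketch under the same hypothetical table and connection assumptions as above, and it isn't guaranteed to be faster for every workload.

```python
from google.cloud import bigquery
from pymongo import MongoClient

bq = bigquery.Client()
mongo = MongoClient("mongodb://localhost:27017")
collection = mongo["my_db"]["my_collection"]

# Stream the rows in pages rather than materializing one large DataFrame,
# and write them to MongoDB in fixed-size batches.
rows = bq.query("SELECT * FROM `my_project.my_dataset.my_table`").result(page_size=10_000)

batch = []
for row in rows:
    batch.append(dict(row.items()))
    if len(batch) >= 10_000:
        collection.insert_many(batch)
        batch = []

if batch:
    collection.insert_many(batch)
```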
