Airflow tutorial 6: Build a data pipeline using Google BigQuery

108,397 views

Tuan Vu

a day ago

In this tutorial, we will build a data pipeline by integrating Airflow with another cloud service: Google Cloud BigQuery.
🔥 Want to master SQL? Get the full SQL course: bit.ly/3DAlxZc
👍 Subscribe for more tutorials like this: bit.ly/2MEfT1H
- Tutorial post: www.applydatas...
- Code: github.com/tua...
- Instruction to set up and run this tutorial: github.com/tua...
- github-trend-analysis - jupyter notebook: github.com/tua...
⭐️Want to learn more from me? Connect with me here:
My website - www.applydatas...
Instagram: / tuan.a.vu
Facebook - / applydatascience
Twitter - / tuanvu8
More resources:
- How to aggregate data for BigQuery using Apache Airflow: cloud.google.c...
- GitHub data, ready for you to explore with BigQuery: blog.github.co...
- Hacker News on BigQuery: / hacker-news-on-bigquer...
- BigQuery pricing: cloud.google.c...

Comments: 64
@tuan-vu 5 years ago
Hi guys, I found another useful video about BigQuery, so you guys can explore running analysis on terabytes of data in the cloud free of charge: "How to run a terabyte of queries each month without a credit card": kzbin.info/www/bejne/rWXQq3hjYtijqLs

I hope you guys are having fun with this tutorial. If you have any questions, feel free to ask me in the comments or on the Facebook group: facebook.com/groups/542662596159086/

Also, sorry for the long wait; I am currently having a cold, so I cannot work on any new tutorials. However, I am recovering, so I will try my best to upload new videos for you guys.
@libbyy5608 5 years ago
Feel better!
@tuan-vu 5 years ago
@Lan YAO Thank you. I am feeling a lot better now. I am taking some time off to celebrate the Lunar New Year with my family, but I will try to record and upload new videos very soon. Finally, Happy New Year; I wish you great well-being and enduring flourishing.
@saeedrahman8362 5 years ago
The best Airflow tutorial out there, and combined with Google Cloud integration, it can't get any better. Looking forward to Airflow Tutorial 7.
@vbncml 3 years ago
This is so helpful, I am actually watching every ad in this video to somehow support you! Thank you so much!
@oldguywholifts 3 years ago
It's really appreciated that you take the time to explain these concepts, and practically too!
@lukosiv6713 5 years ago
I like the way you describe SQL queries!
@afterworkchillin521 3 years ago
This saved my life today :)
@renubhupati4137 4 years ago
Amazing tutorial, very well explained. Thanks and all the best!
@harrykekgmail 5 years ago
Once again, thanks very much!
@vibhatha 4 years ago
Very nice work on Airflow. Keep up the good work. 👍
@datoalavista581 3 years ago
Fantastic! Huge commitment and passion. Thank you!
@nguyendao7649 3 years ago
Idol ... Everything I'm running into, your video solves it all :))
@AsavPatel 5 years ago
Great video. Thanks for taking the time to create quality content. Can you please add a tutorial on using Airflow with the KubernetesExecutorOperator, and also on how we can pass dynamic variable parameters to existing workflows, or create dynamic DAGs?
@dylanjaramillo3168 3 years ago
Exactly what I needed, Thanks a ton!
@konutek7716 3 years ago
Good stuff, thanks
@aderintosadiq6059 3 years ago
I really like how he assumed I've got a credit card
@matheuscortezi680 3 years ago
The code in the GitHub repository is different from the one in the video tutorial, and this is causing some errors:

1. The path to the variable file (.json) is not correct, so you have to pass it manually every time you run the shell command for Docker. It's not a big problem, but it makes learning more difficult.
2. The variable file has the keys and values for BQ_CONN_ID, BQ_PROJECT and BQ_DATASET, but because the code on screen at minute 39:31 is not the same as in the repo, we have to set the values in the JSON file to what is on the screen.

I had no prior experience with Airflow and this set me back a whole day, sadly. I had to ask a friend for help. The code in the repo being different from the code on the screen makes it very confusing for beginners.
@fei7948 3 years ago
Hi Matheus, thanks for your updates; they're really helpful. However, for error 1, can you please provide the correct path? Thanks a lot.
@matheuscortezi680 3 years ago
@fei7948 Hello, sorry for the delay. I'll get them ASAP and post them here.
@ngoctunguyen222 2 years ago
@matheuscortezi680 Hi, can you please give the correct path? :D Thanks in advance.
@abudan799 1 year ago
Please, can you post how you solved the errors? Thanks in advance.
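For anyone stuck on the variables issue in this thread: a minimal sketch of setting the Airflow variable by hand, assuming the DAG reads a single JSON-valued variable named bigquery_github_trends_variables (as the KeyError reported further down suggests) with the BQ_CONN_ID / BQ_PROJECT / BQ_DATASET keys mentioned above. The values here are placeholders, not the repo's real ones:

```python
import json

from airflow.models import Variable

# Placeholder values -- substitute your own connection id, GCP
# project and BigQuery dataset. The key names are the ones the
# comment above says the DAG expects.
Variable.set(
    "bigquery_github_trends_variables",
    json.dumps({
        "BQ_CONN_ID": "my_gcp_conn",
        "BQ_PROJECT": "my-gcp-project",
        "BQ_DATASET": "github_trends",
    }),
)
```

Equivalently, a JSON file with the same structure can be imported with `airflow variables --import <file>` on the Airflow 1.10 CLI.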
@occo5877 4 years ago
Thanks for this! Great explanation
@rembautimes8808 5 years ago
Great tutorial. Thanks for taking the time to make it comprehensive.
@ankitseth5676 1 year ago
How do I get the data from one task (a select query), do some processing, and then use that processed data in the next task?
@miliar1732 5 years ago
Great content! Please keep going, you're awesome!
@abhijithsenthilraj6362 3 years ago
Hi, fantastic video. I have a question: 1) Can we store the output of the first task (a single value) into a local variable and pass it into the 2nd or a downstream task? 2) For example, if we are doing an incremental load on a table, task 1 gets the max(id) from query 1 and passes that id to the 2nd query's WHERE clause.
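Neither question gets an answer in the thread, but XCom is Airflow's standard mechanism for this. A minimal sketch under assumed names (connection id, project, dataset and tables are hypothetical): the PythonOperator's return value is pushed to XCom automatically, and the downstream BigQueryOperator pulls it inside its templated SQL:

```python
from airflow.contrib.hooks.bigquery_hook import BigQueryHook
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.python_operator import PythonOperator


def get_max_id(**context):
    # Query the current high-water mark; the return value is
    # pushed to XCom automatically by the PythonOperator.
    hook = BigQueryHook(bigquery_conn_id="my_gcp_conn", use_legacy_sql=False)
    return hook.get_first("SELECT MAX(id) FROM `my-project.my_dataset.target`")[0]

t1 = PythonOperator(
    task_id="get_max_id",
    python_callable=get_max_id,
    provide_context=True,
    dag=dag,  # assumes a `dag` object defined as in the tutorial
)

# The sql field is templated, so the XCom value can be pulled inline.
t2 = BigQueryOperator(
    task_id="incremental_load",
    sql="""
    SELECT * FROM `my-project.my_dataset.source`
    WHERE id > {{ ti.xcom_pull(task_ids='get_max_id') }}
    """,
    destination_dataset_table="my-project.my_dataset.target",
    write_disposition="WRITE_APPEND",
    use_legacy_sql=False,
    bigquery_conn_id="my_gcp_conn",
    dag=dag,
)

t1 >> t2
```

Note that XCom is meant for small values like this max(id), not for shipping whole result sets between tasks.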
@ngoctunguyen222 2 years ago
Dear Tuan, I appreciate your instruction, but I'm facing an issue: when Docker comes up, it errors out saying it cannot import httplib2. Can you help me with this issue? Thanks in advance!
@ngoctunguyen222 2 years ago
I've solved this issue. Thanks for the nice tutorial :)
@nicandrieux10 1 year ago
Hi, I have the same error. How did you solve it? 😅 Thanks in advance!
@beyondtheclouds95 1 year ago
Can you please share your solution?
@othithutrang1859 5 years ago
Thanks for such great work!
@rahulpandey5872 4 years ago
I am getting the following error: "airflow.exceptions.AirflowException: dag_id could not be found: bigquery_github_trends. Either the dag did not exist or it failed to parse." Is there some DB entry I need to make, or a config setting?
@rahulpandey5872 4 years ago
Also, I cannot see this folder structure: /usr/local/airflow. Am I missing anything?
@MrJasonmlee94 3 years ago
@rahulpandey5872 Getting the same error. How do we resolve this?
@fei7948 3 years ago
@MrJasonmlee94 Any updates? I had the same issue. Thanks.
@estefannyrodriguez3828 2 years ago
Hi, where is the apache/incubator-airflow repository?
@martinpishpecki3333 4 years ago
Nice tutorial. There is a problem, though. When I create my BigQueryOperator task and run the test, I can see the task creates rows in the BQ table, but the test output is a link to a JSON. When I open it, I see this error: "Request is missing required authentication credential. Expected OAuth 2 access token, login cookie or other valid authentication credential. See developers.google.com/identity/sign-in/web/devconsole-project." If I try to run the DAG in Airflow, the job never stops; it just keeps retrying without any errors, I am guessing because of that error. Any help is appreciated.
@arunrimal3266 2 years ago
@Tuan Vu Where does this githubarchive data come from? Is it a public dataset? If it is, there is no public dataset by that name in BigQuery. Could you please explain whether this is a public dataset or not, and if it is, tell us under which name it is available? Thank you.
@alinahohryakova1177 1 year ago
I don't know if it still matters, but the answer is: GitHub Activity Data.
@bahaaelsayed3938 1 year ago
Is the data still available now?
@sivashankaramaravelpandian6298 3 years ago
In t3, I am getting "ERROR - 400 Partitioning specification must be provided in order to create partitioned table". I tried to run the script in Airflow via Google Cloud Composer.
@juancarlosobando3045 2 years ago
Hi Sivashankar! I'm getting the same error. Did you solve the issue? Thanks!
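No fix appears in the thread, but one plausible cause, offered as an assumption: writing to a partition decorator (table$YYYYMMDD) when the destination table does not exist yet, so BigQuery must be told the new table is day-partitioned. A hedged sketch using the contrib BigQueryOperator's time_partitioning parameter; the task id, query and names are hypothetical:

```python
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

# Declaring DAY partitioning lets BigQuery create the table as
# partitioned instead of failing with "400 Partitioning
# specification must be provided".
t3 = BigQueryOperator(
    task_id="bq_write_to_github_agg",
    sql="SELECT 1 AS dummy  -- stand-in for the tutorial's aggregation query",
    destination_dataset_table="my-project.github_trends.github_agg${{ yesterday_ds_nodash }}",
    write_disposition="WRITE_TRUNCATE",
    use_legacy_sql=False,
    time_partitioning={"type": "DAY"},
    bigquery_conn_id="my_gcp_conn",
    dag=dag,  # assumes a `dag` object defined as in the tutorial
)
```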
@sophiayang3476 5 years ago
When using the BigQueryOperator, would it be possible to write and save the output to a local CSV file? And how would you do that? Thanks!
@tuan-vu 5 years ago
Hi Sophia, BigQueryOperator does not support writing its query results directly to a local CSV file. The easiest way is to use BigQueryOperator to write the results to another table, then export that table to GCS using BigQueryToCloudStorageOperator, and finally use BashOperator to copy the file from GCS to a local CSV file. Another way is to write your own plugin, or use PythonOperator with the BigQuery API to handle this task.
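A minimal sketch of that three-step chain with Airflow 1.10's contrib operators; the query, table, bucket and connection names are made up for illustration:

```python
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator
from airflow.operators.bash_operator import BashOperator

# Step 1: materialize the query results into a staging table.
save_query = BigQueryOperator(
    task_id="save_query_to_table",
    sql="SELECT name, value FROM `my-project.my_dataset.source`",
    destination_dataset_table="my-project.my_dataset.export_staging",
    write_disposition="WRITE_TRUNCATE",
    use_legacy_sql=False,
    bigquery_conn_id="my_gcp_conn",
    dag=dag,  # assumes a `dag` object defined as in the tutorial
)

# Step 2: export the staging table to a GCS bucket as CSV.
export_to_gcs = BigQueryToCloudStorageOperator(
    task_id="export_table_to_gcs",
    source_project_dataset_table="my-project.my_dataset.export_staging",
    destination_cloud_storage_uris=["gs://my-bucket/exports/results.csv"],
    export_format="CSV",
    bigquery_conn_id="my_gcp_conn",
    dag=dag,
)

# Step 3: pull the file down to the local filesystem with gsutil.
download_csv = BashOperator(
    task_id="download_csv",
    bash_command="gsutil cp gs://my-bucket/exports/results.csv /tmp/results.csv",
    dag=dag,
)

save_query >> export_to_gcs >> download_csv
```

Writing to a staging table first is what makes the export possible, since BigQuery export jobs operate on whole tables, not query results.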
@shardulkatare9395 4 years ago
Great video, although I have an error when I try to run the gcloud_example.sh. I get this: File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 4551, in get raise KeyError('Variable {} does not exist'.format(key)) KeyError: 'Variable bigquery_github_trends_variables does not exist'. What should I do? Please help.
@prabhukiran5764 5 years ago
Is there an operator to read from Salesforce, save the data on Google Cloud Storage, and process it for BigQuery operations?
@tuan-vu 5 years ago
Hi Prabhu, I don't see any community operator that currently supports getting data from Salesforce directly to GCS. Only:
- salesforce_to_s3_operator: github.com/airflow-plugins/salesforce_plugin/blob/master/operators/salesforce_to_s3_operator.py
- SalesforceToFileOperator: github.com/TheF1rstPancake/airflow-salesforce
Maybe you can use the salesforce_to_s3_operator as a template to write your own plugin to get data from Salesforce directly to GCS. Or, if you just want to play around, you can do SalesforceToFileOperator -> FileToGoogleCloudStorageOperator -> GoogleCloudStorageToBigQueryOperator.
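The last two hops of that chain are contrib operators that ship with Airflow 1.10. A sketch assuming the Salesforce extract has already landed as a local CSV; the file, bucket, table, schema and connection names are made up:

```python
from airflow.contrib.operators.file_to_gcs import FileToGoogleCloudStorageOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

# Upload a locally extracted Salesforce CSV to a GCS bucket.
upload_to_gcs = FileToGoogleCloudStorageOperator(
    task_id="salesforce_file_to_gcs",
    src="/tmp/salesforce_accounts.csv",
    dst="salesforce/accounts.csv",
    bucket="my-bucket",
    google_cloud_storage_conn_id="my_gcp_conn",
    dag=dag,  # assumes a `dag` object defined as in the tutorial
)

# Load the GCS file into BigQuery with an explicit schema.
load_to_bq = GoogleCloudStorageToBigQueryOperator(
    task_id="gcs_to_bigquery",
    bucket="my-bucket",
    source_objects=["salesforce/accounts.csv"],
    destination_project_dataset_table="my-project.my_dataset.sf_accounts",
    schema_fields=[
        {"name": "id", "type": "STRING", "mode": "NULLABLE"},
        {"name": "name", "type": "STRING", "mode": "NULLABLE"},
    ],
    skip_leading_rows=1,
    write_disposition="WRITE_TRUNCATE",
    bigquery_conn_id="my_gcp_conn",
    google_cloud_storage_conn_id="my_gcp_conn",
    dag=dag,
)

upload_to_gcs >> load_to_bq
```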
@prabhukiran5764 5 years ago
Can you make a video on hooks and plugins?
@tuan-vu 5 years ago
Sure, it is on my list now. Will record and publish the tutorial in the near future.
@kienbach3520 4 years ago
Is knowing SQL enough to do this, admin?
@turtlegangbigshells 4 years ago
Hey, really great video! I understand so much more now, but when I try to start the Jupyter kernel I run into an "Invalid credentials: Token authentication is enabled" problem, where Jupyter wants a password before I can use the notebooks. I looked it up online, but there is no good solution, since the Jupyter notebook is in a Docker container. Any ideas? Thanks so much!
@arunkumarmuthu4549 5 years ago
Hi, I want to load Oracle data into BigQuery. Are there any workarounds?
@tuan-vu 5 years ago
You can either:
- Write your own plugin similar to this OracleToAzure operator: github.com/apache/airflow/blob/master/airflow/contrib/operators/oracle_to_azure_data_lake_transfer.py
- Or just write your own Python function to do the transfer and use the PythonOperator to call it.
Or you don't have to move the data at all. If your data is already in Oracle, just wrap your SQL to do ETL and use Airflow to connect all the tasks as a complete data pipeline.
@narennanchari 5 years ago
@tuan-vu Can you please do a tutorial on this? Transferring data from Oracle to BigQuery using Airflow. Much appreciate your work.
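No such tutorial followed, but as a rough illustration of the PythonOperator route described above, here is a sketch under assumed names (connection ids, bucket, table and query are hypothetical), staging the Oracle extract through GCS:

```python
import csv

from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
from airflow.hooks.oracle_hook import OracleHook
from airflow.operators.python_operator import PythonOperator


def oracle_to_gcs(**context):
    # Pull rows from Oracle and write them to a local CSV file.
    oracle = OracleHook(oracle_conn_id="my_oracle_conn")
    rows = oracle.get_records("SELECT id, name FROM customers")
    path = "/tmp/customers.csv"
    with open(path, "w") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name"])
        writer.writerows(rows)
    # Upload the extract to GCS; from there GoogleCloudStorageToBigQueryOperator
    # (as in the Salesforce sketch above) or a `bq load` command can
    # finish the load into BigQuery.
    gcs = GoogleCloudStorageHook(google_cloud_storage_conn_id="my_gcp_conn")
    gcs.upload(bucket="my-bucket", object="oracle/customers.csv", filename=path)

extract_task = PythonOperator(
    task_id="oracle_to_gcs",
    python_callable=oracle_to_gcs,
    provide_context=True,
    dag=dag,  # assumes a `dag` object defined as in the tutorial
)
```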
@ramielkady938 3 years ago
90% plumbing, 10% actually doing something. The problem is companies pay us to actually do stuff, not to spend 6 months setting things up...
@sophiayue2484 2 years ago
Hi Tuan, your tutorials helped me get started with Airflow. Thank you for the great videos. I have the following questions/issues; if you could help, it would be great.

1) OpenTelemetry
I ran 'bash run_gcloud_example.sh' from Git Bash and got the error message regarding OpenTelemetry below. I googled OpenTelemetry and didn't get a clue how to fix the issue. Do you happen to know how?
"webserver_1 | [2022-04-27 16:00:18,511] {{opentelemetry_tracing.py:29}} INFO - This service is instrumented using OpenTelemetry. OpenTelemetry could not be imported; please add opentelemetry-api and opentelemetry-instrumentation packages in order to get BigQuery Tracing data."

2) Issues running the test
I ran "docker-compose -f docker-compose-gcloud.yml run --rm webserver airflow test bigquery_github_trends bq_check_githubarchive_day 2018-12-02". I followed your instructions to create a project 'apply-ds-test', create a dataset, and create a service account under Google Cloud. I also created a key in a JSON file, which I saved at "c:/Users/Asus/Airflow/airflow-tutorial/examples/gcloud-example/dags/support/keys/platinum-banner-348308-aa2fa94dbbfe.json".

a) Try 1: defined the keyfile path
I defined the keyfile path "/usr/local/Airflow/airflow-tutorial/examples/gcloud-example/dags/support/keys/platinum-banner-348308-aa2fa94dbbfe.json", and when I ran the test I got "FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/Airflow/airflow-tutorial/examples/gcloud-example/dags/support/keys/platinum-banner-348308-aa2fa94dbbfe.json'".

b) Try 2: copied the content with '{}' and without '{}' from the JSON file to Keyfile JSON
I copied the content with '{}' and without '{}' from the JSON file into Keyfile JSON, left the keyfile path blank, and got: "File "/usr/local/lib/python3.6/site-packages/airflow/contrib/hooks/gcp_api_base_hook.py", line 131, in _authorize credentials = self._get_credentials() File "/usr/local/lib/python3.6/site-packages/airflow/contrib/hooks/gcp_api_base_hook.py", line 115, in _get_credentials raise AirflowException('Invalid key JSON.') airflow.exceptions.AirflowException: Invalid key JSON. ERROR: 1"

Could you please advise how to resolve the issues? Thanks. Sophia
@enigma_mysterium 3 years ago
Hi Tuan, thank you a lot for putting out this tutorial for the community! While installing and configuring Docker and Airflow on my Windows 10 machine, I faced an issue: my default cmd path happened to be C:\Windows\System32, so I ran the command below inside that path.

curl -LfO "airflow.apache.org/docs/apache-airflow/2.1.0/docker-compose.yaml"

The next step was to create 3 directories and create an environment in the same path by running:

mkdir .\dags .\logs .\plugins
echo -e "AIRFLOW_UID=$(id -u) AIRFLOW_GID=0" > .env

But that path already had a folder called Logs, so I cannot create another one for my Airflow. Do you know if I could re-run the same command in a different directory which does not have a "Logs" folder?

curl -LfO "airflow.apache.org/docs/apache-airflow/2.1.0/docker-compose.yaml"

Should I perform any actions before re-running that command, to undo something? I was referencing airflow.apache.org/docs/apache-airflow/stable/start/docker.html. To me this is not an issue, as it just downloads the file into my current directory and sets up a virtual environment for Airflow.