How to build and automate your Python ETL pipeline with Airflow | Data pipeline | Python

102,086 views

BI Insights Inc

A day ago

Comments: 101
@BiInsightsInc 2 years ago
Videos in this series:
Build ETL pipeline: kzbin.info/www/bejne/mpfSpqKebraXi9k&t
ETL Load Reference Data: kzbin.info/www/bejne/jV6bpXh8drx3a6M
ETL Incremental Data Load (Source Change Detection): kzbin.info/www/bejne/aWOoo6l-lNKSpNE&t
ETL Incremental Data Load (Destination Change Comparison): kzbin.info/www/bejne/l5C3aauIlqiCbJI
How to install Apache Airflow: kzbin.info/www/bejne/qmXLZampirGqfKc
@beastmaroc7585 1 year ago
How do you load big tables, for example 30M rows, from SQL Server to PG? I'm trying to solve load speed and memory usage.
@BiInsightsInc 1 year ago
@@beastmaroc7585 You can horizontally scale your setup to a cluster if your data exceeds the resources of a single machine. This will allow you to process large datasets, as cluster nodes offer more compute and memory resources. You can also use a distributed engine like Spark and/or Kafka to process large datasets. I have discussed Airflow execution and the cluster-based approach here. Feel free to check it out. kzbin.info/www/bejne/f5-aq6qmZat3jpY&ab_channel=BIInsightsInc
@mredmister3014 28 days ago
You have perfect cadence and clarity in your training. Short yet complex. Very cool and subscribed.
@AbdelhadiChajia-b8d 2 months ago
Thank you sir, you helped me understand Airflow. I did the same thing following the same process, but from MySQL (extract-load -> transform -> load) with the free employees database, and I shared it on my GitHub and LinkedIn, tagging this video.
@tevintemu105 2 years ago
I want to thank you for posting this content. It is helpful in many ways.
@lasopainstantaneasuprema6861 7 months ago
Simple and extremely understandable, 10/10
@sharanchandrasekaran8399 1 year ago
Excellent video sir, thank you.
@demohub 1 year ago
This video was a great resource. Thanks for the tutelage and your take on it.
@flyingdutchman9804 1 year ago
Fantastic presentation!
@franciscmoldovan2153 2 years ago
Very nice videos and blog! Keep up the good work!
@thalesdefaria 2 years ago
Great content, so pragmatic!
@GiovanniDeCillis 2 years ago
Hey, this is really helpful. It would be even more insightful if you provided or suggested ways to run this process (along with those described in this recent series of tutorials) in the cloud or in a serverless environment. Thanks!
@BiInsightsInc 2 years ago
Hey Giovanni thanks for stopping by. We've covered AWS data lake in the last few videos and will cover ETL topics going forward. I will be sure to include Airflow and how to process data from S3 to Redshift with it. Stay tuned.
@GiovanniDeCillis 2 years ago
@@BiInsightsInc thank you indeed. I completely missed the data lake video. Thanks that is really helpful!
@Farisito 3 days ago
Thanks
@agussyarif8429 2 years ago
this is powerful knowledge
@bandedecodeurs 1 year ago
Very good video. Thank you!
@rjrmatias 1 year ago
Excellent tutorial, thank you! It would be great if you could split the tasks into several files; I need to learn how to do this.
@letsplaionline 2 years ago
Great explanation! Thanks!
@Levy957 2 years ago
Amazing
@crisybelkis 1 year ago
Thank you a lot. I'm trying to understand how to create a pipeline. I want to become an expert on this and be a good data engineering professional.
@fashiuddin7911 1 year ago
Great great great
@Obtemus 2 years ago
Great playlist. Your method of building the videos is very practical and lovable. One question: how do you perform the "paste" line by line in the recording? Is it ctrl+y after so many ctrl+z ops?
@BiInsightsInc 2 years ago
Glad you find it helpful. Yes, you're right. I write the code and then it's ctrl+y and ctrl+z as I am going through the code. It takes practice and sometimes a few retakes to get it right.
@Obtemus 2 years ago
@@BiInsightsInc That's so nice. I asked because I found it a very useful way to build the code during the video.
@saadlechhb3702 4 months ago
Thank you, I have a question. If the task is scheduled to run daily and new data has been inserted into the source since the last transfer, will just the new data get transferred on the next run, or all the data again?
@BiInsightsInc 4 months ago
This would bring in all the data; this is a truncate-and-load approach. If you need to bring in only the newly inserted data, then you want to look into the incremental data load approaches. I have covered those in the following videos:
kzbin.info/www/bejne/l5C3aauIlqiCbJI
kzbin.info/www/bejne/aWOoo6l-lNKSpNE
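For illustration, a minimal truncate-and-load sketch with Pandas and SQLAlchemy (the table names and connection string are placeholders, not the exact code from the video):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/dw")

    # full refresh: re-extract everything and replace the target table
    df = pd.read_sql("SELECT * FROM stg_product", engine)
    df.to_sql("dim_product", engine, if_exists="replace", index=False)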
@saadlechhb3702 4 months ago
@@BiInsightsInc Can I make a script that merges the incremental load with Airflow?
@rajansonvane488 1 year ago
Thanks !!❤
@ryanschraeder8681 3 months ago
What happens if you kill the airflow web server, or localhost? Will the DAG still run on the schedule you specified?
@BiInsightsInc 3 months ago
If the services are down, then the DAG won’t run. You want to make sure your server remains on for the DAG to execute on schedule.
@CriticalThinker0890 1 year ago
Why didn't you transform the data right after extracting it from MSSQL, and then load the final data to PostgreSQL?
@BiInsightsInc 1 year ago
It's better to break each operation into a separate task so it is easier to manage and troubleshoot. This in turn results in a neater DAG structure: extract >> transform >> load. You can write the entire operation in a single function, but it would be hard to maintain and debug.
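For reference, a stripped-down TaskFlow sketch of that structure (the function bodies, DAG id and schedule are placeholders; assumes a recent Airflow 2.x):

    import pendulum
    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=pendulum.datetime(2023, 1, 1), catchup=False)
    def product_etl():
        @task()
        def extract():
            ...  # pull the source rows, e.g. via MsSqlHook

        @task()
        def transform(data):
            ...  # clean and shape the data with Pandas

        @task()
        def load(data):
            ...  # write the result to Postgres

        # wiring the tasks this way produces extract >> transform >> load
        load(transform(extract()))

    product_etl()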
@nabarajghimire610 1 year ago
Please make a video about the installation of Airflow. I faced many issues on Windows 10/11, like localhost:8080 not opening properly and lacking many GUI features.
@BiInsightsInc 1 year ago
I have covered Airflow's installation as a Docker container. You can follow the steps outlined in the following video. Docker takes the OS out of the equation, and you will get the same Airflow functionality no matter what system you are running. kzbin.info/www/bejne/qmXLZampirGqfKc
@eunheechoi3745 11 months ago
Is it better to extract the entire tables to staging and then build the final table with consolidated data? Why? Wouldn't you recommend querying the data first (extract) and then transforming and loading it into the final table? I am trying to understand why the first approach is better than the latter...
@BiInsightsInc 11 months ago
If you are referring to the technique showcased in the video, then it is following the dimensional modeling approach. After staging we are building a proper dimension table. With basic ETL you would simply extract, transform and load the tables. I'd advise picking up a good book on ETL and dimensional modeling if you're curious. Hope this helps.
@eunheechoi3745 11 months ago
@@BiInsightsInc Thank you so much for the explanation and advice. Can you please recommend books/tutorials/courses that have practical hands-on projects, where I can delve into the data engineering space? I have worked in quant/data analytics for about 5+ years and am expanding my skills/knowledge into data engineering.
@parikshitchavan2211 1 year ago
Hello sir, thanks for such a great tutorial; you made everything smooth like butter. Just one question: whenever we make a new DAG, will we have to add docker-compose-CeleryExecutor, docker-compose-LocalExecutor, and config for that particular DAG?
@BiInsightsInc 1 year ago
Hi Parikshit, if you have mounted the host folder “dags” to the Docker container, just as shown in the tutorial, then DAGs created in this folder will automatically appear in your Airflow UI and will be copied over to the appropriate containers.
@parikshitchavan2211 1 year ago
Thanks sir for clearing up the confusion, please keep uploading videos 😀🙏
@ChaimaHermi-zi8pq 1 year ago
Could you please provide more details on what you mean by the "fact table of the data warehouse"?
@BiInsightsInc 1 year ago
Can you please provide the timestamp in the video for context?
@babakbarzegari9228 1 year ago
It was awesome, I had a blast. Could a video be created for ETL of SQL Server tables to Azure Blobs with Airflow? You have never worked with Azure Blob.
@BiInsightsInc 1 year ago
Thanks Babak! I will try and explore Azure.
@babakbarzegari9228 1 year ago
@@BiInsightsInc Thank you so much! Looking forward to it with great anticipation. Please ping me here when it's done.
@mrg807 1 year ago
Thanks for the video. One question: why didn't you use Airflow's built-in PostgresHook or PostgresOperator instead of SQLAlchemy and Pandas? I think this would simplify the code and make it more consistent with the way the SQL Server connection is established using MsSqlHook.
@BiInsightsInc 1 year ago
This is a continuation of the Python ETL series. This session orchestrates what we developed previously, and the transformations (the T in ETL) are carried out with Pandas. I am not sure the PostgresHook offers the same capabilities as Pandas, therefore I went with Pandas. I could've used the PostgresHook to get a DataFrame from Postgres, but I'm not sure I could persist a DataFrame as easily as with SQLAlchemy. In addition, if I were to use the PostgresOperator, then I would have to switch to SQL as opposed to Python.
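For what it's worth, a hedged sketch of the hook-based route (the conn_id and table names are placeholders; get_pandas_df and get_sqlalchemy_engine are inherited from DbApiHook in recent provider versions):

    from airflow.providers.postgres.hooks.postgres import PostgresHook

    hook = PostgresHook(postgres_conn_id="postgres")
    df = hook.get_pandas_df("SELECT * FROM stg_product")  # query into a DataFrame
    engine = hook.get_sqlalchemy_engine()                 # engine usable by df.to_sql
    df.to_sql("dim_product", engine, if_exists="replace", index=False)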
@alonsom.donayre1992 1 year ago
@@BiInsightsInc I guess this is just an example of how to use the hooks, because Airflow is not a processing framework but an orchestrator. Transformation should be handled by an external engine like Spark, a database engine, etc.
@BiInsightsInc 1 year ago
@@alonsom.donayre1992 Yes, it is an orchestrator; however, with the TaskFlow API in Airflow 2.0, which is used here, they're claiming you can carry out the execution (transformation) within Airflow. I am not sold on it because it is designed for small to medium jobs, unless you are working with a cluster. I am waiting for Dagster to mature, as it can handle both orchestration and processing. More to come on Dagster. So yeah, I would advise processing large datasets with an external database engine or distributed engines like Spark & Glue. airflow.apache.org/docs/apache-airflow/stable/tutorial/taskflow.html
@hellowillow_YT 8 months ago
Hi, Sir! Thanks for the insightful video! However, I'd like to ask: do we need to place the ETL Python file in a particular folder for it to be recognized as a DAG by Airflow?
@BiInsightsInc 8 months ago
Yes, you need to put the DAG files in the dags folder. This folder gets mounted into the Docker container.
@hellowillow_YT 8 months ago
Understood. Thank you, Sir! @@BiInsightsInc
@abnormalirfan 2 years ago
Hi, how do I build an Airflow DAG from MSSQL to GCS, please? I want to build a data warehouse in BigQuery.
@shrutibangad8112 2 years ago
Question: do you have a video on building a pipeline to move data from Postgres to SQL Server?
@BiInsightsInc 2 years ago
Hi Shruti, I think we can reverse the order of the source and destination and it will give you the code to move from Postgres to SQL Server. Hope this helps!
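Roughly, the reversed direction could look like this sketch (the connection strings and table names are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    src = create_engine("postgresql+psycopg2://user:pass@localhost:5432/adventureworks")
    dst = create_engine(
        "mssql+pyodbc://user:pass@localhost/AdventureWorks"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    )

    df = pd.read_sql("SELECT * FROM products", src)               # extract from Postgres
    df.to_sql("products", dst, if_exists="replace", index=False)  # load into SQL Server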
@beastmaroc7585 1 year ago
Can you share a tip for big tables, e.g. 30M rows? I have an issue with DataFrame memory usage.
@BiInsightsInc 1 year ago
I have done a video on this topic since a lot of folks were raising this. Also, look into big data processing frameworks designed for large-scale data processing, i.e. Dask, Kafka, Spark and Flink. I have covered Spark and Kafka on this channel. kzbin.info/www/bejne/bnLanGt5pcuYfcU&ab_channel=BIInsightsInc
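As a memory-friendly starting point, the table can be streamed in chunks instead of loading all 30M rows at once (a sketch; the chunk size, tables and connection strings are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    src = create_engine(
        "mssql+pyodbc://user:pass@localhost/src_db?driver=ODBC+Driver+17+for+SQL+Server"
    )
    dst = create_engine("postgresql+psycopg2://user:pass@localhost:5432/dw")

    # read_sql with chunksize returns an iterator of DataFrames,
    # so only one chunk is held in memory at a time
    for chunk in pd.read_sql("SELECT * FROM big_table", src, chunksize=100_000):
        chunk.to_sql("big_table", dst, if_exists="append", index=False, method="multi")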
@tahirrana786 2 years ago
👍
@isbakhullail6693 2 years ago
Do you use Docker for Airflow or not? Is there an installation video?
@BiInsightsInc 2 years ago
I have Airflow running on Docker. Here is the link to Airflow installation video: kzbin.info/www/bejne/qmXLZampirGqfKc
@eunheechoi3745 11 months ago
I was able to import airflow.providers.oracle and the Oracle hook. However, when I use OracleHook, it keeps throwing an error saying ‘conn_id’ not found, even though the connection has been configured fine via the Airflow UI. Do you have any idea what could be going wrong?
@BiInsightsInc 11 months ago
It's not able to find the connection configured in Airflow. You need to define the connection, test it, and try again.
@ulhasbhagwat4942 2 years ago
In load_src_data, can we not use the Postgres connection object (conn) to load the data instead of using create_engine? Because otherwise we need to know all the same connection details again that we used to create the connection ID in Airflow.
@BiInsightsInc 2 years ago
Hi Ulhas, when we use df.to_sql, we need to pass in a SQLAlchemy engine and not a standard connection object (and not a cursor either). If you try it on your end it will throw an error. Hope this helps.
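A tiny illustration of the difference (the credentials are placeholders):

    import pandas as pd
    import psycopg2
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/dw")
    conn = psycopg2.connect(host="localhost", dbname="dw", user="user", password="pass")

    df = pd.DataFrame({"id": [1, 2]})
    df.to_sql("demo", engine, if_exists="replace", index=False)  # works: SQLAlchemy engine
    # df.to_sql("demo", conn, ...) raises an error, because to_sql expects an
    # SQLAlchemy connectable (or a sqlite3 connection), not a raw DBAPI one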
@dzakwandaffahidayatullah3125 2 years ago
Do you have any suggestions for running Python ETL from the command line on an Ubuntu server, to read a CSV file or connect to a pre-existing MySQL database?
@BiInsightsInc 2 years ago
Hi Dzakwan, I have done a video on querying an existing MySQL database. You can follow along to establish the database connection and query the database. Once you write and test the script, save it as a Python file, e.g. etl.py. You can type "python etl.py" in the terminal to execute the script. Hope this helps. kzbin.info/www/bejne/o3zRfoxofNaUfas
@alanamantinodasilva4362 2 years ago
Can I ask you a question? I need to transfer data from Azure SQL Server to MSSQL Server (cloud to on-prem server). Can I do it directly from source and load to the destination, or do I need to land it in Postgres?
@BiInsightsInc 2 years ago
Hi Alan, you can directly load the data from Azure SQL to SQL Server. There’s no need to load data to PostgreSQL.
@alanamantinodasilva4362 2 years ago
@@BiInsightsInc Can you point me to content that explains this? I loved your video, but it is very complex for me; I just need to transfer data from source to destination. (Sorry about my English.)
@alanamantinodasilva4362 2 years ago
@@BiInsightsInc I tried with GenericTransfer, but this limits the transfer to 1000 rows per JSON file and takes a lot of time.
@eunheechoi3745 1 year ago
Hi, can you please make a video about a Python ETL pipeline with Airflow, extracting data from Oracle SQL and loading it into Postgres?
@BiInsightsInc 1 year ago
I think all you need to do is connect to the Oracle database, and the rest of the process should be similar. Here is a link to the oracledb library that showcases how to establish a connection to an Oracle database. python-oracledb.readthedocs.io/en/latest/user_guide/connection_handling.html
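A minimal connection sketch with python-oracledb (the DSN and credentials are placeholders):

    import oracledb
    import pandas as pd

    conn = oracledb.connect(user="hr", password="pass", dsn="localhost:1521/XEPDB1")
    df = pd.read_sql("SELECT * FROM employees", conn)  # extract into a DataFrame
    conn.close()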
@eunheechoi3745 1 year ago
@@BiInsightsInc Thanks for the reply. I already know how to connect to an Oracle DB in Python itself, but I was trying to connect Oracle directly to Airflow by using providers. No luck so far. Can you please make a video on how to install the Airflow provider for Oracle, and how to extract data from Oracle and insert it into Postgres?
@BiInsightsInc 1 year ago
@@eunheechoi3745 I will cover the oracle provider with airflow soon.
@eunheechoi3745 11 months ago
@@BiInsightsInc Thank you so much! Actually, I've figured it out. It was just a matter of including airflow.providers.oracle in the docker-compose.yaml file... but I suppose this is not recommended for production. Can you show a better approach to installing the Oracle provider, like how to install cx_Oracle with the Instant Client, etc.? Or alternatives... Also, after the connection to Oracle in Airflow is established, I am getting a UnicodeDecodeError (UTF-8) when I try to pull the entire table from Oracle... do you know how to fix it?
@BiInsightsInc 11 months ago
@@eunheechoi3745 You're on the right track with installing the Oracle provider. However, don't include it in the docker-compose file; rather, build a custom image with the additional libraries. I have covered how to build a custom image with additional libraries/providers here: kzbin.info/www/bejne/f5-aq6qmZat3jpY&t
@varshachauhan5185 2 years ago
Thank you so much for the video. I have one question: suppose I have 1000 different source files, how do I compare these files with one table present in the database?
@BiInsightsInc 2 years ago
Hi Varsha, thanks for stopping by. If the files' structure is the same, then you can combine them into a single DataFrame, then query the database. From there it is just a matter of comparing the two DataFrames. Hope this helps. Also, if you are trying to do change detection for an incremental load, I will be doing a video on how to perform incremental loads. Stay tuned.
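Sketched out, that comparison might look like this (the file pattern, key column and connection string are placeholders):

    import glob
    import pandas as pd
    from sqlalchemy import create_engine

    # combine all source files into a single DataFrame
    files = pd.concat((pd.read_csv(f) for f in glob.glob("data/*.csv")), ignore_index=True)

    engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/dw")
    table = pd.read_sql("SELECT * FROM target_table", engine)

    # rows present in the files but missing from the table, keyed on "id"
    new_rows = files[~files["id"].isin(table["id"])]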
@varshachauhan5185 2 years ago
@@BiInsightsInc Thank you for your reply. I will wait for your video, and I request you to make more videos on operations that can be performed using a Pandas DataFrame.
@BiInsightsInc 2 years ago
@@varshachauhan5185 I have a video on the Pandas library with various tips and tricks. Feel free to check it out: kzbin.info/www/bejne/Y5vIo62cnZeViZI
@mycousins896 1 year ago
Why didn't you show how you connect to MySQL? Please make a video on that!
@BiInsightsInc 1 year ago
As you can see, I am using SQL Server as the source and Postgres as the target; therefore, MySQL is out of scope, which is why it is not covered. Here is a screenshot of how to connect to MySQL via Airflow: github.com/hnawaz007/pythondataanalysis/blob/main/AirflowSession2/airflow_mysql.png
@mycousins896 1 year ago
@@BiInsightsInc Thanks, how about SQL Server? Can you make that as well? And how did you host your DB at that IP address in the picture? I'm asking because there really aren't any good tutorials out there for stuff like this. Many people underestimate how difficult it sometimes is to make the connections work between these tools, while the actual development, on the other hand, is most of the time easy. That is a real gap you could fill by making some videos on how to make it all work between Docker, Microsoft SQL Server/MySQL, etc., and Airflow, and write back to the connected database.
@BiInsightsInc 1 year ago
@@mycousins896 Check out the Airflow installation with the SQL Server provider. I go over how to install the SQL Server provider and how to create a SQL Server connection. How to install Apache Airflow on Docker? | Build Custom Airflow Docker Image | Airflow | Docker kzbin.info/www/bejne/qmXLZampirGqfKc
@WEN_the_cat_Astonot 2 years ago
SQL Server to GCS with Airflow, please.
@BiInsightsInc 2 years ago
Will make a video on this topic soon. Thanks
@WEN_the_cat_Astonot 2 years ago
​@@BiInsightsInc I'm really waiting for that
@BiInsightsInc 2 years ago
@@WEN_the_cat_Astonot SQL Server to GCP video is up. kzbin.info/www/bejne/hKKbhZduncSmsM0&t
@CSE-AshikAhamedP 1 year ago
But why are you importing Pandas?
@BiInsightsInc 1 year ago
Pandas is used to query the database and carry out the transformation.
@esmailpaltasingh9691 1 year ago
How do I connect to a Sybase DB?
@BiInsightsInc 1 year ago
You can use the sybase library in Python to connect to a Sybase database. If you want to use SQLAlchemy, then build a pyodbc connection string and create an engine from it:

    # Sybase library
    import Sybase
    db = Sybase.connect('server', 'name', 'pass', 'database')
    c = db.cursor()
    c.execute("sql statement")
    list1 = c.fetchall()

    # SQLAlchemy via pyodbc
    import urllib.parse
    from sqlalchemy import create_engine

    params = (
        "DRIVER=" + driver + ";"
        "SERVER=" + server + ";"
        "DATABASE=" + database + ";"
        "PORT=" + port + ";"
        "UID=" + user + ";"
        "PWD=" + password + ";"
    )
    params = urllib.parse.quote_plus(params)
    connection_string = 'sybase+pyodbc:///?odbc_connect=%s' % params
    engine = create_engine(connection_string)
@ardavanmoinzadeh801 2 years ago
Hi, can you share the source code?
@BiInsightsInc 2 years ago
Please check the description of the video. All the resources, including a link to the repo, are there. Thanks
@ardavanmoinzadeh801 2 years ago
@@BiInsightsInc yes.. I commented too fast :) found them and forgot to fix my comment!! thanks!
@bigdataon-premorcloud7866 2 years ago
How do I add the "Conn Type" for MSSQL? kzbin.info/www/bejne/m4vJdWmubqt9aqc
@BiInsightsInc 2 years ago
Check out the Airflow installation with the SQL Server provider. I go over how to create a SQL Server connection. How to install Apache Airflow on Docker? | Build Custom Airflow Docker Image | Airflow | Docker kzbin.info/www/bejne/qmXLZampirGqfKc