What's new in Apache Airflow 2.10? (15:50)
What's new in Apache Airflow 2.9? (13:24)
What's new in Apache Airflow 2.8? (9:58)
The New Airflow UI Explained (15:18)
What's new in Apache Airflow 2.7? (12:55)
What's new in Apache Airflow 2.6? (9:53)
What's new in Apache Airflow 2.5? (7:52)
Comments
@minkx09 · a day ago
How do you integrate LDAP in this version? I couldn't find it 🥲
@kenamia9136 · 4 days ago
I prefer the first method. Easier to read.
@eduardoamfm · 5 days ago
Awesome class, thanks!
@ashimov1970 · 5 days ago
Excellent content, terrible accent.
@marc2223 · 5 days ago
It’s the 🥐 accent, the more you hear it the more you love it ❤
@ashimov1970 · 5 days ago
@marc2223 Too hard to understand him.
@RemiAdeleye · 6 days ago
This was a great tutorial: easy to conceptualize, with a great example that was easy to follow. It's a great feeling to look at your data team's Airflow GitHub repo and actually understand what the DAGs are doing when you didn't know how Airflow worked an hour ago!
@rahul-x3y6l · 7 days ago
Hi, I am trying to run dbt transforms in Airflow. My table raw_invoices is already created in the US location, but I am still getting the error below in Airflow: "Error in model dim_customer (models/transform/dim_customer.sql): 404 Not found: Table airflow-project-446808:retail.raw_invoices was not found in location US". Can you suggest a fix?
@tina_rahi07 · 7 days ago
Thank you so much for this amazing video! It was very helpful and well explained. However, I encountered the following error when running my code: airflow.exceptions.AirflowException: Cannot execute: spark-submit --master spark://spark-master:7077 --conf spark.master=spark://spark-master:7077 --num-executors 2 --executor-memory 2g --driver-memory 2g --name arrow-spark --verbose --deploy-mode client /usr/local/airflow/include/scripts/read.py /usr/local/airflow/include/1.csv. Error code is: 1. The only addition in my code compared to yours is a dropna call. Could you please help me figure out what might be causing this issue? Thank you in advance!
@darshansolanki-u6q · 8 days ago
Backfill cannot happen when the DAG is not scheduled? In that case the airflow dags trigger command works, but it triggers the entire DAG rather than specific tasks.
@premsaikarampudi3944 · 10 days ago
@MarcLamberti: Can you help me fix this? The Docker image I am using is 12.6.0-python-3.10.
Broken DAG: [/usr/local/airflow/dags/retail.py] Traceback (most recent call last):
File "/usr/local/airflow/dags/retail.py", line 12, in <module>: from include.dbt.cosmos_config import DBT_PROJECT_CONFIG, DBT_CONFIG
File "/usr/local/airflow/include/dbt/cosmos_config.py", line 1, in <module>: from cosmos.config import ProfileConfig, ProjectConfig
ModuleNotFoundError: No module named 'cosmos'
@premsaikarampudi3944 · 10 days ago
If you could re-run the project with the latest Docker image and Python 3.10, I guess most of everyone's questions would be addressed.
@premsaikarampudi3944 · 11 days ago
Can't Astro just incorporate a utf-8 encoding parameter as part of its load_file method?
@PhuocJoshDang · 11 days ago
Can I please ask why we didn't just use dbt for the quality checks? What are the advantages of Soda over dbt?
@oreallyseven · 11 days ago
Thanks for the great video! By the way, is there any possibility to push and pull from git within Airflow? Could you please share any useful resource to go through? Thanks!
@dalicodes · 12 days ago
Done, thank you Marc. How can I query the table using Athena? I tried cataloging the data using Glue, but crawlers can only detect general purpose buckets.
@afzalandthedreams · 13 days ago
Hi Marc! I noticed you used the BashOperator in the example pipeline @13:16. Could you please explain why you chose BashOperator over PythonOperator in this case? Also, how do you typically decide which operator to use when designing your DAGs? Thank you again for sharing your knowledge... I really appreciate your efforts!
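For readers wondering the same thing, here is a rough, hedged sketch of the contrast (the DAG id, script path, and callable are hypothetical): BashOperator shells out to a command, while PythonOperator runs a callable in-process, so the choice usually follows where the logic already lives.

```python
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def _extract():
    print("extracting...")  # hypothetical Python-native logic


with DAG(
    dag_id="operator_contrast",  # hypothetical DAG id
    start_date=pendulum.datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    # Shelling out: natural for CLI tools and pre-existing scripts.
    run_script = BashOperator(
        task_id="run_script",
        bash_command="python /opt/scripts/extract.py",  # hypothetical path
    )
    # In-process callable: natural when the logic is plain Python.
    run_callable = PythonOperator(task_id="run_callable", python_callable=_extract)
```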
@azriabdrahim7040 · 13 days ago
Thanks Marc for the great video 🎉🎉 Sorry, I am a beginner; just out of curiosity, how much does AWS charge you for this project?
@gaurav54001 · 14 days ago
Hi Marc, after selecting the Python interpreter my autocomplete still doesn't work. I have tried switching between all the interpreters available on my machine. Any suggestions?
@eduardofarias87 · 15 days ago
I imagined this solution differently: I thought it would just be a matter of putting or updating a file in the dataset directory, without the need for a "producer" DAG. I have an external system, not a DAG, that unloads the data into a directory. For me, the TriggerDagRunOperator already fulfills exactly the same role as Datasets. Help me understand if that's not the case. Thanks for the content Marc!
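For contrast with Datasets, a minimal sketch of the explicit TriggerDagRunOperator approach the comment mentions; the producer has to hard-code the consumer's dag_id (the "consumer" id below is hypothetical):

```python
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Placed inside the producer DAG: explicitly triggers the downstream DAG.
# Unlike Datasets, the producer must know the consumer's dag_id up front.
trigger_consumer = TriggerDagRunOperator(
    task_id="trigger_consumer",
    trigger_dag_id="consumer",  # hypothetical downstream DAG id
)
```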
@XiaomingZhou-t5f · 15 days ago
Could you please share how to integrate Airflow v2 with a custom OAuth2 provider (not in the supported list from the Airflow docs)?
@hussainshaik368 · 22 days ago
Hey Marc, I'm a beginner and I loved your content. It would be really helpful if you could number the videos in this playlist to make it easier to follow the right order.
@MarcLamberti · 16 days ago
Good idea
@gtonizuka9990 · 26 days ago
Thanks for the video! Is Soda essential, and how much does it cost? I couldn't find the price on their website.
@hicks_dwaynes · 27 days ago
Thanks a lot for the lesson :)
@palodans1217 · a month ago
Good job.
@BuiTung-nw5xw · a month ago
Sir, when I inspect the network, the Airflow containers and the other containers are on different networks, even though I followed the instructions. Could you please post a video to help fix this?
@MarcLamberti · a month ago
Look at the Notion page in the description.
@BuiTung-nw5xw · a month ago
@MarcLamberti Although I followed the instructions on the Notion page, they still use different networks.
@jacopomalatesta1522 · 22 days ago
@BuiTung-nw5xw Have you A) explicitly defined the network in the Docker Compose file and B) passed the network to each service?
@zaminhassnain7570 · a month ago
Great, this worked fine locally. But when I deployed the same code to Airflow on a cloud.astronomer instance using the Celery executor, the DAG failed with this exception: airflow.exceptions.AirflowException: Cannot execute: spark-submit --master spark://spark-master:7077 --name arrow-spark --verbose --deploy-mode client ./include/scripts/read.py. Error code is: 1
@hicks_dwaynes · a month ago
Thanks for the lesson. I have a problem: TypeError: _choose_best_model() missing 1 required positional argument: 'ti'. Can you help? :) The returned values are 8, 9 and 6.
@MarcLamberti · a month ago
Put ti=None in the parameters.
@hicks_dwaynes · 28 days ago
@MarcLamberti Thanks, I changed Airflow 1.0 to 2.0 and it worked :)
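For anyone hitting the same TypeError: a minimal sketch of the fix Marc suggests above, assuming the callable from the tutorial (the task ids may differ in your project). In Airflow 2.x, declaring ti as a parameter is enough for the scheduler to inject the TaskInstance automatically:

```python
# Minimal sketch of the suggested fix. Declaring ti with a default of None
# lets Airflow 2.x inject the TaskInstance into the callable at run time.
def _choose_best_model(ti=None):
    accuracies = ti.xcom_pull(task_ids=[
        "training_model_A",  # task ids assumed from the tutorial
        "training_model_B",
        "training_model_C",
    ])
    if max(accuracies) > 8:
        return "accurate"
    return "inaccurate"
```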
@xavierangaleu · a month ago
I am trying to upload the CSV file to a Google Cloud bucket but I'm getting a TIMEOUT ERROR. I tried uploading a lighter file and it worked. Can someone please help? @MarcLamberti
@NamrataAnandkumarHurukadli · a month ago
Hey Marc, I have deferred tasks in my DAG, and when I use the backfill command the tasks don't go into the deferred state; it just marks all the tasks as successful... Does the backfill command not work with deferrable operators?
@LoveDarkChocolate · a month ago
I would like to ask about a DAG scheduled on a Dataset or Dataset Alias. I assume such a DAG will only run if those datasets are modified by another DAG, not by an event that modifies those files outside the Airflow architecture?
@MarcLamberti · a month ago
Correct
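To illustrate the point above, a minimal producer/consumer sketch assuming the Airflow 2.x Dataset API; the file URI and DAG names are hypothetical. The consumer runs only when a task with the matching outlet completes, never when the file changes outside Airflow:

```python
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

my_file = Dataset("/tmp/my_file.csv")  # hypothetical dataset URI


@dag(start_date=pendulum.datetime(2024, 1, 1), schedule="@daily", catchup=False)
def producer():
    @task(outlets=[my_file])
    def update_file():
        # Completing this task is what marks the Dataset as updated;
        # editing /tmp/my_file.csv by hand would not trigger the consumer.
        ...

    update_file()


@dag(start_date=pendulum.datetime(2024, 1, 1), schedule=[my_file], catchup=False)
def consumer():
    @task
    def process():
        ...

    process()


producer()
consumer()
```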
@tomasoon · a month ago
Is it possible to run PySpark jobs from Azure Synapse (for example, a notebook)?
@MarcLamberti · a month ago
I haven't tried yet.
@forgottenvy · a month ago
How do I get the compose file with the latest images? Is the doc updated regularly, or should I grab the latest images manually from Docker Hub and customize my compose file?
@MarcLamberti · a month ago
Use the doc, yes :)
@prakashraj4264 · a month ago
I have one doubt, sir: what is the point of installing Airflow? Any specific reason? And in what kind of scenario would we use it?
@Clement-r7t · a month ago
Great tutorial, thanks Marc!
@ashutosh4912 · a month ago
Now waiting for a video where we deploy multi-node Spark and Airflow workers on Kubernetes in production 🎉
@MrBestshorty · a month ago
Hey, great video! Could you show us how to set this up using Kubernetes and Helm?
@MarcLamberti · a month ago
Tell me more :)
@RushikeshMule-p4b · a month ago
If you're using Apprise, is it possible to tag people on Teams?
@MarcLamberti · a month ago
I think so
@ashutosh4912 · a month ago
They are in the same network; I verified using the inspect command 🍖 and I'm able to ping spark-master from the Airflow webserver as well.
@MarcLamberti · a month ago
here we go :)
@ashutosh4912 · a month ago
Getting this error: pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
@Ferdi-g1q · a month ago
Hey Marc, great video! It really does seem easy to play around with Spark and Airflow in a dev environment with your explanation, but I have 3 questions:
(1) At the end of the video, when you present the TaskFlow API way of triggering a PySpark job, you import pandas at top level, which is not really recommended if I'm not wrong (it should be imported within the task's function)? I get that you need it since your function returns a pandas DataFrame :)
(2) Same context as before: your function returns a pandas DataFrame, so it will be returned as an XCom, right? XComs used to be for passing small amounts of data between tasks, but I think that limitation is gone now that you can set an XCom backend to a cloud object store (GCS, S3, ABFS). But do we really want to do that, given that Spark is designed to deal with big amounts of data (although it can make sense for small amounts, as showcased in your example)? To rephrase my question: wouldn't it be better to code your PySpark job so that the result is written to cloud object storage instead of passing it around through XCom?
(3) In a higher environment (prod), would you still deploy Spark as a Kubernetes application (Docker in your case, but it is a container), or would it be better to leverage the fact that PySpark applications can use Kubernetes as a cluster manager?
Thanks again for the great video :)
@MarcLamberti · a month ago
Lots of great remarks here :)
1) It depends. Some imports naturally take time (like numpy). My recommendation is to make a local import if you use it in only one task; otherwise you can make it top-level IF it's not a heavy import (like numpy, again). You can verify that with the parsing time.
2) Absolutely. Here the data is very small, so it doesn't matter. But if you have large data, then it's better to offload the work to Spark. That being said, you should always know what you're doing. It's OK to pass/process data through Airflow tasks as long as you know the limitations of your architecture/resources.
3) In prod, I would deploy Spark as a Kubernetes application and use Spark Connect, as explained here: airflow.apache.org/docs/apache-airflow-providers-apache-spark/stable/decorators/pyspark.html#spark-connect <3
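To make points 1) and 2) concrete, a rough sketch assuming the apache-spark provider's @task.pyspark decorator from the doc Marc links; the connection id, input file, and output path are hypothetical. Heavy imports stay local to the task, and the result is written to storage rather than returned through XCom:

```python
from airflow.decorators import task


@task.pyspark(conn_id="spark_default")  # hypothetical Spark connection id
def aggregate_sales(spark):
    # The decorator injects the SparkSession; heavy imports (e.g. pandas)
    # would go here, locally, so DAG parsing stays fast.
    df = spark.read.csv("include/data/sales.csv", header=True)  # hypothetical input
    result = df.groupBy("country").count()
    # Offload the result to storage instead of passing it through XCom.
    result.write.mode("overwrite").parquet("s3a://my-bucket/aggregates/")  # hypothetical path
```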
@sseeer-r5d · a month ago
Nice; add YARN + Tez and HDFS and it will be a great e2e project.
@fixxer47 · a month ago
I don't have a memberOf attribute among the user's internal attributes. How did you configure your LDAP so it creates the memberOf attribute?
@JohnGunvaldson · a month ago
I think this is going to work great. I have a hundred or so PHP files to schedule and call from Airflow leveraging the SSHOperator, and these calls differ only in the name (and sometimes location) of the PHP file; all other tasks are the same (this solves a legacy issue left over from years past). I can control scheduling in batches and handle any specifics with properties.
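A minimal sketch of that pattern, assuming the SSH provider's SSHOperator; the DAG id, connection id, and script paths are hypothetical. One task is generated per PHP file, with everything but the command identical:

```python
import pendulum
from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

# Hypothetical script list; in practice this could come from config.
PHP_SCRIPTS = [
    "/var/www/jobs/sync_users.php",
    "/var/www/jobs/update_orders.php",
]

with DAG(
    dag_id="legacy_php_jobs",  # hypothetical DAG id
    start_date=pendulum.datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    for path in PHP_SCRIPTS:
        name = path.rsplit("/", 1)[-1].removesuffix(".php")
        # One SSHOperator per script; only the command differs.
        SSHOperator(
            task_id=f"run_{name}",
            ssh_conn_id="legacy_server",  # hypothetical Airflow connection
            command=f"php {path}",
        )
```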
@KarlKloppenborg · 2 months ago
Loving your energy!
@ruksanabegum1895 · 2 months ago
You are simply "The Best" <3
@starlord9109 · 2 months ago
Hey Marc, now please tell me how to stop the repetition of tasks, since no one answered in the comments 😂
@abhipoornisnsadvocate5158 · 2 months ago
Hi Marc, the Glue job finishes in 5 minutes but it still shows as running in my local Astro Airflow instance.
@JyotiMalik-f5u · 2 months ago
I am unable to restart the Airflow instance after 29:30. Can anyone please help?
@mohd_fawad · 2 months ago
Same here. @MarcLamberti, would you be able to help us?
@sewingwithcope · 2 months ago
Wow, this was such a great tutorial! Very easy to understand and I can't wait to try it myself!