What's new in Apache Airflow 2.10? (15:50)
What's new in Apache Airflow 2.9? (13:24)
What's new in Apache Airflow 2.8? (9:58)
The New Airflow UI Explained (15:18)
What's new in Apache Airflow 2.7? (12:55)
What's new in Apache Airflow 2.6? (9:53)
What's new in Apache Airflow 2.5? (7:52)
Comments
@minkx09 · a day ago
How do you integrate LDAP in this version? I couldn't find it 🥲
@kenamia9136 · 4 days ago
I prefer the first method. Easier to read.
@eduardoamfm · 5 days ago
Awesome class, thanks!
@ashimov1970 · 5 days ago
Excellent content, terrible accent.
@marc2223 · 5 days ago
It’s the 🥐 accent, the more you hear it the more you love it ❤
@ashimov1970 · 5 days ago
@marc2223 Too hard to understand him.
@RemiAdeleye · 6 days ago
This was a great tutorial: easy to conceptualize, with a great example that was easy to follow. It's a great feeling to look at your data team's Airflow GitHub repo and actually understand what the DAGs are doing when you didn't know how Airflow worked an hour ago!
@rahul-x3y6l · 7 days ago
Hi, I am trying to run dbt transforms in Airflow. My table raw_invoices is already created in the US location, but I am still getting the error below in Airflow: "Error in model dim_customer (models/transform/dim_customer.sql): 404 Not found: Table airflow-project-446808:retail.raw_invoices was not found in location US". Can you suggest a fix?
@tina_rahi07 · 7 days ago
Thank you so much for this amazing video! It was very helpful and well explained. However, I encountered the following error when running my code: airflow.exceptions.AirflowException: Cannot execute: spark-submit --master spark://spark-master:7077 --conf spark.master=spark://spark-master:7077 --num-executors 2 --executor-memory 2g --driver-memory 2g --name arrow-spark --verbose --deploy-mode client /usr/local/airflow/include/scripts/read.py /usr/local/airflow/include/1.csv. Error code is: 1. The only addition in my code compared to yours is a dropna call. Could you please help me figure out what might be causing this issue? Thank you in advance!
@darshansolanki-u6q · 8 days ago
Backfill cannot happen when the DAG is not scheduled? In that case the airflow dags trigger command works, but it triggers the entire DAG rather than specific tasks.
@premsaikarampudi3944 · 10 days ago
@MarcLamberti: Can you help me fix this? The Docker image I am using is 12.6.0-python-3.10.
Broken DAG: [/usr/local/airflow/dags/retail.py] Traceback (most recent call last):
File "/usr/local/airflow/dags/retail.py", line 12, in <module>: from include.dbt.cosmos_config import DBT_PROJECT_CONFIG, DBT_CONFIG
File "/usr/local/airflow/include/dbt/cosmos_config.py", line 1, in <module>: from cosmos.config import ProfileConfig, ProjectConfig
ModuleNotFoundError: No module named 'cosmos'
@premsaikarampudi3944 · 10 days ago
If you could re-run the project with the latest Docker image and Python 3.10, I guess most of everyone's questions would be addressed.
@premsaikarampudi3944 · 11 days ago
Can't Astro just incorporate a utf-8 encoding parameter as part of its load_file method?
@PhuocJoshDang · 11 days ago
Can I please ask why we didn't just use dbt for the quality checks? What are the advantages of Soda over dbt?
@oreallyseven · 11 days ago
Thanks for the great video! By the way, is there any possibility to push and pull from git within Airflow? Could you please share any useful resource to go through? Thanks!
@dalicodes · 12 days ago
Done, thank you Marc. How can I query the table using Athena? I tried cataloging the data using Glue, but crawlers can only detect general purpose buckets.
@afzalandthedreams · 13 days ago
Hi Marc! I noticed you used the BashOperator in the example pipeline @13:16. Could you please explain why you chose BashOperator over PythonOperator in this case? Also, how do you typically decide which operator to use when designing your DAGs? Thank you again for sharing your knowledge... I really appreciate your efforts!
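For readers wondering the same thing, here is a rough, hedged sketch of the contrast (the DAG id, script path, and callable are hypothetical): BashOperator shells out to a command, while PythonOperator runs a callable in-process, so the choice usually follows where the logic already lives.

```python
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def _extract():
    print("extracting...")  # hypothetical Python-native logic


with DAG(
    dag_id="operator_contrast",  # hypothetical DAG id
    start_date=pendulum.datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    # Shelling out: natural for CLI tools and pre-existing scripts.
    run_script = BashOperator(
        task_id="run_script",
        bash_command="python /opt/scripts/extract.py",  # hypothetical path
    )
    # In-process callable: natural when the logic is plain Python.
    run_callable = PythonOperator(task_id="run_callable", python_callable=_extract)
```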
@azriabdrahim7040 · 13 days ago
Thanks Marc for the great video 🎉🎉 Sorry, I am a beginner; just out of curiosity, how much does AWS charge you for this project?
@gaurav54001 · 14 days ago
Hi Marc, after selecting the Python interpreter my autocomplete still doesn't work. I have tried switching between all the interpreters available on my machine. Any suggestions?
@eduardofarias87 · 15 days ago
I imagined this solution differently: I thought it would just be a matter of putting or updating a file in the dataset directory, without the need for a "producer" DAG. I have an external system, not a DAG, that unloads the data into a directory. For me, the TriggerDagRunOperator already fulfills exactly the same role as Datasets. Help me understand if that's not the case. Thanks for the content Marc!
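For contrast with Datasets, a minimal sketch of the explicit TriggerDagRunOperator approach the comment mentions; the producer has to hard-code the consumer's dag_id (the "consumer" id below is hypothetical):

```python
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Placed inside the producer DAG: explicitly triggers the downstream DAG.
# Unlike Datasets, the producer must know the consumer's dag_id up front.
trigger_consumer = TriggerDagRunOperator(
    task_id="trigger_consumer",
    trigger_dag_id="consumer",  # hypothetical downstream DAG id
)
```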
@XiaomingZhou-t5f · 15 days ago
Could you please share how to integrate Airflow v2 with a custom OAuth2 provider (not in the supported list from the Airflow docs)?
@hussainshaik368 · 22 days ago
Hey Marc, I'm a beginner and I loved your content. It would be really helpful if you could number the videos in this playlist to make it easier to follow the right order.
@MarcLamberti · 16 days ago
Good idea
@gtonizuka9990 · 26 days ago
Thanks for the video! Is Soda essential, and how much does it cost? I couldn't find the price on their website.
@hicks_dwaynes · 27 days ago
Thanks a lot for the lesson :)
@palodans1217 · a month ago
Good job.
@BuiTung-nw5xw · a month ago
Sir, when I inspect the network, the Airflow containers and the other containers are on different networks, even though I followed the instructions. Could you please post a video to help fix this?
@MarcLamberti · a month ago
Look at the Notion page in the description.
@BuiTung-nw5xw · a month ago
@MarcLamberti Although I followed the instructions on the Notion page, they still use different networks.
@jacopomalatesta1522 · 22 days ago
@BuiTung-nw5xw Have you A) explicitly defined the network in the Docker Compose file and B) passed the network to each service?
@zaminhassnain7570 · a month ago
Great, this worked fine locally. But when I deployed the same code to Airflow on a cloud.astronomer instance using the Celery executor, the DAG failed with this exception: airflow.exceptions.AirflowException: Cannot execute: spark-submit --master spark://spark-master:7077 --name arrow-spark --verbose --deploy-mode client ./include/scripts/read.py. Error code is: 1
@hicks_dwaynes · a month ago
Thanks for the lesson. I have a problem: TypeError: _choose_best_model() missing 1 required positional argument: 'ti'. Can you help? :) The returned values are 8, 9 and 6.
@MarcLamberti · a month ago
Put ti=None in the parameters.
@hicks_dwaynes · 28 days ago
@MarcLamberti Thanks, I changed Airflow 1.0 to 2.0 and it worked :)
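For anyone hitting the same TypeError: a minimal sketch of the fix Marc suggests above, assuming the callable from the tutorial (the task ids may differ in your project). In Airflow 2.x, declaring ti as a parameter is enough for the scheduler to inject the TaskInstance automatically:

```python
# Minimal sketch of the suggested fix. Declaring ti with a default of None
# lets Airflow 2.x inject the TaskInstance into the callable at run time.
def _choose_best_model(ti=None):
    accuracies = ti.xcom_pull(task_ids=[
        "training_model_A",  # task ids assumed from the tutorial
        "training_model_B",
        "training_model_C",
    ])
    if max(accuracies) > 8:
        return "accurate"
    return "inaccurate"
```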
@xavierangaleu · a month ago
I am trying to upload the CSV file to a Google Cloud bucket but I'm getting a TIMEOUT ERROR. I tried uploading a lighter file and it worked. Can someone please help? @MarcLamberti
@NamrataAnandkumarHurukadli · a month ago
Hey Marc, I have deferred tasks in my DAG, and when I use the backfill command the tasks don't go into the deferred state; it just marks all the tasks as successful... Does the backfill command not work with deferrable operators?
@LoveDarkChocolate · a month ago
I would like to ask about a DAG scheduled on a Dataset or Dataset Alias. I assume such a DAG will only run if those datasets are modified by another DAG, not by an event that modifies those files outside the Airflow architecture?
@MarcLamberti · a month ago
Correct
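To illustrate the point above, a minimal producer/consumer sketch assuming the Airflow 2.x Dataset API; the file URI and DAG names are hypothetical. The consumer runs only when a task with the matching outlet completes, never when the file changes outside Airflow:

```python
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

my_file = Dataset("/tmp/my_file.csv")  # hypothetical dataset URI


@dag(start_date=pendulum.datetime(2024, 1, 1), schedule="@daily", catchup=False)
def producer():
    @task(outlets=[my_file])
    def update_file():
        # Completing this task is what marks the Dataset as updated;
        # editing /tmp/my_file.csv by hand would not trigger the consumer.
        ...

    update_file()


@dag(start_date=pendulum.datetime(2024, 1, 1), schedule=[my_file], catchup=False)
def consumer():
    @task
    def process():
        ...

    process()


producer()
consumer()
```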
@tomasoon · a month ago
Is it possible to run PySpark jobs from Azure Synapse (for example, a notebook)?
@MarcLamberti · a month ago
I haven't tried yet.
@forgottenvy · a month ago
How do I get the compose file with the latest images? Is the doc updated regularly, or should I grab the latest images manually from Docker Hub and customize my compose file?
@MarcLamberti · a month ago
Use the doc, yes :)
@prakashraj4264 · a month ago
I have one doubt, sir: what is the point of installing Airflow? Any specific reason? And in what kind of scenario would we use it?
@Clement-r7t · a month ago
Great tutorial, thanks Marc!
@ashutosh4912 · a month ago
Now waiting for a video where we deploy multi-node Spark and Airflow workers on Kubernetes in production 🎉
@MrBestshorty · a month ago
Hey, great video! Could you show us how to set this up using Kubernetes and Helm?
@MarcLamberti · a month ago
Tell me more :)
@RushikeshMule-p4b · a month ago
If you're using Apprise, is it possible to tag people on Teams?
@MarcLamberti · a month ago
I think so
@ashutosh4912 · a month ago
They are in the same network; I verified using the inspect command 🍖 and I'm able to ping spark-master from the Airflow webserver as well.
@MarcLamberti · a month ago
here we go :)
@ashutosh4912 · a month ago
Getting this error: pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
@Ferdi-g1q · a month ago
Hey Marc, great video! It really does seem easy to play around with Spark and Airflow in a dev environment with your explanation, but I have 3 questions:
(1) At the end of the video, when you present the TaskFlow API way of triggering a PySpark job, you import pandas at top level, which is not really recommended if I'm not wrong (it should be imported within the task's function)? I get that you need it since your function returns a pandas DataFrame :)
(2) Same context as before: your function returns a pandas DataFrame, so it will be returned as an XCom, right? XComs used to be for passing small amounts of data between tasks, but I think that limitation is gone now that you can set an XCom backend to a cloud object store (GCS, S3, ABFS). But do we really want to do that, given that Spark is designed to deal with big amounts of data (although it can make sense for small amounts, as showcased in your example)? To rephrase my question: wouldn't it be better to code your PySpark job so that the result is written to cloud object storage instead of passing it around through XCom?
(3) In a higher environment (prod), would you still deploy Spark as a Kubernetes application (Docker in your case, but it is a container), or would it be better to leverage the fact that PySpark applications can use Kubernetes as a cluster manager?
Thanks again for the great video :)
@MarcLamberti · a month ago
Lots of great remarks here :)
1) It depends. Some imports naturally take time (like numpy). My recommendation is to make a local import if you use it in only one task; otherwise you can make it top-level IF it's not a heavy import (like numpy, again). You can verify that with the parsing time.
2) Absolutely. Here the data is very small, so it doesn't matter. But if you have large data, then it's better to offload the work to Spark. That being said, you should always know what you're doing. It's OK to pass/process data through Airflow tasks as long as you know the limitations of your architecture/resources.
3) In prod, I would deploy Spark as a Kubernetes application and use Spark Connect, as explained here: airflow.apache.org/docs/apache-airflow-providers-apache-spark/stable/decorators/pyspark.html#spark-connect <3
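To make points 1) and 2) concrete, a rough sketch assuming the apache-spark provider's @task.pyspark decorator from the doc Marc links; the connection id, input file, and output path are hypothetical. Heavy imports stay local to the task, and the result is written to storage rather than returned through XCom:

```python
from airflow.decorators import task


@task.pyspark(conn_id="spark_default")  # hypothetical Spark connection id
def aggregate_sales(spark):
    # The decorator injects the SparkSession; heavy imports (e.g. pandas)
    # would go here, locally, so DAG parsing stays fast.
    df = spark.read.csv("include/data/sales.csv", header=True)  # hypothetical input
    result = df.groupBy("country").count()
    # Offload the result to storage instead of passing it through XCom.
    result.write.mode("overwrite").parquet("s3a://my-bucket/aggregates/")  # hypothetical path
```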
@sseeer-r5d · a month ago
Nice; add YARN + Tez and HDFS and it will be a great e2e project.
@fixxer47 · a month ago
I don't have a memberOf attribute among the user's internal attributes. How did you configure your LDAP so it creates the memberOf attribute?
@JohnGunvaldson · a month ago
I think this is going to work great. I have a hundred or so PHP files to schedule and call from Airflow leveraging the SSHOperator, and these calls differ only in the name (and sometimes location) of the PHP file; all other tasks are the same (this solves a legacy issue left over from years past). I can control scheduling in batches and handle any specifics with properties.
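A minimal sketch of that pattern, assuming the SSH provider's SSHOperator; the DAG id, connection id, and script paths are hypothetical. One task is generated per PHP file, with everything but the command identical:

```python
import pendulum
from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

# Hypothetical script list; in practice this could come from config.
PHP_SCRIPTS = [
    "/var/www/jobs/sync_users.php",
    "/var/www/jobs/update_orders.php",
]

with DAG(
    dag_id="legacy_php_jobs",  # hypothetical DAG id
    start_date=pendulum.datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    for path in PHP_SCRIPTS:
        name = path.rsplit("/", 1)[-1].removesuffix(".php")
        # One SSHOperator per script; only the command differs.
        SSHOperator(
            task_id=f"run_{name}",
            ssh_conn_id="legacy_server",  # hypothetical Airflow connection
            command=f"php {path}",
        )
```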
@KarlKloppenborg · 2 months ago
Loving your energy!
@ruksanabegum1895 · 2 months ago
You are simply "The Best" <3
@starlord9109 · 2 months ago
Hey Marc, now please tell me how to stop the repetition of tasks, since no one answered in the comments 😂
@abhipoornisnsadvocate5158 · 2 months ago
Hi Marc, the Glue job finishes in 5 minutes but it still shows as running in my local Astro Airflow instance.
@JyotiMalik-f5u · 2 months ago
I am unable to restart the Airflow instance after 29:30. Can anyone please help?
@mohd_fawad · 2 months ago
Same here. @MarcLamberti, would you be able to help us?
@sewingwithcope · 2 months ago
Wow, this was such a great tutorial! Very easy to understand and I can't wait to try it myself!