How to Submit a PySpark Script to a Spark Cluster Using Airflow!

13,371 views

The Data Guy

1 day ago

Comments: 64
@JP-zz6ql • 1 year ago
Love this new format!
@thedataguygeorge • 1 year ago
Thanks! Good to know!
@vladimirkotoff4260 • 1 year ago
Thank you so much for the vid! Both the format and the subject are awesome!
@thedataguygeorge • 1 year ago
Thanks so much Vlad!
@SoumilShah • 10 months ago
Where are you running the Spark cluster? Is it running locally? Why did we use spark://master in the connection? I'm confused how the Docker container connects to a Spark cluster running on your laptop.
@thedataguygeorge • 9 months ago
Yes, I'm running Spark locally! That's why I'm able to connect to it that way!
@gautamrishi5391 • 7 months ago
@thedataguygeorge How can I check which port to use, the way you did? Maybe it's different for me.
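For reference, a couple of ways to find the port (a sketch assuming a Dockerized standalone Spark cluster; the container name is a placeholder). A standalone master defaults to port 7077 for submissions and 8080 for its web UI, and the master's startup log prints the full spark://host:port URL:

```shell
# List containers and the ports each one publishes
docker ps --format '{{.Names}}\t{{.Ports}}'

# The master logs its submission URL on startup (container name is hypothetical)
docker logs <your-spark-master-container> 2>&1 | grep -i "Starting Spark master"
```

The spark://… URL shown in the master's web UI header is the same one to use in the Airflow connection.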
@gpmahendra7129 • 1 year ago
Hi Data Guy, the video is so insightful. I'm just stuck on understanding the Host & Port details you provided while creating the connection in the Airflow web interface. How do I configure those when bringing up the Airflow container to submit a PySpark script? Kindly help me with this one.
@thedataguygeorge • 11 months ago
Thanks! What specifically are you having trouble with? Is the Host/Port combo for the Spark connection not working?
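As an alternative to the web interface, the same connection can be created from the Airflow CLI. This is a sketch, assuming a standalone master reachable as spark://spark-master on the default port; the connection id and host name are placeholders that depend on your Docker network:

```shell
airflow connections add spark_default \
    --conn-type spark \
    --conn-host spark://spark-master \
    --conn-port 7077
```

Note the Spark connection type only appears once the apache-airflow-providers-apache-spark package is installed in the Airflow environment.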
@BinPham-x1k • 1 month ago
You da goat my guy
@thedataguygeorge • 1 month ago
Thanks big dawg!
@johnr52dev • 1 year ago
Thank you. Please show how to set up a Spark cluster.
@thedataguygeorge • 11 months ago
Would you like to run OSS Spark locally or cloud based?
@KathirVel-fb2sf • 11 months ago
@thedataguygeorge Kubernetes-based Spark job submission is best for real production grade; there are very few videos on dynamic Spark/Airflow/Kubernetes job submission.
@anikethdeshpande8336 • 5 months ago
@thedataguygeorge Locally
@sandeepnarwal8782 • 9 months ago
Best tutorial!
@thedataguygeorge • 9 months ago
Thanks Sandeep!
@anikethdeshpande8336 • 5 months ago
This worked well! Thank you so much 😃
@thedataguygeorge • 5 months ago
Glad it helped!
@pyramidofmerlinii4368 • 9 months ago
I followed you but got one error: JAVA_HOME is not set. So I have to install Java in my container?
@thedataguygeorge • 9 months ago
In your Spark cluster, yes!
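If you're building the image yourself, installing Java and setting JAVA_HOME might look like this Dockerfile fragment. This is a sketch assuming a Debian/Ubuntu base image; the package name and JVM path are assumptions and vary by distribution and CPU architecture, so verify them in your own image:

```dockerfile
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-17-jre-headless \
    && rm -rf /var/lib/apt/lists/*
# Path assumed for Debian/Ubuntu on amd64; check with `ls /usr/lib/jvm` in your image
ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
ENV PATH="$JAVA_HOME/bin:$PATH"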
@QuangHungLe • 5 months ago
Nice vid bro, but if I have Airflow installed in Docker, how can I get the SparkSubmitOperator package?
@thedataguygeorge • 4 months ago
Add it to your requirements file and restart your environment.
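Concretely, SparkSubmitOperator ships in the Apache Spark provider package, so the requirements file would contain:

```text
# requirements.txt
apache-airflow-providers-apache-spark
# pyspark is also needed inside the Airflow container if spark-submit runs from it
pyspark
```

Then restart the environment (`astro dev restart` on Astro, or rebuild and restart your Docker image) so the new packages are installed.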
@hotgoosezion547 • 3 months ago
Is it not possible to assign a task to each step of the ETL pipeline? Meaning, have the first task create the SparkSession, the second extract the dataframe, the third drop columns, and the fourth write the output? That would make it so much clearer in Airflow, and if it failed you could see exactly where. I hope I'm making sense; it's what I'm currently trying to do, but I can't find a solution.
@thedataguygeorge • 2 months ago
You definitely could; the issue is just referencing the same Spark session across tasks, which is why I chose to put it in a single task, since you can read the Spark logs in Airflow to see the failure point.
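For reference, the single-task pattern described above might look like this sketch. The DAG id, connection id, and script path are placeholders, not taken from the video:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="pyspark_etl",            # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # One task runs the whole ETL script; step-level failures show up in this
    # task's Spark logs rather than as separate Airflow task failures.
    submit_etl = SparkSubmitOperator(
        task_id="submit_etl",
        conn_id="spark_default",            # the Spark connection from the UI
        application="include/etl_script.py",  # hypothetical script path
    )
```

Splitting steps into separate tasks would mean each task creating its own SparkSession (or persisting intermediate data between tasks), since a session object can't be passed between Airflow tasks.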
@ashwanitripathi5035 • 2 months ago
I have Airflow in containers and Spark running locally. I want to trigger my existing Python scripts, which I was previously running locally with the spark-submit command. Now, when I try to run them via an Airflow DAG, I can see in the Spark master UI that the job ID is created and assigned to a worker, but the worker isn't able to execute the script further. Can you please help with what the issue could be?
@thedataguygeorge • 2 months ago
Do you have an error code you can share?
@TheGuyWith_Scram411 • 8 months ago
Can we connect Databricks Community Edition to Airflow?
@thedataguygeorge • 8 months ago
I believe so, but I'm not 100% sure since I know CE is pretty limited!
@kenskyschulz1979 • 11 months ago
I'm thinking about the consequences of using the PythonOperator directly, calling PySpark code within it, and filling in all the ins and outs within a given task. Do you think this works?
@thedataguygeorge • 11 months ago
Tentatively yes, but what do you mean by "fill all the ins and outs of a given task"?
@kenskyschulz1979 • 11 months ago
@thedataguygeorge Thanks for the reply. I meant the input args and the in/out data referenced by a DAG / connected tasks.
@Airaselven • 8 months ago
Thank you very much for the useful video.
@thedataguygeorge • 8 months ago
Thanks so much for the appreciation!
@SoumilShah • 1 year ago
Where can I get the code?
@thedataguygeorge • 1 year ago
Haven't hosted it yet, but will do!
@naveenkonda395 • 6 months ago
How can I connect to 2 remote servers at the same time when I have to run multiple jobs?
@sildistruttore • 5 months ago
How am I supposed to send data directly into Snowflake? I mean, if I'm processing petabytes, how am I supposed to retrieve them on my local machine and then push them into Snowflake?
@thedataguygeorge • 5 months ago
You could chunk it by ingesting smaller pieces of the larger data set, or spin up a remote Spark cluster with a ton of compute!
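To illustrate the chunking idea, here is a toy sketch in plain Python; in practice the "chunks" would be partitions, date ranges, or file batches that Spark writes to Snowflake one at a time, so the full data set never lands on your local machine:

```python
def iter_chunks(items, chunk_size):
    """Yield successive chunk_size-sized slices of a list."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

# e.g. push each slice of a huge key list downstream instead of all at once
batches = list(iter_chunks(list(range(10)), 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Better still, Spark can write to Snowflake directly via the Spark–Snowflake connector, so the data flows cluster-to-warehouse rather than through the machine running Airflow.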
@nansambassensalo3065 • 2 months ago
How about a short part 2 on running a Spark Docker image with Airflow? I'm not on Astro... I'm having all sorts of problems, including the Spark modules not being recognized despite being installed and listed in the requirements file, and Spark not appearing as an option in my connections dropdown in Airflow...
@thedataguygeorge • 2 months ago
Oh, very interesting, mind sending over the requirements file you're using?
@nansambassensalo3065 • 2 months ago
Thank you so much for your reply. I solved both issues! I forget exactly how I got Spark to start showing up in Airflow, but a Stack Overflow suggestion worked. I got the modules to work by installing them on the Jupyter notebook (Docker image); then it would work in my VS Code connected to the Jupyter notebook server. The BIG issue is connecting AWS credentials in a Spark Docker environment. But I see you have a new video (which I already watched) on connecting to AWS in Airflow, which could be a nice workaround, since credentials go in the connection tab and don't have to be pulled from .aws/credentials. I only have 6 more days of my Snowflake free trial and I need to make a project happen before then or I'll never get a job! See you in your other comment sections!
@alielzahaby3315 • 7 months ago
Wow man, that astro dev init killed me; I've been making the Docker image myself every time, lol. Do you have any video to help with the Astro package and its tips and tricks?
@thedataguygeorge • 7 months ago
Even better than a video, check out this guide to getting started with Astro!
@yaar_or • 5 months ago
@thedataguygeorge Hi there! It seems the link for the guide wasn't attached to your comment for some reason... Can you please re-attach it? Thanks! 🙏
@ManishJindalmanisism • 1 year ago
Hi, can you add another episode describing how to submit PySpark to a cloud-provider cluster like Databricks, AWS EMR, GCP Dataproc, etc.? What I've seen is that, just like Astronomer for Airflow, people use clusters on cloud providers rather than installing/running Spark on bare metal or VMs.
@thedataguygeorge • 1 year ago
Yeah, for sure! I have an Azure Databricks instance available, would that work for ya?
@gautamrishi5391 • 7 months ago
Getting this error: Error when trying to pre-import module 'airflow.providers.apache.spark.operators.spark_submit' found in test_dag.py: No module named 'airflow.providers.apache'
@deepanshurathore9661 • 3 months ago
Hi, I'm new to Spark and Airflow. I'm trying to submit my Spark job via Airflow, but I get "spark-submit not configured to wait for completion, exiting spark-submit JVM". It submits the job and the job finishes, but it doesn't wait for it...
@mohammedmahaboobsharieff9857 • 2 months ago
Facing the same issue. Did you resolve it? If so, how?
@thedataguygeorge • 2 months ago
You need to enable the wait-for-completion flag so it knows to monitor the job after triggering!
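That log line typically comes from Spark standalone cluster-mode deployment. If that matches your setup (an assumption, not something from the video), one option is to pass Spark's `spark.standalone.submit.waitAppCompletion` property through the operator's `conf`; the task id, connection id, and application path below are placeholders:

```python
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

submit_job = SparkSubmitOperator(
    task_id="submit_job",
    conn_id="spark_default",
    application="include/etl_script.py",
    # Keep the client alive until the application completes so
    # Airflow can track success/failure instead of exiting early.
    conf={"spark.standalone.submit.waitAppCompletion": "true"},
)
```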
@SonuKumar-fn1gn • 4 months ago
Please use Kubernetes for all these environments, like Spark and Airflow, and submit the job. I'm requesting you 🙏
@thedataguygeorge • 4 months ago
These are all running on Docker with the Kubernetes engine! Airflow runs on Docker and also runs PySpark in the containers to process the data, but you could easily switch the host to a Spark container and deploy your Spark engine there.
@vittalshanbag2892 • 1 year ago
Hi Data Guy, please make a knowledge-transfer session on Spark Structured Streaming with Kafka integration. Thank you 😅
@thedataguygeorge • 1 year ago
Definitely, I'll put it on the schedule!
@adeyinkaadegbenro9645 • 1 year ago
Thank you for the video. Kindly assist with this error: Cannot execute: spark-submit --master yarn --num-executors 1 --total-executor-cores 1 --executor-cores 1 --executor-memory 2g --driver-memory 2g --name arrow-spark --queue root.default sparksubmit_basic.py. Error code is: 1.
@thedataguygeorge • 11 months ago
Could you please give me a little more context?
@79texx • 1 year ago
Hey man, as a fellow data analyst I really enjoy your videos, but you gotta stop those premieres... It's really annoying to see an interesting video in my sub feed and then everything is just a premiere for like 5 days... I mean, of course you can use this feature once in a while, but please not 5 times a week; it's really frustrating.
@thedataguygeorge • 1 year ago
Ok, good to know, will stop! Sorry about that, I just did it so people could see the upcoming schedule, but didn't realize it was so prominent in your feed!
@79texx • 1 year ago
@thedataguygeorge Thanks man!
@viralstingray5590 • 1 year ago
I totally agree! Love the content, and the 10-minute, short, no-BS format is just amazing. But the scheduled thing is very annoying.
@thedataguygeorge • 1 year ago
No problem, thanks for letting me know!
@pattekhaashwak8167 • 8 months ago
Hi, how do we set up a connection between two Docker containers so that Airflow can trigger a task in the Spark container? Can you make a video on it?
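One common pattern, sketched below under the assumption that both run via docker-compose (image tags and service names are placeholders): put the containers on a shared Docker network, and Airflow can then reach the master by its service name.

```yaml
# docker-compose.yml fragment (hypothetical service names and image tags)
services:
  spark-master:
    image: bitnami/spark:latest
    networks: [shared]
  airflow:
    image: apache/airflow:2.9.0
    networks: [shared]
networks:
  shared:
```

With this layout, the Airflow Spark connection host would be spark://spark-master rather than localhost, since each container resolves the other by service name on the shared network.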