Thank you so much for the vid! Both the format and the subject are awesome!
@thedataguygeorge 1 year ago
Thanks so much Vlad!
@SoumilShah 10 months ago
Where are you running the Spark cluster? Is it running locally? Why did we use spark://master in the connection? I'm confused how the Docker container is connecting to a Spark cluster running on your laptop.
@thedataguygeorge 9 months ago
Yes, I'm running Spark locally! That's why I'm able to connect to it that way!
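For context, a sketch of the Airflow connection this implies (values illustrative); the container reaches Spark through whatever hostname resolves to the master from inside the container:

```
Conn Id:   spark_default
Conn Type: Spark
Host:      spark://host.docker.internal   # on Docker Desktop this resolves to the laptop;
                                          # use the service name (e.g. spark://spark-master)
                                          # if the master runs in another container
Port:      7077                           # default Spark standalone master RPC port
```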
@gautamrishi5391 7 months ago
@@thedataguygeorge How can I check which port to use, like you did? It might be different for me.
@gpmahendra7129 1 year ago
Hi Data Guy, the video is so insightful. I am just stuck on understanding the Host & Port details you provided while creating the connection in the Airflow web interface. How do I configure those while bringing up the Airflow container to submit a PySpark script? Kindly help me with this one.
@thedataguygeorge 11 months ago
Thanks! What specifically are you having trouble with? Is the Host/Port combo for the Spark connection not working?
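For anyone configuring this while bringing the container up rather than in the UI: Airflow 2.3+ also accepts JSON-encoded connections from environment variables. A hedged docker-compose sketch (host and port are illustrative):

```yaml
# environment section of the Airflow services in docker-compose.yml
environment:
  AIRFLOW_CONN_SPARK_DEFAULT: '{"conn_type": "spark", "host": "spark://spark-master", "port": 7077}'
```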
@BinPham-x1k 1 month ago
you da goat my guy
@thedataguygeorge 1 month ago
Thanks big dawg!
@johnr52dev 1 year ago
Thank you. Please show how to set up a Spark cluster.
@thedataguygeorge 11 months ago
Would you like OSS Spark locally or cloud-based?
@KathirVel-fb2sf 11 months ago
@@thedataguygeorge Kubernetes-based Spark job submission is best for real production-grade setups; there are very few videos on dynamic Spark/Airflow/Kubernetes job submission.
@anikethdeshpande8336 5 months ago
@@thedataguygeorge locally
@sandeepnarwal8782 9 months ago
Best Tutorial
@thedataguygeorge 9 months ago
Thanks Sandeep!
@anikethdeshpande8336 5 months ago
This worked well! Thank you so much 😃
@thedataguygeorge 5 months ago
Glad it helped!
@pyramidofmerlinii4368 9 months ago
I followed you but have one error: JAVA_HOME is not set. So I have to install Java in my container?
@thedataguygeorge 9 months ago
In your Spark cluster, yes!
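For reference, whichever container actually runs spark-submit needs a JRE. A hedged Dockerfile sketch for a Debian-based Airflow image (base tag, package name, and JAVA_HOME path all vary by image and architecture):

```dockerfile
FROM quay.io/astronomer/astro-runtime:10.5.0
USER root
# Install a JRE so spark-submit / PySpark can find Java
RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-17-jre-headless \
    && apt-get clean
# Verify the exact path inside your container before pinning this
ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
USER astro
```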
@QuangHungLe 5 months ago
Nice vid bro, but if I have my Airflow installed in Docker, how can I get the SparkSubmitOperator package?
@thedataguygeorge 4 months ago
Add it to your requirements file and restart your environment.
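For reference, the operator lives in the Apache Spark provider package:

```
# requirements.txt
apache-airflow-providers-apache-spark
```

After rebuilding, it imports as:

```python
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
```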
@hotgoosezion547 3 months ago
Is it not possible to assign a task to each step of the ETL pipeline? Meaning, have the first task create the SparkSession, the second extract the dataframe, the third drop columns, the fourth write to output? This would make it so much clearer in Airflow, and if it failed you could see exactly where. I hope I'm making sense; it's what I'm currently trying to do but I can't find a solution.
@thedataguygeorge 2 months ago
You definitely could. The issue might just be referencing the same Spark session across tasks, which is why I chose to put it all in a single task; you can just read the Spark logs in Airflow to see the failure point.
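To make that concrete, a minimal single-task sketch (paths and column names are hypothetical; assumes Airflow 2.4+ TaskFlow and PySpark available on the worker):

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def spark_etl():
    @task
    def run_pipeline():
        # The whole pipeline lives in one task because a SparkSession
        # can't be shared across tasks, which may run in separate processes.
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.appName("etl").getOrCreate()
        df = spark.read.csv("/data/input.csv", header=True)  # extract
        df = df.drop("unwanted_col")                         # transform
        df.write.mode("overwrite").parquet("/data/output")   # load
        spark.stop()

    run_pipeline()

spark_etl()
```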
@ashwanitripathi5035 2 months ago
I have Airflow in containers and Spark running locally. I want to trigger my existing Python scripts, which I was previously running locally with the spark-submit command, but now when I try to run them from an Airflow DAG, execution stalls: I checked the Spark master UI and the job ID is created and assigned to a worker, but the worker is not able to execute the script further. Can you please help with what the issue could be?
@thedataguygeorge 2 months ago
Do you have an error code you can share?
@TheGuyWith_Scram411 8 months ago
Can we connect Databricks Community Edition to Airflow?
@thedataguygeorge 8 months ago
I believe so but not 100% sure since I know CE is pretty limited!
@kenskyschulz1979 11 months ago
I am wondering about any consequences of using the PythonOperator directly, calling PySpark code within it, and filling all the ins and outs within a given task. Do you think this works?
@thedataguygeorge 11 months ago
Tentatively yes, but what do you mean by fill all the ins and outs of a given task?
@kenskyschulz1979 11 months ago
@@thedataguygeorge Thanks for the reply. I meant the input args and in/out data referenced by a DAG / connected tasks.
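A hedged sketch of that pattern, with the "ins and outs" passed as op_kwargs (paths are hypothetical; PySpark and Java must be available on the Airflow worker):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def etl(input_path, output_path, drop_cols):
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("inline-etl").getOrCreate()
    spark.read.parquet(input_path).drop(*drop_cols) \
         .write.mode("overwrite").parquet(output_path)
    spark.stop()

with DAG("pyspark_inline", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False):
    PythonOperator(
        task_id="etl",
        python_callable=etl,
        op_kwargs={"input_path": "/data/in", "output_path": "/data/out",
                   "drop_cols": ["tmp"]},
    )
```

The main consequence to watch: the Spark driver then runs inside the Airflow worker process, so that worker needs the memory and dependencies the job requires.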
@Airaselven 8 months ago
Thank you very much for the useful video.
@thedataguygeorge 8 months ago
Thanks so much for the appreciation!
@SoumilShah 1 year ago
Where can I get the code?
@thedataguygeorge 1 year ago
Hadn't hosted it yet but will do!
@naveenkonda395 6 months ago
How can I connect to 2 remote servers at the same time when I have to run multiple jobs?
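One common pattern (conn ids and paths here are hypothetical): define one Airflow connection per remote cluster and point each SparkSubmitOperator at its own conn_id; the tasks can then run in parallel:

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG("multi_cluster", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False):
    SparkSubmitOperator(task_id="job_a", conn_id="spark_cluster_a",
                        application="/opt/jobs/job_a.py")
    SparkSubmitOperator(task_id="job_b", conn_id="spark_cluster_b",
                        application="/opt/jobs/job_b.py")
```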
@sildistruttore 5 months ago
How am I supposed to send data directly into Snowflake? I mean, if I'm processing petabytes, how am I supposed to retrieve them on my local machine and then push them into Snowflake?
@thedataguygeorge 5 months ago
You could chunk it by ingesting smaller pieces of the larger data set, or spin up a remote Spark cluster with a ton of compute!
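On the "directly" part: with the Spark-Snowflake connector the executors write straight to Snowflake, so the data never lands on your local machine. A hedged sketch (assumes the connector jar is on the cluster; all account values are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/huge_dataset")  # hypothetical source

sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "user",
    "sfPassword": "********",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH",
}

(df.write
   .format("snowflake")            # distributed write, no local round-trip
   .options(**sf_options)
   .option("dbtable", "MY_TABLE")
   .mode("append")
   .save())
```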
@nansambassensalo3065 2 months ago
How about a short part 2 running a Spark Docker image on Airflow? I'm not on Astro... I'm having all sorts of problems, including the Spark modules not being recognized despite being installed and present in the requirements file, and Spark not being an option in my connections dropdown in Airflow...
@thedataguygeorge 2 months ago
Oh very interesting, mind sending over the requirements file you're using?
@nansambassensalo3065 2 months ago
Thank you so much for your reply. I solved both issues! I forget how I got Spark to start showing up in Airflow, but a Stack Overflow suggestion worked. I got the modules to work by installing them on the Jupyter notebook (Docker image); then it would work in my VS Code connected to the Jupyter notebook server. The BIG issue is connecting AWS credentials in a Spark Docker environment. But I see you have a new video (that I already watched) on connecting to AWS in Airflow, which could be a nice workaround, since credentials go in the connection tab and don't have to be pulled from .aws/credentials. I only have 6 more days of my Snowflake free trial and I need to make a project happen before then or I'll never get a job! See you in your other comment sections!
@alielzahaby3315 7 months ago
Wow man, that astro dev init killed me. I've been making the Docker image myself every time lol. Is there any video you have to help with the Astro package and its tips and tricks?
@thedataguygeorge 7 months ago
Even better than a video, check out this guide to getting started with Astro!
@yaar_or 5 months ago
@@thedataguygeorge Hi there! It seems the link for the guide wasn't attached to your comment for some reason... Can you please re-attach it? Thanks! 🙏
@ManishJindalmanisism 1 year ago
Hi, can you add another episode describing how to submit PySpark on a cloud provider cluster like Databricks, AWS EMR, GCP Dataproc, etc.? Because from what I have seen, just like Astronomer for Airflow, people use clusters on cloud providers rather than installing/running Spark on bare metal or VMs.
@thedataguygeorge 1 year ago
Yeah, for sure! I have an Azure Databricks instance available, would that work for ya?
@gautamrishi5391 7 months ago
Getting this error: Error when trying to pre-import module 'airflow.providers.apache.spark.operators.spark_submit' found in test_dag.py: No module named 'airflow.providers.apache'
@deepanshurathore9661 3 months ago
Hi, I'm new to Spark and Airflow. I'm trying to submit my Spark job via Airflow but getting this: "spark-submit not configured to wait for completion, exiting spark-submit JVM". It submits the job and the job finishes, but it doesn't wait for completion...
@mohammedmahaboobsharieff9857 2 months ago
Facing the same issue. Did you resolve it? If so, how did you fix it?
@thedataguygeorge 2 months ago
You need to enable the wait-for-completion flag so it knows to monitor the job after triggering!
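For reference, a hedged sketch of passing that through SparkSubmitOperator's conf; the property shown applies to a standalone cluster in cluster deploy mode (on YARN the equivalent is spark.yarn.submit.waitAppCompletion):

```python
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

submit = SparkSubmitOperator(
    task_id="submit_job",
    conn_id="spark_default",
    application="/opt/jobs/etl.py",  # hypothetical path
    conf={"spark.standalone.submit.waitAppCompletion": "true"},
)
```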
@SonuKumar-fn1gn 4 months ago
Please use Kubernetes for all these environments, like Spark and Airflow, and submit the job. I'm requesting you 🙏🙏🙏🙏🙏🙏🙏
@thedataguygeorge 4 months ago
These are all running on Docker w/ Kubernetes engine! Airflow runs on Docker and also runs PySpark in the containers to process the data, but you could easily switch the host to a Spark container and deploy your Spark engine there.
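For reference, a hedged docker-compose sketch of that "separate Spark container" option (image tag and service names are illustrative); the Airflow connection host would then be spark://spark-master on port 7077:

```yaml
services:
  spark-master:
    image: bitnami/spark:3.5
    environment:
      - SPARK_MODE=master
    ports:
      - "7077:7077"   # master RPC, what the Airflow connection targets
      - "8080:8080"   # master web UI
  spark-worker:
    image: bitnami/spark:3.5
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
```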
@vittalshanbag2892 1 year ago
Hi Data Guy, please make a knowledge transfer session on Spark Structured Streaming with Kafka integration. Thank you 😅
@thedataguygeorge 1 year ago
Definitely, I'll put it in the schedule!
@adeyinkaadegbenro9645 1 year ago
Thank you for the video. Kindly assist with this error: Cannot execute: spark-submit --master yarn --num-executors 1 --total-executor-cores 1 --executor-cores 1 --executor-memory 2g --driver-memory 2g --name arrow-spark --queue root.default sparksubmit_basic.py. Error code is: 1.
@thedataguygeorge 11 months ago
Could you please give me a little more context?
@79texx 1 year ago
Hey man, as a fellow data analyst I really enjoy your videos, but you gotta stop those premieres… it's really annoying to see an interesting video in my sub feed and then it's just a premiere for like 5 days out… I mean, of course you can use this feature once in a while, but please not 5 times a week; it's really frustrating.
@thedataguygeorge 1 year ago
Ok good to know, will stop! Sorry about that, I just did it so people could see the upcoming schedule but didn't realize it was so prominent in your feed!
@79texx 1 year ago
@@thedataguygeorge Thanks man!
@viralstingray5590 1 year ago
I totally agree! Love the content, and the 10-min, short, no-BS format is just amazing. But the scheduled thing is very annoying.
@thedataguygeorge 1 year ago
No problem, thanks for letting me know!
@pattekhaashwak8167 8 months ago
Hi, how do we set up a connection between two Docker containers so that Airflow can trigger a task in the Spark container? Can you make a video on it?