Check out the Big Data course details here: trendytech.in/?referrer=youtube_bd22
@gebrilyoussef6851 (a month ago)
Sumit, you are the master trainer of Big Data. Thank you so much for all the efforts you made.
@anuragdubey5898 (2 years ago)
Very informative session. I learnt a lot and cleared my doubts as well. The easy, simplified explanation makes this one of the best videos on using AWS for big data. Thanks for the session.
@NaturalPro100 (4 years ago)
This really cleared some basics I needed for starting Spark on AWS. The content and explanation are to the point. Thanks for sharing, Sumit.
@NIHAL960 (4 years ago)
S3: Amazon's storage service.
On-demand instance: available on demand.
Spot instance: available at a steep discount on a temporary basis; can be taken back with a 2-minute warning.
Reserved instance: available at a discount (compared to on-demand) for a long commitment, such as a year.
Types of nodes:
1. Master node: manages the cluster; a single EC2 instance.
2. Core node: each cluster has one or more; hosts data and runs tasks.
3. Task node: can only run tasks, not store data; needed if the application is compute-heavy. Spot instances are a good choice for these.
Clusters:
1. A transient cluster terminates automatically.
2. A long-running cluster requires manual termination.
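The cluster layout summarized in this comment can be sketched as a boto3-style `run_job_flow` request. This is a hedged sketch only: the cluster name, EMR release label, and instance types are placeholders, not details from the video.

```python
# Sketch of a transient EMR cluster with the node types described above.
# All names, release labels, and instance types are placeholders.
request = {
    "Name": "transient-wordcount-cluster",
    "ReleaseLabel": "emr-6.15.0",          # placeholder EMR release
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            # Master node: manages the cluster, a single instance.
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Core nodes: host HDFS data and run tasks.
            {"InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Task nodes: run tasks only, store no HDFS data, so spot
            # instances (cheap, reclaimable with a 2-minute warning) fit well.
            {"InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Transient cluster: terminates automatically when its steps finish.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
}

# On a real AWS account this would be submitted with:
# import boto3
# boto3.client("emr").run_job_flow(**request)
```

The actual API call is left commented out so the sketch stays self-contained; setting `KeepJobFlowAliveWhenNoSteps` to `True` instead would give the long-running cluster that needs manual termination.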
@sumitmittal07 (4 years ago)
Nice summarization, thanks much.
@sampaar (3 years ago)
Amazing presentation. Better than many of the udemy courses that I have come across.
@mdabdulmujeebmalik422 (4 years ago)
Excellent video on AWS and how to run a Spark job on AWS. Amazing! Thank you so much for the video, and kudos to the instructor.
@udaynayak4788 (2 years ago)
One of the best informative sessions, thank you so much for sharing.
@datoalavista581 (2 years ago)
Thank you for sharing
@VallabhGhodkeB (2 years ago)
Top stuff this is. Just got started, way to go!
@amitbajpai6209 (4 years ago)
Best video to get an overall understanding of AWS EMR. It was really helpful 😊 Kudos to the instructor! Liked 👍 and subscribed. Hoping for more such videos.
@sumitmittal07 (4 years ago)
Glad it was helpful!
@kashamp9388 (2 years ago)
Best session ever. Concise.
@sumitmittal07 (2 years ago)
Glad you are liking my teaching :)
@divakarluffy3773 (2 years ago)
One video resolved all my doubts, thanks!
@sumitmittal07 (2 years ago)
Happy to hear that your doubts are resolved
@laxmisuresh (3 years ago)
Very meaningful presentation. Explained at the right pace and with proper content.
@swaroupbanikk4444 (2 years ago)
BEST
@gauravrai4398 (4 years ago)
Very lucid and concise explanation. A job well done!
@sumitmittal07 (4 years ago)
Thank you Gaurav
@ririraman7 (2 years ago)
beautiful
@vairammoorthy6665 (4 years ago)
best tutorial for AWS EMR
@sridharreddy9605 (4 years ago)
Very clear explanation, thank you for your time.
@subratakr5353 (4 years ago)
Thanks for the lovely presentation! Had 2 questions though: 1) When you say you are running code on the master, do you mean the namenode of the cluster? Where is the namenode for this cluster? 2) Since the data is stored in S3, does EMR copy it to HDFS so that Spark eventually reads from HDFS? In which HDFS path is the data stored?
@vijeandran (3 years ago)
Answer 1: Here the namenode, driver node, edge node, and master node are all the same. Answer 2: As soon as you create one master and 2 slave nodes, the slave nodes' hard disks behave like HDFS, and Spark fetches files from these disks and runs them in the slave nodes' memory. The 1 master and 2 slaves act as the processing unit inside AWS; S3, also part of AWS, is the storage unit. When you want to process data, you copy the data file from the storage part (S3) to the processing part (HDFS), where HDFS lives on the 2 slave nodes you created. Then you can run the Spark jar file.
@sohailhosseini2266 (2 years ago)
Thanks for the video!
@pankajnakil6173 (3 years ago)
Very useful and good explanation.
@dineshughade6570 (a year ago)
Nice explanation. Can we have a PDF of this video?
@RaviKumar-oy5jq (4 years ago)
Excellent session.
@ramprasadbandari8195 (4 years ago)
Excellent explanation and very useful Info!!
@sumitmittal07 (4 years ago)
Glad you liked it
@puneetnaik8719 (4 years ago)
Great explanation, sir. Thanks for the video.
@shilparathore8849 (4 years ago)
Very well explained, thanks for sharing.
@Dyslexic_Neuron (4 years ago)
Very good explanation. Can you make a video on Spark shuffle and its issues?
@vijeandran (3 years ago)
Neat explanation, and a very informative video.
@AparnaBL (4 years ago)
Moreover, HDFS data is ephemeral, right? If you want the data to persist even after the cluster is terminated, you can use S3.
@sumitmittal07 (4 years ago)
Absolutely. You can see the same thing mentioned around the 24th minute of the session.
@AparnaBL (4 years ago)
@sumitmittal07 yeah, at 22:36
@gaurav1825 (4 years ago)
Sir, please give some guidance on AWS EMR with Apache Flink and Hudi.
@dharmeswaranparamasivam5498 (4 years ago)
Very good session. Thanks for doing this.
@keyursolanki (11 months ago)
Will the EMR cluster have access to S3 by default?
@BinduReddy-n1q (a year ago)
How do I save the wordcount output in HDFS and also in S3?
@sancharighosh8204 (3 years ago)
Can you make some tutorials on Databricks?
@amulmgr (3 years ago)
Thank you very much for the video.
@fzgarcia (4 years ago)
Thank you, nice presentation!
@diptyojha174 (4 years ago)
Very nice explanation
@sumitmittal07 (4 years ago)
Thank you, Dipty.
@piby1802 (4 years ago)
Really nice presentation! Thank you!
@sumitmittal07 (4 years ago)
Glad you liked it!
@Naveen-xi7os (4 years ago)
It was an awesome session.
@anuj3922 (3 years ago)
EMR clusters are billed at an hourly rate. If I don't use the cluster, do I still have to pay for it? Say I build it just for learning purposes and come back to it as my learning progresses?
@techtransform (3 years ago)
Excellent Explanation :)
@fzgarcia (4 years ago)
Do you know if I can run an EMR cluster like this on a free-tier account? Even if I can only run t3.micro in the free tier, can I manually create a cluster with a minimum of 3 t3.micro nodes, or more? Thanks.
@rrjishan (3 years ago)
As we say, on AWS we can shut down the cluster after computation and the data will be saved in S3. So are clusters only responsible for computing data? Isn't data also stored in clusters? Getting a bit confused, please clear it up.
@priyabhatia4107 (4 years ago)
Great content!!
@SpiritOfIndiaaa (4 years ago)
Thanks, but why is HDFS data gone when the cluster shuts down? Since HDFS is persistent while the cluster is up, it would be automatically available, right?
@vijeandran (3 years ago)
When you start the cluster you are creating three instances: one for the master and two for the datanodes. These nodes exist only for that session, because they are virtual; once you terminate the cluster, the 1 master and 2 slave instances are killed, and the data present in HDFS is deleted with them. As Sumit said, if you want the data to stay in HDFS you have to keep your cluster running continuously, and Amazon will bill you more for that continuous usage.
@rajsekhargada9212 (2 years ago)
I think S3 is not distributed storage
@sumitmittal07 (2 years ago)
It's an object store, but in this scenario it's a replacement for distributed storage and serves a similar use case.
@satishj801 (2 years ago)
At 1:14:30 he downloaded the jar from S3, but he didn't copy it to HDFS the way he copied book-data.txt, and he mentioned he is running the jar from HDFS, not from S3. Yet it's the same step as at 1:06:52. I'm a bit confused at that point. If someone has understood, please drop a reply.
@user-co8oc1rm5w (2 years ago)
He kept the jar file in the root path of the cluster, but the file to be processed he kept in HDFS, i.e. the directory he created named '/data'. That's why he mentioned he is running the job against HDFS: the input file, book-data.txt, was downloaded into HDFS instead of being read from S3. He then changed the file location in the Scala code, recreated the jar, placed that jar in S3 first, downloaded it from S3 to the master node, and executed the Spark job, which processed book-data.txt from HDFS, not from S3.