Data engineer interview question | Process 100 GB of data in Spark | Number of Executors

Views: 28,160

MANISH KUMAR

1 year ago

In this video, we discuss how to process 100 GB of data in Spark. This is one of the well-known questions asked in interviews for data engineering roles.
Directly connect with me on:- topmate.io/manish_kumar25
For more queries, reach out to me on the social media handles below.
My Second Channel -- / @competitivegyan1
Interview series Playlist:- • Interview Questions an...
Follow me on LinkedIn:- / manish-kumar-373b86176
Follow Me On Instagram:- / competitive_gyan1
Follow me on Facebook:- / manish12340

Comments: 46
@neelbanerjee7875 1 year ago
You are actually filling the gap. Many thanks, man! I request you to make this kind of interactive video, especially on the topics below:
1. Repartition with a real-time scenario: how to determine the repartition size depending on data size and cluster size
2. Key salting method: a practical/real-time case with a coding example
3. Data serialization in Spark and how it helps with optimization
4. Choosing a file type for different scenarios (Parquet/JSON/ORC)
5. DAG analysis
6. Accumulators, with real-time use cases
7. Cache and persist: when to use which
8. Garbage collection tuning
9. Real-time coding issues faced by data engineers, and debugging
10. Version control system for Databricks notebooks
11. Real-time production implementation of big data projects
12. How to perform unit testing for Databricks notebooks
Thanks in advance. ❤❤
@sohelsayyad5572 1 year ago
Thank you, Manish Bhai. You understand what matters to aspiring data engineers and what they need to know in depth. I really appreciate this.
@sandeepsoni6628 1 year ago
Best channel for data engineers 👍👍
@pratikj2538 1 year ago
Mihir is just bluffing and telling generic stories. Manish did a good job by interrupting him. Keep it up.
@sauravsawant2818 6 months ago
I don't think so. He is saying the right thing.
@manojkumar-kq1nc 1 year ago
Awesome
@electricalsir 9 months ago
Thank you ❤
@sanooosai 4 months ago
Thank you, sir
@mranaljadhav8259 1 year ago
Thanks, Manish, for this informative session. I already had this question in mind and had been searching for an answer for a few days. Finally, today you made this video. It's like magic. Thanks a lot, man. Please make more videos on questions like this that are asked in interviews.
@manish_kumar_1 1 year ago
Sure
@asktostranger8296 10 months ago
Brother, are there remote jobs in the data engineering field too? US remote jobs?
@siddharthguliyani4032 1 year ago
This channel's reach is about to catch fire very soon; it is going to rise very fast. Mark my words.
@manish_kumar_1 1 year ago
Thank you so much for the lovely comments
@rh334 10 months ago
The interviewer asked me about processing PETABYTES of data. Can you explain how to deal with that scenario?
@manish_kumar_1 10 months ago
For that, a lot of things have to be considered. First of all, what is their cluster size? How long it will take and how we proceed will depend on that.
@rishav144 1 year ago
Great video. Can you make a video on what projects a fresher should build for a data engineer role?
@jparmar1 1 year ago
Brother, check out the Data Engineering Zoomcamp.
@rishav144 1 year ago
@jparmar1 Are you talking about the YT channel DataTalksClub? Is it available there, or is it some other resource?
@jparmar1 1 year ago
@rishav144 Yes, the same. They have a GitHub repository, so follow the steps shown there; the tutorials are on YouTube.
@mnaveenvamshi3651 1 year ago
Thank you very much, Manish, for your guidance; it is really helpful. I am a new subscriber. My query: I am a good Python developer and know intermediate SQL, but I am very new to Spark. I have learnt the Spark basics. Can you suggest a course where I can learn real-world Spark questions like this one on processing 100 GB of data? Are there any resources on Udemy or elsewhere? I want to change career from Python developer to data engineer. Thanks in advance.
@manish_kumar_1 1 year ago
Learn the hard way. Don't look for shortcuts. Download a dataset from Kaggle and work on that data by yourself. Transform it and write it back to HDFS or any cloud storage bucket.
@mnaveenvamshi3651 1 year ago
@manish_kumar_1 Thanks for the suggestions, I will surely follow them.
@jaychavhan6267 1 year ago
How much knowledge of data structures is needed for a data engineer, and how should one learn it? Please make a video on this topic.
@manish_kumar_1 1 year ago
Sure
@RaviKumar-gv2wo 10 months ago
I believe this is not the correct approach. To process 100 GB of data, the number of blocks created would be 800 (at 128 MB per block). We would need more executors to run in parallel. If we rely on the resources explained, it will take much more time than expected.
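The figure of 800 in the comment above follows from dividing the input size by the block size. Here is a minimal sketch of that arithmetic, assuming a 128 MB partition size (the default value of `spark.sql.files.maxPartitionBytes`, and a common HDFS block size) and 1 GB = 1024 MB. The function name `num_partitions` is hypothetical and used only for illustration.

```python
import math


def num_partitions(data_size_gb: float, partition_size_mb: int = 128) -> int:
    """Estimate how many input partitions a dataset of the given size splits into.

    Assumes 1 GB = 1024 MB and evenly splittable input files.
    """
    data_size_mb = data_size_gb * 1024
    # Round up: a final, smaller chunk still needs its own partition.
    return math.ceil(data_size_mb / partition_size_mb)


if __name__ == "__main__":
    print(num_partitions(100))
```

In practice the real partition count also depends on the file format, compression and number of files, so this is only a first estimate.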
@FlashGG1 1 year ago
Hi @MANISH KUMAR. In Mihir's first approach (4:03), he is considering 5 executors with 2 cores each and 10 GB of memory per executor. In this case there are 5 * 2 = 10 cores in total (10 parallel processes) and 10 GB * 5 = 50 GB of total memory. I think 5 executors with this configuration will not handle 100 GB of data; they can only handle 50 GB. Correct me if I am wrong. The calculation mentioned at the end (10:53), 5 to 6 executors with 4 cores each and 15 GB of RAM per executor, seems fine.
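As a rough check of the totals quoted in this comment, the sketch below multiplies out the two configurations. The function name `cluster_totals` is hypothetical, and the calculation ignores per-executor memory overhead and the driver, so it is an illustration of the arithmetic and not a sizing rule.

```python
def cluster_totals(num_executors: int, cores_per_executor: int,
                   memory_gb_per_executor: int) -> tuple[int, int]:
    """Return (total cores, total executor memory in GB) for a configuration."""
    total_cores = num_executors * cores_per_executor
    total_memory_gb = num_executors * memory_gb_per_executor
    return total_cores, total_memory_gb


if __name__ == "__main__":
    # First approach from the comment: 5 executors, 2 cores, 10 GB each.
    print(cluster_totals(5, 2, 10))
    # Configuration mentioned at the end: 5 to 6 executors, 4 cores, 15 GB each.
    print(cluster_totals(5, 4, 15))
    print(cluster_totals(6, 4, 15))
```

Total memory below the dataset size does not by itself mean the job fails, because Spark processes the input partition by partition instead of loading all of it at once.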
@manish_kumar_1 1 year ago
You don't need to load the entire data into memory in one go. Suppose you had 30 GB of memory left in the cluster; do you think it would fail? Actually it won't; it will just take more time to process. There is a trade-off between memory utilization and time. If you load the entire data in one go, it will take less time to process, but if you use less memory, it will take more time.
@fit1801 10 months ago
The 10 GB of each executor is divided into several parts. Approximately 300 MB is reserved memory, and the remaining memory is divided in a 40:60 ratio. 40 percent of the remainder is used as user memory, to store user-defined variables or data. 60 percent is used as Spark memory, which is divided 50:50 into storage memory and execution memory. Roughly calculated, 3 GB is used as execution memory, and it is divided between the cores, of which there are 2, so each core can handle 1.5 GB of data. However, we will be handling partitions of 128 MB. We have 10 cores in total, so 10 partitions can be processed in parallel. There are 800 partitions in total, so it takes 80 cycles to process our 100 GB of data. Suppose one cycle takes 10 seconds; then the total time would be 10 * 80 = 800 seconds. It's a rough calculation and may vary according to resource availability. Hope it's helpful 🙏 I learnt this from Sumit Mittal.
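The breakdown in this comment can be sketched as a small calculation. The constants mirror the figures quoted above: 300 MB reserved, a 60 percent Spark share (the default of `spark.memory.fraction`) and a 50:50 storage/execution split (the default of `spark.memory.storageFraction`). The function names are hypothetical, 1 GB is taken as 1024 MB, and the 10 seconds per cycle is the commenter's assumption, not a measurement.

```python
import math

RESERVED_MB = 300        # fixed reserved memory per executor
MEMORY_FRACTION = 0.6    # share of the remainder used as Spark memory
STORAGE_FRACTION = 0.5   # share of Spark memory set aside for storage


def execution_memory_per_core_mb(executor_memory_gb: float, cores: int) -> float:
    """Approximate execution memory available to each core of one executor."""
    usable_mb = executor_memory_gb * 1024 - RESERVED_MB
    spark_mb = usable_mb * MEMORY_FRACTION            # the rest is user memory
    execution_mb = spark_mb * (1 - STORAGE_FRACTION)  # the rest is storage memory
    return execution_mb / cores


def estimate_cycles(total_partitions: int, total_cores: int,
                    seconds_per_cycle: float) -> tuple[int, float]:
    """Return (number of cycles, total seconds) if every core runs one partition per cycle."""
    cycles = math.ceil(total_partitions / total_cores)
    return cycles, cycles * seconds_per_cycle


if __name__ == "__main__":
    # 10 GB executor with 2 cores: roughly 1.5 GB of execution memory per core.
    print(round(execution_memory_per_core_mb(10, 2)))
    # 800 partitions on 10 cores at an assumed 10 seconds per cycle.
    print(estimate_cycles(800, 10, 10))
```

Spark's unified memory manager lets execution borrow from the storage region when it is free, so the per-core figure is a conservative estimate.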
@ShivaKumar-dj8bj 3 months ago
@fit1801 Before even reaching the end of your comment, I guessed that you had enrolled in Sumit Sir's course ;) Nice explanation, bro. I too had the same kind of explanation for this scenario.
@surajitpaul6956 3 months ago
@fit1801 Is Sumit Mittal's course good? I have 8+ years of experience in non-tech and want to transition to DE. Will the course help me make the transition? Thanks in advance. 😊
@kyou1502 7 months ago
Memory is not calculated by guesswork, which is what he is doing in the interview. There is a proper formula to calculate the number of executors, cores and memory.
@manish_kumar_1 7 months ago
Can you please write that formula here so that everyone can benefit?
@anupamakamepalli5285 6 months ago
Can someone post the content in English too?
@raviyadav-dt1tb 7 months ago
I am unable to understand this question
@ameygoesgaming8793 5 months ago
I don't think he answered the question correctly. He was confident before you raised some doubts, but I don't think this answer will work in interviews, because he kept moving his answers around resources, business and so on, which was not what was asked. @Manish bhai, I would like to know your approach to this question, with calculations.
@anjibabumakkena 1 year ago
Explain in English
@girishnigade8115 1 year ago
Can one become a data engineer after one year of study, sir?
@manish_kumar_1 1 year ago
Absolutely
@user-dv1ry5cs7e 4 months ago
Answering roughly and just bluffing
@adityakishan1 7 months ago
The guy is rambling more than he is giving a solution
@manish7897 1 year ago
Where can I learn Spark architecture in this depth? Any resources or books you would suggest?
@manish_kumar_1 1 year ago
The book Spark: The Definitive Guide
@karansinghrajpurohit3500 1 year ago
Brother, what is your Instagram or Gmail?
@manish_kumar_1 1 year ago
It is in the video description