Spark Join | Sort vs Shuffle | Spark Interview Question | Lec-13

Views: 30,863

MANISH KUMAR

1 day ago

Comments: 94

@jatinchugh6752 1 year ago

Brother, I have never found this much information even in a paid course. Thank you so much.

@DivyaSharma-ux4mo 7 months ago

This is so true. I admire your hard work!

@hankeepankee5361 4 months ago

18:55 Tradeoff between CPU usage (shuffle sort-merge join) and memory usage (shuffle hash join).

@hankeepankee5361 6 months ago

Good work, bro. In a hash join, building the hash table takes O(N), where N is the number of unique values in the column. So a hash join takes O(N), versus a sort join, which takes O(N log N).
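A minimal Python sketch of the comparison in the comment above, run on plain lists of `(key, value)` rows standing in for one shuffled partition from each table. The function names `hash_join` and `sort_merge_join` are hypothetical and this is not Spark's implementation; it only illustrates why one strategy spends memory on a hash table while the other spends CPU on sorting.

```python
from collections import defaultdict


def hash_join(left, right):
    """Join (key, value) rows by building an in-memory hash table on `right`."""
    table = defaultdict(list)
    for key, value in right:            # build phase: one pass, costs memory
        table[key].append(value)
    joined = []
    for key, value in left:             # probe phase: one pass, O(1) lookups
        for match in table.get(key, []):
            joined.append((key, value, match))
    return joined


def sort_merge_join(left, right):
    """Join (key, value) rows by sorting both sides and merging them."""
    left = sorted(left, key=lambda row: row[0])     # O(n log n), costs CPU
    right = sorted(right, key=lambda row: row[0])
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:j_end]:
                    joined.append((left_key, left[i][1], right_value))
                i += 1
            j = j_end
    return joined
```

Both functions return the same set of joined rows; only the order of the output and the resource being spent differ.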

@AnimeOverload15 1 month ago

I have not come across a more detailed video than this in my entire career.

@jagannathsahoo8297 28 days ago

Excellent explanation, and free of cost too ☺

@mrinalraj4801 7 months ago

Great in-depth concepts. I really enjoyed it. You are a genius. Thanks a lot. Keep up the great work you're doing for the community.

@anuragdwivedi1804 4 days ago

Bro, I have never seen a video as detailed as this.

@mprtech315 8 months ago

I follow both of your Spark series. They are really valuable to me 🎉 Thanks.

@sambitmohanty1758 1 year ago

Hi Manish, your content is amazing, keep it up.

@manojkaransingh5848 1 year ago

Amazing video, brother!

@udittiwari8420 10 months ago

Thank you sir for the detailed series! Your clear explanations have been incredibly helpful in my learning journey.

@prabhatsingh7391 1 year ago

Hi Manish Bhaiya, here we perform the join on key = id, which is an integer, so we can see that id % 200 is the partition number the data goes to. But if the key is a string, how does it happen? In that case, does Spark internally create a key for each column?

@manish_kumar_1 1 year ago

Murmur3 hashing is applied for strings. If you want to know more, check how Murmur3 works.
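A rough Python sketch of the idea in the reply above: hash the key first, then take the remainder by the number of shuffle partitions. This assumes `zlib.crc32` as a stand-in hash, purely so the example runs with the standard library; Spark uses Murmur3, so the partition numbers printed here will not match what Spark produces. The name `partition_for` is hypothetical.

```python
import zlib

NUM_SHUFFLE_PARTITIONS = 200  # the 200 partitions discussed in the video


def partition_for(key, num_partitions=NUM_SHUFFLE_PARTITIONS):
    """Map a join key of any type to a partition index in [0, num_partitions)."""
    # Stand-in hash (assumption): CRC32 of the key's text form, not Murmur3.
    hashed = zlib.crc32(str(key).encode("utf-8"))
    return hashed % num_partitions


for name in ["manish", "ramesh", "manish"]:
    print(name, "->", partition_for(name))
```

The property that matters for the join is that equal keys always produce the same partition index, whichever DataFrame they come from.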

@divyanshusingh3966 2 months ago

Thank you bro for providing quality content for free

@voice6905 4 months ago

Countless salutations to you, Guru ji! Please bring playlists on Apache Airflow and Apache Kafka. I'm sure they would be the best resource on KZbin.

@venkatmunna8918 1 month ago

Thank you so much for the detailed explanation. However, I am confused about one point. Could you please clarify my question? Let's say we don't have the colour coding as blue and red. Now, executor-1 has 200 partitions and executor-2 also has 200 partitions. If we consider id = 102, then 102/200 = 102. How does Spark determine whether record 102 should go to executor-1 or executor-2? This is discussed at the 10:56 timestamp. Thanks!

@rohitsharma-mg7hd 1 month ago

Brother, first tell me how 102/200 = 102. Do you know maths?

@shivaog007 1 month ago

@rohitsharma-mg7hd We are taking the remainder.

@rohitsharma-mg7hd 1 month ago

@shivaog007 Yes, brother, he explained it. I had misunderstood, haha.

@Daily_Code_Challenge 13 days ago

Executor 1 takes 1 to 100 and executor 2 takes 101 to 200. The colours show the two different tables (df1, df2) present on each executor.

@adityakvs3529 1 month ago

Bro, in a shuffle hash join, is the hash table created for each individual partition or for the entire DataFrame?

@younevano 19 days ago

Partition level

@rajnandinipadhy2533 1 year ago

So if the recruiter asks in an interview what kind of join we are performing, should we say that we first need to analyze the data to decide which join is appropriate, or should we say that Spark will do the optimization internally?

@manish_kumar_1 1 year ago

You can talk about the types of join strategy and then compare two of them using some example DataFrame sizes. Explain in detail only if the interviewer asks further.

@anuragdwivedi1804 3 days ago

Bro, can you please tell me which book you follow for Spark?

@abhigyanprakash5603 4 days ago

One doubt: you explained joining on the id column, where you showed that 1/200 gives a remainder of 1, so you placed the record in executor 1, partition P1. Similarly, 109/200 gives a remainder of 109, so you placed the record in P109. But now assume that instead of joining on an integer column, we are joining on a string column (CHAR or VARCHAR datatype). How will this work then?

@anish_bhateja 1 year ago

Hi Manish, Excellent explanation. Thanks for the informative video.

@maurifkhan3029 1 year ago

Is every DataFrame split into 200 partitions before shuffling (based on the number of shuffle partitions set)? Or, if we have 2 DataFrames to join, does each get only 100 shuffle partitions?

@manish_kumar_1 1 year ago

No, it is not that every DataFrame gets 100. Based on the join condition, 200 partitions are created. You can think of them as 200 buckets, where each bucket holds the records with the same join keys. Say df1 has id 5 in box no. 5; then id 5 from df2 will also come to box 5, and box 5 is then self-sufficient to perform the join.
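The bucket picture in the reply above can be sketched in plain Python, assuming integer ids and the simplified `id % 200` rule used in the video. The helper names `shuffle` and `join_by_bucket` are hypothetical, and this is an illustration of the idea rather than how Spark executes it.

```python
from collections import defaultdict

NUM_BUCKETS = 200  # total shuffle partitions, not per DataFrame or per executor


def shuffle(rows, num_buckets=NUM_BUCKETS):
    """Place each (id, value) row in the bucket given by id % num_buckets."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return buckets


def join_by_bucket(df1_rows, df2_rows):
    """Join two row lists by pairing up buckets that have the same number."""
    df1_buckets = shuffle(df1_rows)
    df2_buckets = shuffle(df2_rows)
    joined = []
    # A bucket needs only its own rows from both sides to complete the join.
    for bucket_id in df1_buckets.keys() & df2_buckets.keys():
        for key1, value1 in df1_buckets[bucket_id]:
            for key2, value2 in df2_buckets[bucket_id]:
                if key1 == key2:   # ids 5 and 205 share bucket 5 but differ
                    joined.append((key1, value1, value2))
    return joined
```

Under this rule an id of 200, or any multiple of 200, has remainder 0 and lands in bucket index 0.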

@HanuamnthReddy 10 months ago

Really exemplary 🎉

@khurshidhasankhan4700 1 year ago

Could you please make a video on class and case class? It is being asked in most interviews.

@lakshya1375 1 year ago

Brother, have you covered optimization techniques in any video?

@Nomanqureshi2204 8 months ago

Sir, please make a video on Spark Streaming.

@sachindubey4315 1 year ago

How are these 200 partitions split across the 2 executors? What if there are 3 or 4 executors; how will the 200 partitions be split?

@manish_kumar_1 1 year ago

Then the partitions will be distributed over the 4 executors.

@prathapganesh7021 1 year ago

Hi, you said there are 100 partitions in each executor, but in the demonstration one executor shows both blue and red, which adds up to 200. Could you please elaborate on that? Thank you.

@diksha.chaudhary 1 year ago

Hey Manish, your videos are amazing!! 👏 I love the way you explain each and every detail. Thank you for sharing your knowledge, and keep it up. ✨️

@vikashroy5882 5 months ago

Hi Manish, if we follow the approach mentioned at timestamp 9:28, which partition will the data go to if the remainder is 0? For example, if the id is 200 or a multiple of 200.

@Daily_Code_Challenge 13 days ago

2nd

@prathapganesh7021 1 year ago

Thank you great explanation 🙏

@vishaljoshi1752 1 year ago

Hi Manish, as you said, sorting is O(n log n). What about combining the data? Suppose P1 of one table has id 1 and P2 has id 1, 1; combining them requires two for loops, so the complexity is O(n^2). Is it performed in the same way?

@rishavsharma5732 2 months ago

The hair got messed up.. xD. Nice work, by the way. These videos are really helpful.

@aashishraja-k7u 3 months ago

well explained

@nityabajpai2022 1 year ago

Hi Manish, I have a few questions: 1. We are applying the join on partitions and not on the DataFrames, right? Because the DataFrames are already divided into 4 partitions each. 2. Each join will make 200 new partitions, so if we join RP1 and BP3, will it create 200 more partitions in total? And if we join each partition in Red with every partition in Blue this way, will we have 3200 partitions in total? 3. In the video you said there are not 200 partitions per executor, but the executor does have 200 partitions: 100 for Red and 100 for Blue.

@akhiladevangamath1277 6 months ago

Hey, this is my understanding; my answers might help you. 1. We are applying the join on the DataFrames. Yes, we have 4 partitions for each DataFrame; when we apply the join, those 4 partitions are turned into 200 partitions. 3. There are 200 partitions for each DataFrame, so each executor has 100 partitions of DF1 and 100 partitions of DF2.

@ManishSharma-fi2vr 6 months ago

Thanks Manish Bhai!!

@homeactfun 1 year ago

Amazing video

@ajaypatil1881 1 year ago

Will you please make a video on O(n^2) and what it actually is?

@vishaljoshi1752 1 year ago

Hi Manish, one more question. You say the hash table is in-memory, but as we know, data is first loaded into executor memory and then the logical operations are performed, so in a shuffle sort-merge join everything is also performed in memory. Why, then, do we not call the shuffle sort-merge join in-memory, given that both partitions for the same key must be loaded into memory before the join operation is performed?

@Daily_Code_Challenge 13 days ago

We can't say that, because the shuffle sort-merge join also uses disk, while the shuffle hash join relies heavily on the hash table fitting entirely in memory.

@rohanchoudhary672 10 months ago

Nice video, sir, but please use the modulus operation; division is a little confusing.

@manish_kumar_1 10 months ago

Please look at how the modulus operator works.

@rohanchoudhary672 10 months ago

@manish_kumar_1 You are taking the remainder with 200, after all.

@quiet8691 7 months ago

Your intro reminds me of "Namaskar, main Ravish Kumar" 👍👌🔥

@sreelakshmang7275 6 months ago

How do we find out the size of a DataFrame?

@raajnghani 1 year ago

I am working as an Operations Executive in a warehouse, but I started learning Sqoop, Hive, MySQL, MongoDB, HBase, NiFi, Kafka, Spark, and AWS services. My job is completely non-IT. I cleared two interviews. How do I get an experience certificate for working on the above technologies?

@manish_kumar_1 1 year ago

Tell them that you don't have experience and that you have done all the projects on your own. If you cleared the interview, it means you are a good fit for the role.

@raajnghani 1 year ago

@manish_kumar_1 The recruiter needs experience even after I cleared the L2 discussion.

@adityakvs3529 1 month ago

Bro, which join is better, shuffle hash or sort-merge, and how does Spark decide which join to use?

@KaranSingh-hx8dh 1 year ago

Thank you for explaining.

@Amarjeet-fb3lk 6 months ago

200 partitions will be created, which means 200 cores will also be needed; only then can 200 partitions be created. What if there aren't 200 cores?

@manish_kumar_1 6 months ago

It will still run. The whole point of distributed computing is to run the job even with fewer resources. To understand Spark fully, you will need to watch the whole playlist.

@younevano 18 days ago

It will run in about 200/n rounds, where n = number of cores!
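The arithmetic in the reply above can be written as a one-line Python helper. This is a back-of-the-envelope sketch that assumes one task per partition, one task per core at a time, and tasks of roughly equal duration; the name `task_waves` is hypothetical.

```python
import math


def task_waves(num_partitions, total_cores):
    """Rounds needed to process every partition, one task per core per round."""
    return math.ceil(num_partitions / total_cores)


print(task_waves(200, 8))    # 200 partitions on 8 cores
print(task_waves(200, 16))   # the last round is only partly filled
```

Rounding up matters because a final, partly filled round still has to run.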

@sanooosai 8 months ago

Great, sir, thank you.

@RohitKumar-kd5fj 2 months ago

Will it be division? I think it should be modulus.

@Daily_Code_Challenge 13 days ago

Yes, it is modulus.

@RajeshKumar-re8tj 6 months ago

Which memory pool is used to create the hash table during a shuffle hash join?

@younevano 19 days ago

The memory of the executors that those partitions are on after shuffling?

@mhdakram 4 months ago

An executor can have only one partition at a time. Is this not correct?

@akumar2575. 7 months ago

day 4 done👍

@mayanksinghsoni 1 month ago

what if the id is not numeric?

@mdasif2411 6 months ago

When the salary table is less than 10 MB and the first table is so much larger, how will both have the same number of partitions?

@rameshbayanavenkata1305 1 year ago

Hi Manish, I am following all your videos. Thanks for your great contribution in explaining each and every thing in detail. As you said, records are segregated into partitions according to the remainder we get from dividing the id value by 200 partitions. What if the join is done on the name column instead of id? How does the division take place to segregate the name column into partitions? Please clarify.

@amritranjannayak2705 9 months ago

I also have the same question. Please answer this.

@younevano 18 days ago

@amritranjannayak2705 He replied to another comment with the same question: Murmur3 hashing is used when joining on strings!

@kartikgupta2299 4 months ago

In your video it looks like 200 partitions are created per executor, but you say that it is 200 partitions in total, not 200 per executor. Please explain this part, and also why 200 are created by default.