Spark Join | Sort vs Shuffle | Spark Interview Question | Lec-13

30,863 views

MANISH KUMAR

1 day ago

Comments: 94
@jatinchugh6752 1 year ago
Brother, I have never found this much information even in a paid course. Thank you so much.
@DivyaSharma-ux4mo 7 months ago
This is so true. I admire your hard work!
@hankeepankee5361 4 months ago
18:55 Trade-off between CPU usage (shuffle sort-merge join) and memory usage (shuffle hash join).
@hankeepankee5361 6 months ago
Good work, bro. In a hash join, building the hash table takes O(N), where N is the number of rows being hashed. So the hash join is O(N), versus O(N log N) for the sort-merge join.
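To make the O(N) claim concrete, here is a minimal pure-Python sketch of the build and probe phases of a hash join inside one shuffle partition. It is not Spark's implementation; the function name `shuffle_hash_join_partition` and the sample rows are hypothetical, and it assumes the smaller side fits in memory.

```python
from collections import defaultdict

def shuffle_hash_join_partition(big_rows, small_rows):
    """Join two lists of (key, value) rows that landed in the same partition."""
    # Build phase: one pass over the smaller side, O(len(small_rows)).
    # The whole table is held in memory, which is the cost of this strategy.
    table = defaultdict(list)
    for key, value in small_rows:
        table[key].append(value)

    # Probe phase: one pass over the larger side, O(1) expected per lookup.
    joined = []
    for key, value in big_rows:
        for match in table.get(key, ()):
            joined.append((key, value, match))
    return joined

# Hypothetical sample data for illustration only.
employees = [(1, "Asha"), (2, "Ravi"), (3, "Meena")]
salaries = [(1, 50000), (3, 70000)]
result = shuffle_hash_join_partition(employees, salaries)
print(result)
```

No sorting is needed, which is where the CPU saving over a sort-merge join comes from; the price is the memory held by `table`.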
@AnimeOverload15 1 month ago
I have not come across a more detailed video than this in my entire career.
@jagannathsahoo8297 28 days ago
Excellent explanation, and free of cost too ☺
@mrinalraj4801 7 months ago
Great in-depth concepts. I really enjoyed it. You are a genius. Thanks a lot. Keep up the great work you're doing for the community.
@anuragdwivedi1804 4 days ago
Bro, I have never seen a video as detailed as this.
@mprtech315 8 months ago
I follow both of your Spark series. They are really valuable for me 🎉 Thanks.
@sambitmohanty1758 1 year ago
Hi Manish, your content is amazing. Keep it up.
@manojkaransingh5848 1 year ago
Amazing video, brother!
@udittiwari8420 10 months ago
Thank you sir for the detailed series! Your clear explanations have been incredibly helpful in my learning journey.
@prabhatsingh7391 1 year ago
Hi Manish bhaiya, here we join on key = id, which is an integer, so id % 200 gives the partition number where each record goes. But if the key is a string, how does this work? Does Spark internally create a numeric key for the column in that case?
@manish_kumar_1 1 year ago
Murmur3 hashing is applied for strings. If you want to know more, look up how Murmur3 works.
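The idea in this reply can be sketched as follows: hash the key to an integer first, then take a non-negative remainder by the number of shuffle partitions. This sketch is an assumption-laden illustration, not Spark code. In particular, `zlib.crc32` is only a stand-in for Murmur3 (so the partition numbers will not match what Spark produces), and `stand_in_hash` and `partition_for` are hypothetical names.

```python
import zlib

# Default value of spark.sql.shuffle.partitions.
NUM_SHUFFLE_PARTITIONS = 200

def stand_in_hash(key):
    # Stand-in for Murmur3: any deterministic hash of the key's bytes
    # is enough to show the idea. This is NOT the hash Spark uses.
    return zlib.crc32(str(key).encode("utf-8"))

def partition_for(key, num_partitions=NUM_SHUFFLE_PARTITIONS):
    # Python's % already returns a non-negative result for a positive
    # modulus, which mirrors what a pmod-style function guarantees.
    return stand_in_hash(key) % num_partitions

# The same string key always maps to the same partition, so matching rows
# from both sides of a join end up together.
left_partition = partition_for("manish")
right_partition = partition_for("manish")
print(left_partition == right_partition)
```

The same two steps (hash, then modulus) work for any key type; the `id % 200` picture in the video is the simplified version of this for integer keys.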
@divyanshusingh3966 2 months ago
Thank you, bro, for providing quality content for free.
@voice6905 4 months ago
My deepest respects to you, Guru ji! Please bring playlists on Apache Airflow and Apache Kafka. I'm sure they would be the best resource on KZbin.
@venkatmunna8918 1 month ago
Thank you so much for the detailed explanation. However, I am confused about one point. Let's say we don't have the blue and red color coding. Executor 1 has 200 partitions and executor 2 also has 200 partitions. If we consider id = 102, then 102 / 200 gives 102. How does Spark determine whether record 102 should go to executor 1 or executor 2? This is discussed at the 10:56 timestamp. Thanks!
@rohitsharma-mg7hd 1 month ago
Bro, first tell me how 102 / 200 = 102. Do you know maths?
@shivaog007 1 month ago
@rohitsharma-mg7hd We are taking the remainder.
@rohitsharma-mg7hd 1 month ago
@shivaog007 Yes bro, he explained it. I had misunderstood.
@Daily_Code_Challenge 13 days ago
Executor 1 takes partitions 1 to 100 and executor 2 takes 101 to 200. The colors show the two different tables (df1, df2) created on each executor.
@adityakvs3529 1 month ago
Bro, in a shuffle hash join, is the hash table created for each individual partition or for the entire DataFrame?
@younevano 19 days ago
Partition level.
@rajnandinipadhy2533 1 year ago
If a recruiter asks in an interview what kind of join we are performing, should we say that we first need to analyze the data to decide which join is appropriate, or should we say that Spark will do the optimization internally?
@manish_kumar_1 1 year ago
You can talk about the types of join strategy and then compare two of them, using some example DataFrame sizes. Explain in detail only if the interviewer asks further.
@anuragdwivedi1804 3 days ago
Bro, can you please tell me which book you follow for Spark?
@abhigyanprakash5603 4 days ago
One doubt: you explained joining on the id column, where dividing 1 by 200 gives a remainder of 1, so you placed the record in executor 1, partition P1. Similarly, dividing 109 by 200 gives a remainder of 109, so you placed the record in P109. Now assume that instead of joining on an integer column we are joining on a string column (CHAR or VARCHAR data type). How will this work?
@anish_bhateja 1 year ago
Hi Manish, excellent explanation. Thanks for the informative video.
@maurifkhan3029 1 year ago
Is every DataFrame split into 200 partitions before shuffling (based on the configured number of shuffle partitions)? Or, if we have two DataFrames to join, does each get only 100 shuffle partitions?
@manish_kumar_1 1 year ago
No, it's not that every DataFrame gets 100. Based on the join condition, 200 partitions are created. Think of them as 200 buckets, where each bucket holds the records that share the same join keys. Say df1's id 5 is in box 5; then id 5 from df2 also comes to box 5, and box 5 is self-sufficient to perform the join.
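The bucket picture in this reply can be sketched in plain Python. This is a simplified model under stated assumptions: keys are integers, the bucket is chosen with `key % 200` as in the video (Spark itself hashes the key first), and the names `shuffle`, `df1`, and `df2` are hypothetical lists of (key, value) rows rather than real DataFrames.

```python
from collections import defaultdict

NUM_BUCKETS = 200  # one bucket per shuffle partition

def shuffle(rows, num_buckets=NUM_BUCKETS):
    """Route every (key, value) row to the bucket chosen by its key."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return buckets

# Hypothetical sample rows for illustration only.
df1 = [(5, "df1-row-5"), (205, "df1-row-205"), (7, "df1-row-7")]
df2 = [(5, "df2-row-5"), (7, "df2-row-7"), (9, "df2-row-9")]

# Both sides are shuffled with the SAME rule, so equal keys share a bucket.
buckets1 = shuffle(df1)
buckets2 = shuffle(df2)

joined = []
for bucket_id in sorted(set(buckets1) & set(buckets2)):
    # Each bucket is self-sufficient: it never needs rows from another bucket.
    for key1, value1 in buckets1[bucket_id]:
        for key2, value2 in buckets2[bucket_id]:
            if key1 == key2:
                joined.append((key1, value1, value2))

print(joined)
```

Note that ids 5 and 205 share bucket 5, so a bucket can hold several distinct keys; the key comparison inside the bucket is still required.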
@HanuamnthReddy 10 months ago
Really exemplary 🎉
@khurshidhasankhan4700 1 year ago
Could you please make a video on class and case class? It is being asked in most interviews.
@lakshya1375 1 year ago
Bro, have you covered optimization techniques in any video?
@Nomanqureshi2204 8 months ago
Sir, please make a video on Spark Streaming.
@sachindubey4315 1 year ago
How are these 200 partitions split across 2 executors? What if there are 3 or 4 executors; how will the 200 partitions be split?
@manish_kumar_1 1 year ago
Then the partitions will be distributed over the 4 executors.
@prathapganesh7021 1 year ago
Hi, you said there are 100 partitions in each executor, but in the demonstration the blue and red partitions in one executor add up to 200. Could you please elaborate on that? Thank you.
@diksha.chaudhary 1 year ago
Hey Manish, your videos are amazing!! 👏 I love the way you explain each and every detail. Thank you for sharing your knowledge, and keep it up. ✨️
@vikashroy5882 5 months ago
Hi Manish, if we follow the approach mentioned at timestamp 9:28, which partition will the data go to when the remainder is 0? For example, if the id is 200 or a multiple of 200.
@Daily_Code_Challenge 13 days ago
2nd
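The arithmetic behind this question can be checked directly. The sketch below assumes the simplified `id % 200` rule from the video and partitions numbered from 0 to 199; which executor then hosts a given partition is a separate scheduling matter that this does not model. `partition_index` is a hypothetical helper name.

```python
NUM_PARTITIONS = 200

def partition_index(record_id, num_partitions=NUM_PARTITIONS):
    # Remainder of the id divided by the partition count (modulus, not division).
    return record_id % num_partitions

# Ids 200 and 400 leave remainder 0, so under this rule they share the
# partition with index 0, together with every other multiple of 200.
examples = {rid: partition_index(rid) for rid in (7, 102, 200, 201, 400)}
print(examples)
```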
@prathapganesh7021 1 year ago
Thank you, great explanation 🙏
@vishaljoshi1752 1 year ago
Hi Manish, you said sorting is O(n log n). What about combining the data? Suppose p1 of one table has id 1 and p2 has id 1, 1. Combining them needs two nested for loops, so the complexity is O(n^2). Is it performed the same way?
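A small sketch of the merge step helps with this question. It is a textbook sort-merge join written in plain Python, not Spark's implementation, and `sort_merge_join`, `p1`, and `p2` are hypothetical names. After sorting, the two inputs are walked once with two pointers; nested loops appear only inside a run of equal keys, where every left row must pair with every right row.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) rows by sorting and then merging."""
    left = sorted(left, key=lambda row: row[0])    # O(n log n)
    right = sorted(right, key=lambda row: row[0])  # O(m log m)
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of rows with this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    joined.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return joined

# The case from the question: one row with id 1 on one side, two on the other.
p1 = [(1, "a")]
p2 = [(1, "x"), (1, "y")]
pairs = sort_merge_join(p1, p2)
print(pairs)
```

With mostly unique keys the merge is linear in the input size. The quadratic case arises only when many rows share one key, and then the join output itself is that large, so no join strategy can avoid producing it.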
@rishavsharma5732 2 months ago
"My hair got messed up" xD. Nice work, by the way. These videos are really helpful.
@aashishraja-k7u 3 months ago
Well explained.
@nityabajpai2022 1 year ago
Hi Manish, I have a few questions:
1. We are applying the join on partitions and not on the DataFrames, right? Because the DataFrames are already divided into 4 partitions each.
2. Each join creates 200 new partitions, so if we join RP1 and BP3, will it create 200 more partitions in total? And if we join each red partition with every blue partition this way, will we have 3200 partitions in total?
3. In the video you said there are not 200 partitions per executor, but the executor does have 200 partitions: 100 red and 100 blue.
@akhiladevangamath1277 6 months ago
Hey, this is my understanding; my answers might help you.
1. We are applying the join on the DataFrames. Yes, we have 4 partitions for each DataFrame. When we apply the join, those 4 partitions are reshuffled into 200 partitions.
3. There are 200 partitions for each DataFrame, so each executor has 100 partitions of DF1 and 100 partitions of DF2.
@ManishSharma-fi2vr 6 months ago
Thanks, Manish bhai!!
@homeactfun 1 year ago
Amazing video.
@ajaypatil1881 1 year ago
Will you please make a video on O(n^2) and what it actually is?
@vishaljoshi1752 1 year ago
Hi Manish, one more question. You say the hash table is in memory, but as we know, data is first loaded into executor memory and then the operations are performed. So in a shuffle sort-merge join everything also happens in memory. Why don't we call the shuffle sort-merge join in-memory too, given that both partitions for the same key must be loaded into memory before the join is performed?
@Daily_Code_Challenge 13 days ago
We can't say that, because the shuffle sort-merge join can also use disk, while the shuffle hash join relies heavily on the hash table fitting entirely in memory.
@rohanchoudhary672 10 months ago
Nice video sir, but please say "modulus" for the operation; "divide" is a little confusing.
@manish_kumar_1 10 months ago
Take a look at how the modulus operator works.
@rohanchoudhary672 10 months ago
@manish_kumar_1 You are taking the remainder of dividing by 200, after all.
@quiet8691 7 months ago
Your intro sounds to me like "Namaskar, main Ravish Kumar" 👍👌🔥
@sreelakshmang7275 6 months ago
How do we find out the size of a DataFrame?
@raajnghani 1 year ago
I am working as an Operations Executive in a warehouse, but I have started learning Sqoop, Hive, MySQL, MongoDB, HBase, NiFi, Kafka, Spark, and AWS services. My job is completely non-IT. I cleared two interviews. How do I get an experience certificate for working on the above technologies?
@manish_kumar_1 1 year ago
Tell them that you don't have experience and that you have done all the projects on your own. If you cleared the interview, it means you are a good fit for the role.
@raajnghani 1 year ago
@manish_kumar_1 The recruiter needs experience even after clearing the L2 discussion.
@adityakvs3529 1 month ago
Bro, which join is better, shuffle hash or sort-merge, and how does Spark decide which join to use?
@KaranSingh-hx8dh 1 year ago
Thank you for explaining.
@Amarjeet-fb3lk 6 months ago
If 200 partitions are created, that means 200 cores are needed too; only then can 200 partitions be created. What if there aren't 200 cores?
@manish_kumar_1 6 months ago
It will still run. The whole point of distributed computing is to run your job even with fewer resources. To understand Spark fully, you will have to watch the whole playlist.
@younevano 18 days ago
It will run in about 200/n rounds, where n = number of cores!
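The 200/n estimate in this reply can be written as a one-line calculation. This is an idealized sketch: it assumes one core per task and tasks of equal duration, which real jobs rarely have, and `task_waves` is a hypothetical helper name.

```python
import math

def task_waves(num_partitions, total_cores):
    # Each partition is one task; at most total_cores tasks run at once,
    # so the tasks are processed in ceil(partitions / cores) rounds.
    return math.ceil(num_partitions / total_cores)

for cores in (3, 8, 16, 200):
    print(cores, "cores ->", task_waves(200, cores), "round(s)")
```

The rounding up matters: with 16 cores, 12 full rounds cover 192 partitions and a 13th round handles the remaining 8.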
@sanooosai 8 months ago
Great, sir. Thank you.
@RohitKumar-kd5fj 2 months ago
Is it really division? I think it is modulus.
@Daily_Code_Challenge 13 days ago
Yes, it is modulus.
@RajeshKumar-re8tj 6 months ago
Which memory pool is used to create the hash table during a shuffle hash join?
@younevano 19 days ago
The memory of the executors that those partitions are on after shuffling?
@mhdakram 4 months ago
An executor can have only one partition at a time. Is this not correct?
@akumar2575. 7 months ago
Day 4 done 👍
@mayanksinghsoni 1 month ago
What if the id is not numeric?
@mdasif2411 6 months ago
When the salary table is less than 10 MB and the first table is so much larger, how can both end up with the same number of partitions?
@rameshbayanavenkata1305 1 year ago
Hi Manish, I am following all your videos. Thanks for your great contribution in explaining everything in detail. As you said, records are segregated into partitions according to the remainder we get from dividing the id value by 200 partitions. What if the join is done on the name column instead of id? How does the division take place to segregate the name column into partitions? Please clarify.
@amritranjannayak2705 9 months ago
I also have the same question. Please answer this.
@younevano 18 days ago
@amritranjannayak2705 He replied on another comment asking the same thing: Murmur3 hashing is used when joining on strings!
@kartikgupta2299 4 months ago
In your video it looks like 200 partitions are created per executor, but you say that 200 are not created per executor and that 200 partitions are created in total. Please explain this part, and also why 200 are created by default.
@mohammadfurquan241 1 year ago
Sir, I have done Python, basic SQL, Linux commands, and all DBMS concepts. Can I learn Spark now, or is there any prerequisite for Spark?
@manish_kumar_1 1 year ago
No prerequisite. If you know a little SQL, you will grasp the concepts faster.
@mohammadfurquan241 1 year ago
Thank you sir, I will follow your series.
@harshi993 8 months ago
What is a partition?
@prashanttakate7856 8 months ago
Whenever you work with Spark, the data is divided into parts; each such part of the data is called a partition.
@neeraj_dama 1 year ago
How is 7 / 200 = 7?
@manish_kumar_1 1 year ago
The remainder will be 7. The pmod function is applied there.
@rohitsharma-mg7hd 1 month ago
Bro, since when is 102 / 200 = 102?
@manish_kumar_1 1 month ago
I must have meant 102 % 200 and said "divide" by mistake.
@rohitsharma-mg7hd 1 month ago
@manish_kumar_1 OK, thank you. You explain really well. Thanks a lot.
@manishsingh-cb3pp 9 months ago
Can you explain this topic more clearly?
@mayankkandpal1565 11 months ago
@pradipraj5954 9 months ago
Improve your video quality.
@pradipraj5954 9 months ago
You ramble too much. Stick straight to the point.
@radheshyama448 1 year ago
Thanks.
@adityakvs3529 1 month ago
Bro, can I have a one-to-one meeting? I have some doubts.
@manish_kumar_1 28 days ago
Sure, you can book a session on Topmate.