74. Databricks | Pyspark | Interview Question: Sort-Merge Join (SMJ)

16,903 views

Raja's Data Engineering

Comments: 40
@omprakashreddy4230 2 years ago
You are here to make our lives simple. Thank you so much !!
@rajasdataengineering7585 2 years ago
Thank you Omprakash
@moviestime2346 1 year ago
No one can explain it better than this. Thanks, Raja, for your efforts and time.
@rajasdataengineering7585 1 year ago
Thanks for your comment. Glad it helps you.
@vineethreddy.s 2 years ago
Say deptid 111 appears a million times in the emp table and over 500k times in the dept table. During the shuffle, Spark would create 200 partitions. So deptid 111 from the emp table might be split across 20 partitions and deptid 111 from the dept table across 10 partitions, and if the sort and merge are performed on these partitions, the result would be a partial join. How does Spark handle this internally?
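On the question above, a minimal sketch may help. It assumes the shuffle hash-partitions both tables on the join key, which is what a sort-merge join's exchange does, so every row with deptid 111 from both tables lands in the same partition number instead of being split across several. This is plain Python, not Spark: the `partition_id` helper, the modulo hash, and the table contents are illustrative assumptions.

```python
from collections import defaultdict

NUM_PARTITIONS = 200  # mirrors the default of spark.sql.shuffle.partitions


def partition_id(key, num_partitions=NUM_PARTITIONS):
    # Hypothetical stand-in for a hash partitioner: the same key
    # always maps to the same partition number.
    return hash(key) % num_partitions


def shuffle(rows):
    # Group (key, value) rows by the partition their key hashes to.
    parts = defaultdict(list)
    for row in rows:
        parts[partition_id(row[0])].append(row)
    return parts


def sort_merge_join(left, right):
    # Sort both sides by key, then advance two cursors in step.
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][0] == lk:
                for r in right[j:j_end]:
                    out.append((lk, left[i][1], r[1]))
                i += 1
            j = j_end
    return out


# Made-up data: deptid 111 is heavily repeated on both sides.
emp = [(111, f"emp{n}") for n in range(1000)] + [(222, "emp_x")]
dept = [(111, f"dept{n}") for n in range(5)] + [(333, "dept_y")]

emp_parts = shuffle(emp)
dept_parts = shuffle(dept)

# Join partition N of emp only with partition N of dept.
joined = []
for pid in emp_parts.keys() & dept_parts.keys():
    joined.extend(sort_merge_join(emp_parts[pid], dept_parts[pid]))

print(len(joined))  # 5000 = 1000 emp rows x 5 dept rows for deptid 111
```

Because the key decides the partition, the join is complete rather than partial. The cost is skew: the single partition holding deptid 111 carries all of that key's rows and becomes the slowest task.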
@taikoktsui_sithlord 4 months ago
to-the-point explanation, thanks!
@rajasdataengineering7585 4 months ago
Glad it was helpful! Thanks
@JimRohn-u8c 5 months ago
Is this the same as the Sort-Merge-Bucket (SMB) join?
@rebalaashishreddy9908 2 years ago
Best channel for Databricks
@rajasdataengineering7585 2 years ago
Thank you
@suresh.suthar.24 1 year ago
Hats off to you, sir. Your explanation is next level.
@rajasdataengineering7585 1 year ago
Thank you, Suresh!
@Animationslaura 1 year ago
The best explanation I've seen.
@rajasdataengineering7585 1 year ago
Thank you
@venkatasai4293 2 years ago
Good explanation, Raja. A few questions: 1) Is the number of partitions determined by the number of cores in the cluster or by the input split size, for example 128 MB for an S3 bucket? 2) What happens if the partition size is greater than the executor's memory? Does it spill to disk? Does that impact performance?
@rajasdataengineering7585 2 years ago
Thanks, Venkat. 1. The number of partitions is determined by several factors. If the input file is in a splittable format, each core reads the data in parallel and can produce one partition of up to 128 MB. If the input file is much bigger, each core produces multiple 128 MB partitions, so the number of partitions tends to be a multiple of the number of cores. 2. Usually a partition does not exceed the executor's on-heap memory. If the DataFrame (multiple partitions distributed across the cluster) exceeds the total on-heap memory, data spills, so a few partitions are stored on the local disk of the worker node. Spilled data hurts performance because it has to be written to disk and read back whenever it is needed. Hope it helps.
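The arithmetic in point 1 can be sketched roughly as follows, assuming a single splittable file and a 128 MB split size (the default of `spark.sql.files.maxPartitionBytes`). The sketch deliberately ignores the file-open cost and the per-core adjustment that Spark also applies, so treat the result as an estimate. `estimate_partitions` is a hypothetical helper, not a Spark API.

```python
import math

# Assumed split size: 128 MB, the default of spark.sql.files.maxPartitionBytes.
MAX_PARTITION_BYTES = 128 * 1024 * 1024


def estimate_partitions(file_size_bytes, max_partition_bytes=MAX_PARTITION_BYTES):
    # Simplified estimate: one input partition per full or partial split.
    if file_size_bytes <= 0:
        return 0
    return math.ceil(file_size_bytes / max_partition_bytes)


one_gib = 1024 ** 3
print(estimate_partitions(one_gib))      # 8
print(estimate_partitions(one_gib + 1))  # 9: the extra byte needs its own split
```

With 4 cores, the 8 partitions of a 1 GiB file would be read in two waves of 4 tasks each.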
@venkatasai4293 2 years ago
@rajasdataengineering7585 Thanks, Raja
@bhargavkumar4724 2 years ago
Excellent Explanation!!!
@oleg20century 8 months ago
Hello! Is one executor not the same as one worker node? Maybe worker node 1 here is a rack or a small cluster? Or maybe these executors are actually containers (cores) on one executor (worker)?
@pavankumarveesam8412 10 months ago
But the third stage is not completed there, right? Say there is one more filter operation on the DataFrame: it will still be in that stage. But if the DataFrame hits a shuffle operation such as a join, there will be another stage, correct?
@rajasdataengineering7585 10 months ago
Yes, that's right. A new stage is created only when a wide transformation causes a shuffle.
@pavankumarveesam8412 10 months ago
@rajasdataengineering7585 Thanks, Raja
@mohitupadhayay1439 1 year ago
Is this why we use a BROADCAST join? Because normal joins are expensive?
@rajasdataengineering7585 1 year ago
Exactly, this is why we use a broadcast join: to avoid the expensive sort-merge join.
@mohitupadhayay1439 1 year ago
@rajasdataengineering7585 One more question: how can we use broadcast if the small df doesn't fit in memory? Wouldn't the data spill from memory?
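To illustrate the exchange above, here is a minimal sketch in plain Python (not Spark) of why a broadcast join avoids the shuffle and the sort. It assumes the small table fits in memory on every executor, which is exactly the condition the follow-up question asks about. The table contents and the `broadcast_hash_join` helper are made up for illustration.

```python
def broadcast_hash_join(large_partitions, small_rows):
    # Build an in-memory lookup from the small table once; in a real
    # cluster a copy of this table would be sent to every executor.
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    joined = []
    # Each partition of the large table is probed where it already sits:
    # no repartitioning by key and no sorting is needed.
    for partition in large_partitions:
        for key, value in partition:
            for match in lookup.get(key, []):
                joined.append((key, value, match))
    return joined


# Hypothetical data: the emp table is spread over two partitions.
emp_partitions = [
    [(111, "alice"), (222, "bob")],
    [(111, "carol"), (333, "dave")],
]
dept = [(111, "sales"), (222, "hr")]

result = broadcast_hash_join(emp_partitions, dept)
print(result)
# [(111, 'alice', 'sales'), (222, 'bob', 'hr'), (111, 'carol', 'sales')]
```

In PySpark the corresponding hint is `broadcast()` from `pyspark.sql.functions`, and Spark broadcasts automatically when the smaller side is below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default). When the smaller table is too large for that, a sort-merge join is the usual fallback.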
@prabhatgupta6415 1 year ago
Sir, I have seen that there are multiple join strategies. I could find them in your playlist.
@rajasdataengineering7585 1 year ago
That's great
@aswaniyettapu9992 2 years ago
Very good explanation
@rajasdataengineering7585 2 years ago
Thank you
@prathapganesh7021 7 months ago
Excellent explanation, thank you
@rajasdataengineering7585 7 months ago
Glad it was helpful! Thanks Prathap
@saikiran-pl4cc 2 years ago
Thank you for the clear explanation
@rahulmittal116 2 months ago
Hats off
@rajasdataengineering7585 2 months ago
Thank you
@srinubathina7191 1 year ago
Thank You Sir
@rajasdataengineering7585 1 year ago
Most welcome
@bikeshtiwari6418 1 year ago
You're awesome, Spark Guru
@rajasdataengineering7585 1 year ago
Thanks
@vineethreddy.s 2 years ago
Thanks, Helpful
@rajasdataengineering7585 2 years ago
Thanks