repartition vs coalesce | Lec-12

23,717 views

MANISH KUMAR

1 day ago

In this video I have talked about repartition vs coalesce in Spark. If you want to optimize your Spark jobs, you should have a solid understanding of this concept.
Directly connect with me on:- topmate.io/man...
Flight Data link:- github.com/dat...
For more queries, reach out to me on my social media handles below.
Follow me on LinkedIn:- / manish-kumar-373b86176
Follow Me On Instagram:- / competitive_gyan1
Follow me on Facebook:- / manish12340
My Second Channel -- / @competitivegyan1
Interview series Playlist:- • Interview Questions an...

Comments: 70
@dayanandab.n3814 1 day ago
The practical knowledge really helps us relate it to the theory. Thank you Manish Bhai, you are doing a great job.
@dayanandab.n3814 1 day ago
I have clearly learnt repartition and coalesce. Thank you Manish Bhai.
@dataman17 8 months ago
You make the subject interesting. Best channel for data engineering. Thank you for the videos; looking forward to more!
@rahuljain8001 a year ago
Hi Manish, I tried the same thing; the partitions are actually not removed. We should use partitioned_on_column.withColumn("partition_id", spark_partition_id()).groupBy("partition_id").count().orderBy("partition_id").show(300) to check the partitions.
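Here is a runnable sketch of that check; spark.range is a hypothetical stand-in for the flight data used in the video, and spark_partition_id() is the built-in from pyspark.sql.functions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.appName("partition-check").getOrCreate()

# Hypothetical stand-in for the flight data; 4 partitions for illustration.
df = spark.range(1000).repartition(4)

# Tag each row with the id of the partition it lives in, then count rows
# per partition to see how evenly the data is distributed.
(df.withColumn("partition_id", spark_partition_id())
   .groupBy("partition_id")
   .count()
   .orderBy("partition_id")
   .show(300))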
@vishaljare1284 7 months ago
After purchasing a 25k course, I recently came to your KZbin channel and realised that your free course is worth more than that one.
@AkshayBaishander 4 months ago
Which course did you take?
@gamesandgossips 2 months ago
Where did you take it from?
@RiyaBiswas-r1p 7 months ago
I got so many of my doubts cleared from your videos; they are put together so well and are very easy to understand.
@ankitachauhan6084 6 months ago
Very good teaching style, with clarity. Thanks!
@vishaljare1284 7 months ago
I like your teaching style. The way you explain is excellent. Keep going, bro!
@kamalprajapati9955 2 months ago
Thank you for this detailed tutorial.
@TaherAhmed16 a year ago
Well explained! One question: as we discussed in the earlier sessions, RDDs are immutable. So when we do a repartition or coalesce, does the old RDD with imbalanced data still exist on the executor nodes along with the new repartitioned data? If yes, at what point does it get cleared, since it would keep eating disk on the executor nodes? Should we clear it manually in the code?
@TaherAhmed16 a year ago
I was looking at your executor out-of-memory video, so it looks like all RDDs will be there on the executor but will be spilled to disk based on LRU.
@manish_kumar_1 a year ago
Your understanding is slightly off. The same data is referenced every time; it is not copied once more. When you trigger the action, the same data is referenced, and Spark knows from the DAG which one to pick. LRU only keeps evicting cached data.
@TaherAhmed16 a year ago
@@manish_kumar_1 Got it, thanks.
@prabhatgupta6415 a year ago
Godfather of SPARK
@ajaypatil1881 a year ago
Another great video Bhaiya♥
@ManishSharma-fi2vr 6 months ago
❤ Thanks Manish Bhaiya!!
@abhishekrajput4012 3 months ago
Thank you Manish Bhaiya
@omkarm7865 a year ago
Great work
@muskangupta735 7 months ago
Great explanation
@SqlMastery-fq8rq 8 months ago
Well explained, Sir.
@surabhisasmal882 a year ago
Great video Manish, very informative. Recently I was asked: if we have 200 partitions, would we prefer repartition(1) or coalesce(1)? Any insights please?
@TheSmartTrendTrader a year ago
repartition(1) and coalesce(1) both output a single partition; however, from a performance point of view I would prefer coalesce, because it simply merges all input partitions into one without shuffling, whereas repartition does an exchange (shuffle) under the hood and takes slightly more time. You can look at the explain plan of both and it will be visible there.
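A quick way to see that difference, as the comment suggests, is to compare the physical plans (a sketch; spark.range stands in for any 200-partition DataFrame):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-compare").getOrCreate()
df = spark.range(10_000).repartition(200)  # stand-in for a 200-partition DataFrame

df.repartition(1).explain()  # plan contains an Exchange SinglePartition (full shuffle)
df.coalesce(1).explain()     # plan contains Coalesce 1, with no shuffle Exchange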
@divyanshusingh3966 2 months ago
Bhai, I am following your playlists and everything is very good, but making these two separate playlists is quite confusing; you should have merged them.
@raghavsisters a year ago
I think there should be a method to find the optimal number of partitions. If I have a large dataset, it's difficult to try out a partition count and time each attempt.
@younevano 20 days ago
Have you found any method to determine the optimal number of partitions?
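There is no built-in answer; a common rule of thumb (an assumption here, not something stated in the video, which recommends trial and error) is to target roughly 128 MB per partition:

# Heuristic sketch: pick a starting partition count from the total data size,
# assuming a ~128 MB target per partition (a common rule of thumb).
def suggest_num_partitions(total_bytes: int,
                           target_bytes: int = 128 * 1024 * 1024) -> int:
    return max(1, -(-total_bytes // target_bytes))  # ceiling division

print(suggest_num_partitions(1 * 1024**3))  # 1 GiB of data -> 8 partitions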
@saumyasingh9620 a year ago
Well explained! Thanks, keep posting. I was asked about reduceByKey in an interview; please explain that in some session too. I am not clear whether we can use it with a DataFrame or whether an RDD is required to apply it. Please comment.
@soumyaranjanrout2843 11 months ago
We can't perform reduceByKey directly on a DataFrame. We need to convert it to an RDD first, then we can apply reduceByKey: df.rdd.reduceByKey(anonymous function...)
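A self-contained sketch of that conversion (the column names are made up for illustration); note that df.rdd is an RDD of Row objects, so you map to (key, value) pairs before reducing:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reducebykey-demo").getOrCreate()

df = spark.createDataFrame(
    [("US", 10), ("IN", 5), ("US", 7), ("IN", 3)],
    ["country", "flights"],
)

# df.rdd yields Row objects; turn each into a (key, value) pair first.
totals = (df.rdd
            .map(lambda r: (r["country"], r["flights"]))
            .reduceByKey(lambda a, b: a + b))

print(totals.collect())  # e.g. [('US', 17), ('IN', 8)]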
@DpIndia a year ago
Nice tutorial, all clear
@vishenraja 6 months ago
Hi Manish, as you mentioned, with repartition the data will be evenly distributed. So if a best-selling product is distributed among multiple partitions, how will a join work, since for a join the same key should be on the same partition? Could you please explain this?
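A sketch of the two behaviours in play (column names are hypothetical): repartition(n) is round-robin and key-agnostic, repartition(n, col) hash-partitions so equal keys land together, and a shuffle join re-partitions both sides on the join key anyway before matching rows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("join-partitioning").getOrCreate()

sales = spark.createDataFrame(
    [(1, 100), (1, 200), (2, 50)], ["product_id", "amount"])
products = spark.createDataFrame(
    [(1, "A"), (2, "B")], ["product_id", "name"])

even = sales.repartition(4)                       # round-robin: even but key-agnostic
by_key = sales.repartition(4, col("product_id"))  # hash: same key -> same partition

# For large inputs a join re-shuffles both sides on product_id, so matching
# keys meet on the same partition; tiny inputs like these may instead be
# broadcast. Either way the strategy is visible in the plan.
joined = even.join(products, "product_id")
joined.explain()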
@kartikjaiswal8923 5 months ago
Nice explanation.
@sakshijain5503 6 months ago
Hello Sir, what is the difference between repartition and bucketBy? Thank you!
@younevano 20 days ago
Repartitioning (and coalescing) is about the physical distribution of data in memory or on disk, whereas bucketing is a logical organization of data based on the hash values of the column it's bucketed by.
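A sketch of that distinction (table and column names hypothetical): repartition changes the in-memory layout immediately, while bucketBy only takes effect when the data is written out as a table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucket-demo").getOrCreate()

df = spark.range(1_000).withColumnRenamed("id", "user_id")

by_key = df.repartition(8, "user_id")  # physical: 8 in-memory partitions by hash

(df.write
   .mode("overwrite")
   .bucketBy(8, "user_id")             # logical: 8 buckets recorded in the table
   .sortBy("user_id")
   .saveAsTable("users_bucketed"))     # bucketBy requires saveAsTable, not save()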
@engineerbaaniya4846 a year ago
Awesome content
@user93-i2k 2 months ago
One doubt, bhaiya: usually we avoid repartitioning, right? Unless we have a very large file; otherwise it will create the "small file problem". Is my understanding correct?
@sanooosai 8 months ago
Great, sir, thank you.
@kalyanreddy496 a year ago
Consider that you have read a 1GB file into a DataFrame, and maxPartitionBytes is set to 128MB. You apply repartition(4) or coalesce(4) on the DataFrame; either method will decrease the number of partitions, and the partition size will grow beyond 128MB, even though maxPartitionBytes is configured as 128MB. Does this throw an error or not? If it throws an error, what error do we get when we execute the program? If not, what is Spark's behaviour in this scenario? Could you tell me the answer, sir? I was recently asked this question.
@manish_kumar_1 a year ago
Your partitions will just be of a bigger size, around 250 MB each in deserialized form. But when you write as Parquet, snappy compression will reduce the size. Let's say each partition compresses down to 150 MB; the total size after compression will then be 150*4 = 600 MB. If you read it again after writing, it will be read into ceil(600/128) = 5 partitions.
@soumyaranjanrout2843 11 months ago
Spark will not throw an error, and it will go past the configured limit. It will create 4 partitions regardless of maxPartitionBytes, and each partition will now hold approximately 1GB/4 = 250MB, as Manish sir said. But it can lead to performance degradation. Moreover, maxPartitionBytes is configurable, so if we work with a larger dataset we can tune it for the use case. By the way, thanks for your question; because of it I learnt something beyond the video.
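A sketch of the whole scenario with hypothetical paths; spark.sql.files.maxPartitionBytes only controls how files are split when read, not what repartition() or coalesce() may produce afterwards:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("maxPartitionBytes-demo")
         .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
         .getOrCreate())

df = spark.read.option("header", True).csv("/data/one_gb_file.csv")
print(df.rdd.getNumPartitions())      # ~8 partitions of ~128 MB on read

big = df.coalesce(4)                  # no error: 4 partitions of ~250 MB each
big.write.mode("overwrite").parquet("/data/out")  # snappy-compressed by default

reread = spark.read.parquet("/data/out")
print(reread.rdd.getNumPartitions())  # split by maxPartitionBytes again on read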
@pde.joggiri a year ago
Doubt 1: repartition(1) vs coalesce(1): is there any difference? Which one should we use when writing a single file? Doubt 2: I was reading multiple (6) CSV files into a DataFrame, then wrote with coalesce(1) and again overwrote with coalesce(10); it gives 6 partitions. Why did the partition count change like that with coalesce()?
@younevano 22 days ago
For small to medium datasets, use coalesce(1); it is faster and sufficient for creating a single file. For large datasets with skewed partitions, use repartition(1) to balance the data first. On doubt 2: coalesce() can only reduce the number of partitions, never increase them, so coalesce(10) on a DataFrame that was read as 6 partitions (one per input file) is a no-op and you still get 6.
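A quick check of that second doubt (paths hypothetical); coalesce() never increases the partition count, so asking for more partitions than you have is a no-op:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-noop").getOrCreate()
df = spark.read.option("header", True).csv("/data/six_csvs/*.csv")

print(df.rdd.getNumPartitions())               # 6 (one per input file here)
print(df.coalesce(1).rdd.getNumPartitions())   # 1
print(df.coalesce(10).rdd.getNumPartitions())  # still 6, not 10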
@poojajoshi871 a year ago
Hi Sir, in the withColumn line we are adding partition_id as a column, but how are we putting the value into that column when no literal is being introduced? Also, can you please explain spark_partition_id() and why we are using it?
@vilaspatil-r3q a year ago
With withColumn we add a new column, and spark_partition_id() is a built-in function available in pyspark.sql.functions.
@younevano 20 days ago
spark_partition_id() is the expression being passed in place of a literal; it returns a Column.
@akumar2575. 7 months ago
day 2 done 👍
@NirajAgrawal-e6v a year ago
Can you please explain repartition and coalesce with a real-time DataFrame join example, so we can see the real-time optimization of the join process?
@ShivamGupta-wn9mo 26 days ago
Good
@adityakvs3529 17 hours ago
Bhai, I have a doubt: suppose we have more records for a particular id, for example ID1. Then even when we repartition, we won't get evenly distributed data, right?
@raghavsisters a year ago
Do you have a Git page for the code?
@vilaspatil-r3q a year ago
If we partition correctly according to our data, does our execution time decrease?
@manish_kumar_1 a year ago
Yes
@vilaspatil-r3q a year ago
@@manish_kumar_1 Thank you for the confirmation. Manish, I have watched all your videos, theory and practical, and they are really awesome. You have explained everything in a simple way that makes it very worthwhile 🙏❤
@CctnsHelpdesk 8 months ago
Great!
@alokkumarmohanty8454 a year ago
Hi Manish, thanks for all your videos. I personally got to know so many things from them. I have a doubt: for any given instance, how do we decide the number of partitions for both repartition and coalesce? In repartition(10), for example, how do we decide on the number 10?
@younevano 22 days ago
He said in the video that it's trial and error.
@vishalmane3139 a year ago
Bro, about the interview questions from different companies that you have posted: if we follow your DE roadmap, will we be able to answer those questions?
@rp-zf3ci a year ago
@manish please explain the bucketing concept in Spark.
@manish_kumar_1 a year ago
Already did
@rp-zf3ci a year ago
@@manish_kumar_1 As per that video, you mentioned 5 buckets will be created after repartition(5). But I think it should be 5*5 = 25 buckets: 5 buckets for each task. Please correct me if I'm wrong. Thanks.
@younevano 20 days ago
@@rp-zf3ci Yeah, even I think so: 25 buckets! Did you confirm? In the 'Partitioning and Bucketing in Spark' video, he said each wide transformation causes 200 tasks, i.e. 200 partitions. So shouldn't df.repartition(5).write.bucketBy(5, "col_name") work as follows? repartition(5) makes it 5 partitions, and then bucketBy(5) applied afterwards makes each partition into 5 buckets, so 25 buckets in total, right?
@younevano 20 days ago
@@manish_kumar_1 Can you clarify the comment below?
@younevano 20 days ago
What I found: after repartition(5), when we use bucketBy(5, "col_name"), the data is organized into 5 buckets based on the specified column. However, bucketBy does not further divide each partition into smaller buckets; it logically organizes the data into 5 buckets in total across the dataset. This organization happens when the data is written out, typically as a table in a format that supports bucketing. repartition(5) ensures there are exactly 5 partitions. Partitions are about the physical distribution of data in memory or on disk; buckets are a logical organization of data based on hash values of the specified column(s). Bucketing is not equivalent to repartitioning, and it does not multiply the number of partitions by the number of buckets. So the total number of buckets will be 5, not 25.
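A sketch that reconciles the 5 vs 25 confusion (table and column names hypothetical): the table metadata records 5 buckets, but each of the 5 write tasks can emit one file per bucket, so up to 25 files may appear on disk:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucket-count-demo").getOrCreate()

df = spark.range(10_000).withColumnRenamed("id", "col_name")

(df.repartition(5)             # 5 partitions -> 5 write tasks
   .write
   .mode("overwrite")
   .bucketBy(5, "col_name")    # 5 logical buckets in the table metadata
   .sortBy("col_name")
   .saveAsTable("bucketed_demo"))

# The catalog reports 5 buckets, even though up to 5 x 5 = 25 files may exist.
spark.sql("DESCRIBE FORMATTED bucketed_demo").show(50, truncate=False)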
@udittiwari8420 10 months ago
Thank you, sir.
@RahulPatil-iu2sp a year ago
Hi Manish sir, if we are processing a 1TB file on a 10-node cluster (64 GB RAM each), will it get processed or throw an OOM error? Could you please explain this?
@manish_kumar_1 a year ago
In short, it will run. Memory management is a vast topic, and explaining OOM is not simple, at least in a comment. I will make a dedicated video on this topic; stay connected with the channel.
@RahulPatil-iu2sp a year ago
Thanks for the clarification, @Manish Sir. I'll stay tuned 👍👍
@akhilesh2186 a year ago
You are the second Khan Sir of Bihar. Just a suggestion: add some jokes in between to keep it lighter. Unfortunately your views are not that high, as the audience is limited, but I bet no one can explain better than you. I guess if you make the same videos in English you will get more subscribers as well as viewers. I am from the Hindi/Bhojpuri region (Varanasi) and like your videos a lot, but whenever I look at your views and subscribers I wonder whether I can increase them in any way; that's why I am giving this suggestion.
@mrunmaygosavi3062 9 months ago
Don't keep saying "pros and corn"... it makes us hungry.