22. Databricks | Spark | Performance Optimization | Repartition vs Coalesce

51,855 views

Raja's Data Engineering

1 day ago

Comments: 86
@Sundar_Tenkasi 5 days ago
A lot of information in this video, very useful; now I understand how to customize the partitions. Thank you.
@rajasdataengineering7585 4 days ago
Glad it was helpful! You are welcome!
@Aramakishore 2 years ago
I have never seen any video explained in such detail. Really appreciate it. Everything is very clearly understandable.
@rajasdataengineering7585 2 years ago
Thank you
@Akshaykumar-pu4vi 2 years ago
Follow this playlist, it is tremendous, sir, and you explain the concepts in a very good way. Thank you sir.
@rajasdataengineering7585 2 years ago
Thank you Akshay
@mynamesathish 3 years ago
Nice explanation! In the example shown, repartition(2) created partitions of unequal size (one with 8 records and another with 2 records), but I expected them to be of almost equal size.
@riyazalimohammad633 2 years ago
@Sathish I had the same doubt while watching the video. repartition(2) created partitions of unequal size, but coalesce(2) had partitions with 5 records each. Got me confused. @Raja sir, please clarify.
@rajasdataengineering7585 2 years ago
@riyazalimohammad633 Your understanding is right. Repartition always creates evenly distributed partitions (as explained in the video), whereas coalesce produces unevenly distributed partitions. In this example we used a very small (almost negligible) dataset, so the difference is not visible. When you work on actual big data projects, the difference is very evident. Thanks for your comment.
@rajasdataengineering7585 2 years ago
@Sathish, sorry for the late reply. Your understanding is right. Repartition always creates evenly distributed partitions (as explained in the video), whereas coalesce produces unevenly distributed partitions. In this example we used a very small (almost negligible) dataset, so the difference is not visible. When you work on actual big data projects, the difference is very evident. Thanks for your comment.
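To see this evenness difference on a slightly larger dataset, here is a minimal sketch, assuming a Databricks/PySpark notebook where spark is already defined; exact counts will vary with your data layout:

    # Build a deliberately skewed DataFrame: one large and one small partition.
    big = spark.range(900000).coalesce(1)
    small = spark.range(100000).coalesce(1)
    df = big.union(small)                          # 2 partitions: ~900k and ~100k rows

    def rows_per_partition(d):
        # Count rows in each partition without collecting the data itself.
        return d.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()

    print(rows_per_partition(df))                  # [900000, 100000]
    print(rows_per_partition(df.repartition(2)))   # full shuffle: roughly even counts
    print(rows_per_partition(df.coalesce(2)))      # no shuffle: the skew is preserved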
@riyazalimohammad633 2 years ago
@rajasdataengineering7585 Thank you for your prompt response! Much appreciated.
@somesh512 2 years ago
I just watched the video and had the exact same doubt. But Raja Sir already provided the answer
@gurumoorthysivakolunthu9878 1 year ago
Great, sir. 1. What is the maximum value that can be set for maxPartitionBytes? 2. What parameters should be considered when deciding the partition bytes and the repartition count? Thank you, sir.
@avinash1722 1 month ago
Very informative. Way better than paid courses.
@rajasdataengineering7585 1 month ago
Thank you!
@vydudraksharam5960 1 year ago
Raja sir, very well explained with examples. I would like to know: in the pictures you showed 2 executors for repartition and coalesce, but in the same pictures the output is labelled executor 1 in both cases. Is that a mistake, or did I misunderstand? Could you please clarify? This difference is there in both slides. Thank you, Vydu.
@rajasdataengineering7585 1 year ago
Yes, that is a mistake.
@vipinkumarjha5587 3 years ago
Very nice video, sir. It cleared all my basic doubts about partitioning. Hope to see videos on optimization approaches like cache, persist and Z-order. Thanks again.
@rajasdataengineering7585 3 years ago
Thank you Vipin. Sure, will post videos on optimization concepts such as cache, persist, Z-order in Delta, etc.
@shaileshsondawale2811 1 year ago
Wow, wonderful delivery, sir! Wonderful content.
@rajasdataengineering7585 1 year ago
Thanks Shailesh!
@phanisrikrishna 1 year ago
Hi sir, I was looking for a complete PySpark series with more emphasis on the architecture and its components. I am having a good learning time with your KZbin series on PySpark. I was wondering if I could get the slides for this course, which would help me refer back quickly when attending interviews.
@vutv5742 6 months ago
Great explanation! 🎉🎉🎉
@rajasdataengineering7585 6 months ago
Glad you liked it! Keep watching
@varun8952 2 years ago
Very detailed explanation, sir.
@rajasdataengineering7585 2 years ago
Thank you Varun
@gauthamn2844 6 months ago
It was a good session. Is there any keyword to indicate whether repartition will increase or decrease the partitions? With repartition(20), how will we know whether it increased or decreased? Only after execution do we come to know whether it increased or decreased.
@lokeshv4348 11 months ago
At 5:30 there is a mention that snappy and gzip are both not splittable. But snappy is splittable and can have partitions.
@rajasdataengineering7585 11 months ago
Not all snappy files are splittable. Snappy with Parquet/Avro is splittable, but snappy with JSON is not. We can't generalise that all snappy files are either splittable or non-splittable.
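For anyone who wants to check this themselves, a rough sketch of inspecting the number of input partitions; the paths are hypothetical and the files would need to be large enough to span several input splits for the difference to show:

    # Snappy inside Parquet is splittable at row-group boundaries, so a large file
    # can be read as several partitions. A gzip-compressed JSON file is not
    # splittable, so each file typically becomes a single partition regardless of size.
    parquet_df = spark.read.parquet("/tmp/events_snappy.parquet")   # hypothetical path
    gzip_json_df = spark.read.json("/tmp/events.json.gz")           # hypothetical path

    print(parquet_df.rdd.getNumPartitions())    # typically > 1 for a large file
    print(gzip_json_df.rdd.getNumPartitions())  # typically 1 per gzipped file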
@gulsahtanay2341 7 months ago
Very helpful content, thank you!
@rajasdataengineering7585 7 months ago
Glad it was helpful! You are welcome
@mrpoola49 1 year ago
That was amazingly explained! You rock!
@rajasdataengineering7585 1 year ago
Glad it was helpful!
@ririraman7 2 years ago
awesome tutorial
@rajasdataengineering7585 2 years ago
Thanks Ramandeep!
@kamalbhallachd 3 years ago
Wow amazing
@rajasdataengineering7585 3 years ago
Thank you Kamal
@vidhyalakshmiparthasarathy8573 1 year ago
Thank you so much sir for making such great videos. I'm learning a lot of nuances and best practices for practical applications.😊🙏
@rajasdataengineering7585 1 year ago
Thank you for your comment! Happy to hear that these videos are helpful to you.
@vishalaaa1 1 year ago
excellent
@rajasdataengineering7585 1 year ago
Thanks Vishal! Glad you liked it
@arindamghosh3787 2 years ago
This is the video I was searching for .. thanks a lot ❤
@rajasdataengineering7585 2 years ago
Thanks Arindam!
@BeingSam7 1 day ago
18:43 - When you decreased the number of partitions from 20 to 2 using repartition, which is supposed to create evenly distributed data, why did one partition contain 8 records and the other only 2? That is uneven data distribution.
@kamalbhallachd 3 years ago
Really nice 👍
@rajasdataengineering7585 3 years ago
Thank you Kamal
@vedantbopardikar3507 8 months ago
All credits to you sir
@rajasdataengineering7585 8 months ago
Thank you! Hope it helps you gain the knowledge.
@robinshaw4641 10 months ago
In a real-time scenario, when would we use coalesce and when repartition?
@sameludhanaraj 5 months ago
Well explained. Thanks!
@rajasdataengineering7585 5 months ago
Glad it was helpful! Thanks
@tarunpothala2071 1 year ago
Hi sir, it was a great explanation and good to see the practical implementation. My only question: theoretically it was said that repartition distributes the data evenly and coalesce distributes it unevenly, but in the practical implementation I saw the opposite result; coalesce gave evenly distributed values in the two partitions but repartition didn't. Can you please check?
@tarunpothala2071 1 year ago
Sorry, just saw the comments below. Will try with larger datasets.
@rajasdataengineering7585 1 year ago
Please check with a larger dataset and you will see the difference.
@BaBa_Ji-x8b 2 months ago
Please make a video on liquid clustering.
@rajasdataengineering7585 2 months ago
Sure, will create one soon.
@maurifkhan3029 1 year ago
Quick question: will the change to the default partition size apply at the cluster level, or only to the notebook? If other jobs are running on the cluster, will they also be impacted by the change in settings?
@kamalbhallachd 3 years ago
Helpful tips
@rajasdataengineering7585 3 years ago
Thank you Kamal
@ayushiagarwal528 8 months ago
In the example, repartition produced uneven output for 2 partitions but coalesce produced an even result. Please explain?
@kalyanreddy496 1 year ago
Good evening. Recently I came across a question in a Capgemini client interview. Consider a scenario: a 2 GB file is distributed in Hadoop. After doing some transformations we got 10 dataframes. By applying repartition(1), all the data sits in one dataframe; the dataframe size is 1.8 GB but the data node size is only 1 GB. Will this 1.8 GB fit in the data node or not? If yes, how? If no, what error will be given? Requesting you, sir, please tell me the answer to this question.
@avisinha2844 1 year ago
Hello sir, I have a small doubt: when we supply 3 separate files into a single dataframe at 14:03, why is the number of partitions 3 when the default partition size is 128 MB, given that the size of the dataframe containing the 3 files is a lot less than 128 MB?
@rajasdataengineering7585 1 year ago
I will post another video on this concept which will explain it in detail.
@raghavendarsaikumar 1 year ago
I see executors 1 and 2 in the picture before the coalesce or repartition, but after the operation I see both of them labelled executor 1. Is the picture wrong, or does this operation reduce the number of executors as well?
@rajasdataengineering7585 1 year ago
Good catch. It's a pictorial mistake. Repartition or coalesce has nothing to do with the number of executors.
@suresh.suthar.24 1 year ago
Hello Raja sir, a few days ago in an interview I was asked: if we want to create 1 partition from multiple partitions, which method would you choose, coalesce or repartition? I answered coalesce, but they said we would use repartition. Is that correct?
@rajasdataengineering7585 1 year ago
Hi Suresh, in this case the number of partitions needs to be reduced. Both coalesce and repartition can be used to reduce the number of partitions, but choosing between them depends heavily on the use case, so you should have asked the interviewer for more input to understand the use case better. If many transformations are applied after resizing the partitions, repartition is the better choice; otherwise coalesce is the better choice.
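For illustration, a minimal sketch of the two options when the goal is a single partition for output; df and the paths are hypothetical:

    # coalesce(1): narrow, avoids a shuffle, but the final stage runs as one task
    df.coalesce(1).write.mode("overwrite").csv("/mnt/out/report_coalesce")

    # repartition(1): adds a shuffle, but upstream transformations keep their parallelism
    df.repartition(1).write.mode("overwrite").csv("/mnt/out/report_repartition")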
@suresh.suthar.24 1 year ago
@rajasdataengineering7585 Thanks 🙏
@AIFashionistaGuide 1 year ago
****************************** 1. Performance Tuning ******************************
1. Performance Optimization | Repartition vs Coalesce
--Spark is known for its speed; the speed comes from parallel computing, and parallel computing comes from partitioning.
--Partitioning is the key to parallel processing.
--If we design the partitions well, performance improves automatically.
--Hence partitioning plays an important role in error handling, debugging and performance.
--While partitioning we must get two things right:
1. The right size of partitions.
Scenario: 2 partitions of 1000 MB and 10 MB. The 10 MB partition finishes faster and its core then sits idle, which is not good.
2. The right number of partitions.
Scenario: we have a 16-core executor but only 10 partitions are created. Then 10 of the 16 cores pick one partition each; since partitions cannot be shared among cores, 6 cores remain idle. Hence the right number of partitions must be chosen. Choose 16 partitions, or at least a multiple of the available cores: in the first iteration all 16 cores pick 16 partitions, and in the second iteration they pick the next 16, so no cores sit idle.
spark.default.parallelism
This property was introduced with RDDs, so it applies only to RDDs. Its default value is the total number of cores on all nodes of the cluster; in local mode it is the number of cores on your system (in the video's setup it is 8, so 8 partitions are created by default). For RDDs, wide transformations like reduceByKey(), groupByKey() and join() trigger data shuffling.
spark.sql.files.maxPartitionBytes
When data is read from external files, partitions are created based on this parameter: the maximum number of bytes to pack into a single partition when reading files. It is effective only for file-based sources such as Parquet, JSON and ORC. The default is 128 MB.
Both of the above parameters are configurable depending on your needs.
DataFrame.repartition()
pyspark.sql.DataFrame.repartition() is used to increase or decrease the number of RDD/DataFrame partitions, either by a number of partitions or by one or more column names. It takes 2 parameters, numPartitions and *cols; when one is specified, the other is optional. repartition() is a wide transformation that involves shuffling the data, hence it is considered an expensive operation.
Key points:
• repartition() is used to increase or decrease the number of partitions.
• repartition() creates even partitions compared with coalesce().
• It is a wide transformation.
• It is an expensive operation, as it involves a data shuffle and consumes more resources.
• repartition() can take an int or column names as parameters to define how to perform the partitioning.
• If no parameters are specified, it uses the default number of partitions.
• As part of performance optimization, it is recommended to avoid this function when it is not needed.
coalesce()
--Spark DataFrame coalesce() is used only to decrease the number of partitions.
--It is an optimized or improved version of repartition() in that less data is moved across partitions.
--coalesce() does not require a full shuffle; it combines a few partitions, or shuffles data only from a few partitions, thus avoiding a full shuffle.
--Because it merges partitions, it produces partitions of uneven size.
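To connect the notes above to code, a small sketch of the configurations and methods they mention; the values and path are illustrative only, not recommendations, and spark is assumed to be an existing session:

    # Input-side partitioning: applies when reading file-based sources.
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))  # 128 MB per input split
    # Shuffle-side partitioning for DataFrames (spark.default.parallelism is the RDD counterpart).
    spark.conf.set("spark.sql.shuffle.partitions", "200")

    df = spark.read.parquet("/mnt/raw/sales")   # hypothetical path
    print(df.rdd.getNumPartitions())            # driven by file sizes and maxPartitionBytes

    df16 = df.repartition(16)   # wide: full shuffle, can increase or decrease, roughly even sizes
    df4 = df16.coalesce(4)      # narrow: only decreases, merges partitions, may be uneven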
@kalyanreddy496 1 year ago
Good afternoon sir. Requesting you to answer this question, which I recently faced in an interview. Consider that you have read a 1 GB file into a dataframe and the max partition bytes configuration is set to 128 MB. You then apply repartition(4) or coalesce(4) on the dataframe; either method will decrease the number of partitions, and the partition size becomes greater than 128 MB even though max partition bytes is configured to 128 MB. Does it throw an error or not? If it throws an error, what error do we get when we execute the program? If not, what is the behaviour of Spark in this scenario? Could you tell me the answer to this question, sir? Requesting you, sir, please.
@rajasdataengineering7585 1 year ago
The configuration maxPartitionBytes plays its role while ingesting data from an external system into Spark memory. Once data is loaded into Spark memory, partition sizes can vary according to the various transformations and have nothing to do with maxPartitionBytes. So in this case it won't throw any error. Coalesce would produce unevenly distributed partitions, whereas repartition would create evenly distributed partitions in this case. Hope that clarifies your doubts. Thanks for sharing your interview experience; others in this community can benefit from it.
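A small sketch of that interview scenario; the path is hypothetical, the sizes are approximate, and the point is only that no error is raised:

    spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
    df = spark.read.parquet("/mnt/raw/one_gb_dataset")   # ~1 GB -> roughly 8 input partitions

    # After loading, partition sizes are no longer bound by maxPartitionBytes:
    df_rep = df.repartition(4)   # ~256 MB partitions, evenly distributed, no error
    df_col = df.coalesce(4)      # ~256 MB partitions, possibly uneven, no error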
@kalyanreddy496 1 year ago
@rajasdataengineering7585 Thank you very much sir, I understand. If possible, please do a video on this question so we get more understanding visually. If possible, please do it sir 🙏
@rajasdataengineering7585 1 year ago
Sure Kalyan, will create a video on this requirement.
@da8233 1 year ago
Thank you so much, it's a wonderful explanation.
@rajasdataengineering7585 1 year ago
Thank you
@vamsi.reddy1100 1 year ago
One doubt: when we used repartition(2), we got unevenly distributed partitions, i.e. 8 records in one partition and 2 in the other. But repartition should give us evenly distributed partitions, right? Please help me understand.
@rajasdataengineering7585 1 year ago
Hi Vamsi, good question. Data does get evenly distributed by repartition. Here we see some difference because of the small dataset; from Spark's point of view, 2 rows or 8 rows are almost the same. We can see the difference between repartition and coalesce when dealing with huge amounts of data, like millions or billions of rows.
@vamsi.reddy1100 1 year ago
@rajasdataengineering7585 Thank you for the clarification.
@vamsi.reddy1100 1 year ago
@rajasdataengineering7585 Your videos are so good...
@rajasdataengineering7585 1 year ago
Thank you
@amiyaroy6789 3 months ago
@rajasdataengineering7585 I had the same question, thank you for explaining!
@CoopmanGreg 1 year ago
👍
@rajasdataengineering7585 1 year ago
👍🏻
@a2zhi976 1 year ago
In the code I see sc.parallelize(range(100), 1); where is the reference for sc?
@rajasdataengineering7585 1 year ago
In Databricks, the Spark context is implicit; there is no need to define it separately.
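For context, a minimal sketch of the difference between a Databricks notebook, where spark and sc are pre-created, and a standalone PySpark script, where you build them yourself:

    from pyspark.sql import SparkSession

    # In a standalone script you create the session; Databricks does this for you.
    spark = SparkSession.builder.appName("partition-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(100), 1)   # the call from the video: 100 elements, 1 partition
    print(rdd.getNumPartitions())         # 1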
@shreyanvinjamuri 1 year ago
Is sc.defaultParallelism for RDDs, and will it only work with RDDs? And was spark.sql.shuffle.partitions introduced with DataFrames, and does it only work with DataFrames?