This video packs in a lot of useful information and makes it easy to understand how to customize partitions. Thank you.
@rajasdataengineering7585 4 days ago
Glad it was helpful! You are welcome!
@Aramakishore 2 years ago
I have never seen any video explained in this much detail. Really appreciate you. It is very clear to understand.
@rajasdataengineering7585 2 years ago
Thank you
@Akshaykumar-pu4vi 2 years ago
Follow this playlist, it is tremendous, sir, and you explain the concepts very well. Thank you, sir.
@rajasdataengineering7585 2 years ago
Thank you Akshay
@mynamesathish 3 years ago
Nice explanation! In the example shown, repartition(2) created partitions of unequal size (one with 8 records and another with 2 records), but I expected them to be almost equal in size.
@riyazalimohammad633 2 years ago
@Sathish I had the same doubt while watching the video. repartition(2) created partitions of unequal size, but coalesce(2) produced partitions with 5 records each. That got me confused. @Raja sir, please clarify.
@rajasdataengineering7585 2 years ago
@@riyazalimohammad633 Your understanding is right. Repartition does a full shuffle and aims for evenly distributed partitions (as I explained in the video), whereas coalesce merges existing partitions and can produce unevenly distributed ones. In this example we used a very small (almost negligible) dataset, so the difference does not show up. In real big data projects the difference is very evident. Thanks for your comment.
@rajasdataengineering7585 2 years ago
@Sathish, sorry for the late reply. Your understanding is right. Repartition does a full shuffle and aims for evenly distributed partitions (as I explained in the video), whereas coalesce merges existing partitions and can produce unevenly distributed ones. In this example we used a very small (almost negligible) dataset, so the difference does not show up. In real big data projects the difference is very evident. Thanks for your comment.
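A rough way to see how a tiny dataset can end up split 8 and 2: the sketch below is a simplified pure-Python model (an assumption, not Spark's actual code) of round-robin redistribution in which every input partition starts dealing its rows at a pseudo-random output partition. The function name `round_robin_sizes` and the seed are hypothetical, chosen only for illustration.

```python
import random


def round_robin_sizes(input_partition_sizes, num_output, seed=0):
    """Return the row count per output partition in a simplified model."""
    rng = random.Random(seed)
    sizes = [0] * num_output
    for rows in input_partition_sizes:
        # Assumption: each input partition starts at a random output partition
        # and then deals its rows out one by one in round-robin order.
        start = rng.randrange(num_output)
        for i in range(rows):
            sizes[(start + i) % num_output] += 1
    return sizes


# 10 rows spread over 20 input partitions (one row or none per partition).
tiny = round_robin_sizes([1, 0] * 10, num_output=2)

# 1,000,000 rows sitting in 4 large input partitions.
large = round_robin_sizes([250_000] * 4, num_output=2)

print("tiny :", tiny)
print("large:", large)
```

In the tiny case the split depends entirely on the random starting points, so a lopsided result such as 8 and 2 is possible. In the large case each output partition receives exactly half of the rows, because the starting point shifts the count by at most one row per input partition.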
@riyazalimohammad633 2 years ago
@@rajasdataengineering7585 Thank you for your prompt response! Much appreciated.
@somesh512 2 years ago
I just watched the video and had the exact same doubt, but Raja sir has already provided the answer.
@gurumoorthysivakolunthu9878 1 year ago
Great, sir. 1. What is the maximum value that can be set for maxPartitionBytes? 2. What parameters should be considered when deciding maxPartitionBytes and the repartition count? Thank you, sir.
@avinash1722 1 month ago
Very informative. Way better than paid courses.
@rajasdataengineering7585 1 month ago
Thank you!
@vydudraksharam5960 1 year ago
Raja sir, very well explained with examples. In the pictures you show two executors for repartition and coalesce, but in the output part of the same picture both are labelled executor 1. Is that a mistake, or did I not understand it properly? Could you please clarify? The same difference appears in both slides. -- Thank you, Vydu
@rajasdataengineering7585 1 year ago
Yes, it is a mistake.
@vipinkumarjha5587 3 years ago
Very nice video, sir. It cleared all my basic doubts about partitioning. Hope to see videos on optimization approaches like cache, persist and Z-order. Thanks again.
@rajasdataengineering7585 3 years ago
Thank you Vipin. Sure, I will post videos on optimization concepts such as cache, persist, Z-order in Delta, etc.
@shaileshsondawale2811 1 year ago
Wow. Wonderful delivery, sir! Wonderful content.
@rajasdataengineering7585 1 year ago
Thanks Shailesh!
@phanisrikrishna 1 year ago
Hi sir, I was looking for a complete PySpark series with more emphasis on the architecture and its components, and I am having a good time learning from your YouTube series on PySpark. Could I get the slides for this course? They would help me refer back quickly when attending interviews.
@vutv5742 6 months ago
Great explanation...🎉🎉🎉
@rajasdataengineering7585 6 months ago
Glad you liked it! Keep watching
@varun8952 2 years ago
Very detailed explanation, sir.
@rajasdataengineering7585 2 years ago
Thank you Varun
@gauthamn2844 6 months ago
It was a good session. Is there any keyword that indicates whether repartition increases or decreases the number of partitions? With repartition(20), how do we know whether the count increased or decreased? We only find out after execution.
@lokeshv4348 11 months ago
At 5:30, it is mentioned that snappy and gzip are both not splittable. But snappy is splittable and can have partitions.
@rajasdataengineering7585 11 months ago
Not all snappy files are splittable. Snappy with Parquet/Avro is splittable, but snappy with JSON is not. We can't generalise that all snappy files are splittable or non-splittable.
@gulsahtanay2341 7 months ago
Very helpful content, thank you!
@rajasdataengineering7585 7 months ago
Glad it was helpful! You are welcome
@mrpoola49 1 year ago
That was amazingly explained! You rock!
@rajasdataengineering7585 1 year ago
Glad it was helpful!
@ririraman7 2 years ago
awesome tutorial
@rajasdataengineering7585 2 years ago
Thanks Ramandeep!
@kamalbhallachd 3 years ago
Wow amazing
@rajasdataengineering7585 3 years ago
Thank you Kamal
@vidhyalakshmiparthasarathy8573 1 year ago
Thank you so much sir for making such great videos. I'm learning a lot of nuances and best practices for practical applications.😊🙏
@rajasdataengineering7585 1 year ago
Thank you for your comment! Happy to hear that these videos are helpful to you.
@vishalaaa1 1 year ago
excellent
@rajasdataengineering7585 1 year ago
Thanks Vishal! Glad you liked it
@arindamghosh3787 2 years ago
This is the video I was searching for .. thanks a lot ❤
@rajasdataengineering7585 2 years ago
Thanks Arindam!
@BeingSam7 1 day ago
18:43 - When you decreased the number of partitions from 20 to 2 using repartition, which is supposed to distribute data evenly, why did one partition contain 8 records and the other only 2? Isn't this uneven data distribution?
@kamalbhallachd 3 years ago
Really nice 👍
@rajasdataengineering7585 3 years ago
Thank you Kamal
@vedantbopardikar3507 8 months ago
All credits to you sir
@rajasdataengineering7585 8 months ago
Thank you! Hope it helps you gain knowledge.
@robinshaw4641 10 months ago
In a real-time scenario, when would we use coalesce and when repartition?
@sameludhanaraj 5 months ago
Well explained. Thanks.
@rajasdataengineering7585 5 months ago
Glad it was helpful! Thanks
@tarunpothala2071 1 year ago
Hi sir, it was a great explanation and good to see the practical implementation. My only question: in theory, repartition distributes data evenly and coalesce distributes it unevenly. But in the practical demo I saw the opposite: coalesce put evenly distributed values in the two partitions, while repartition did not. Can you please check?
@tarunpothala2071 1 year ago
Sorry, I just saw the comments below. I will try with larger datasets.
@rajasdataengineering7585 1 year ago
Please check with a larger dataset and you will see the difference.
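To make the larger-dataset case concrete, here is a small pure-Python sketch of why coalesce can give uneven partitions. It assumes, as a simplification, that coalesce merges neighbouring input partitions into groups without moving individual rows; `coalesce_sizes` and the sample sizes are made up for illustration and are not Spark APIs.

```python
def coalesce_sizes(input_partition_sizes, num_output):
    """Merge contiguous input partitions into num_output groups.

    Simplified model: rows never leave their original partition, so the
    output sizes are just sums of the input sizes in each group.
    Assumes num_output <= len(input_partition_sizes).
    """
    n = len(input_partition_sizes)
    sizes = []
    for i in range(num_output):
        start = i * n // num_output
        end = (i + 1) * n // num_output
        sizes.append(sum(input_partition_sizes[start:end]))
    return sizes


# Six input partitions where most of the rows sit in the first one.
skewed_input = [900, 50, 30, 10, 5, 5]

merged = coalesce_sizes(skewed_input, 2)
even_target = sum(skewed_input) // 2

print("coalesce-style merge:", merged)
print("even split would be :", [even_target, even_target])
```

When the input partitions are already similar in size, as in a tiny demo dataset, merging them gives an even result. When they are skewed, the skew carries over into the merged partitions, which is the uneven behaviour described for coalesce.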
@BaBa_Ji-x8b 2 months ago
Please make a video on liquid clustering.
@rajasdataengineering7585 2 months ago
Sure, will create one soon.
@maurifkhan3029 1 year ago
QQ: Will the change to the default partition size apply at the cluster level, or only to the notebook? If other jobs are running on the cluster, will they also be affected by the change in settings?
@kamalbhallachd 3 years ago
Helpful tips
@rajasdataengineering7585 3 years ago
Thank you Kamal
@ayushiagarwal528 8 months ago
In the example, repartition produced uneven output for 2 partitions but coalesce produced an even result. Please explain?
@kalyanreddy496 1 year ago
Good evening. I recently came across a question in a Capgemini client interview. Consider a scenario: a 2 GB file is distributed in Hadoop. After some transformations we have 10 DataFrames. After applying repartition(1), all the data sits in one DataFrame of size 1.8 GB, but the data node size is only 1 GB. Will this 1.8 GB fit on the data node or not? If yes, how? If no, what error will it give? Requesting you, sir, please tell me the answer to this question.
@avisinha2844 1 year ago
Hello sir, I have a small doubt. When we read 3 separate files into a single DataFrame at 14:03, why is the number of partitions 3, when the default partition size is 128 MB and the DataFrame containing the 3 files is much smaller than 128 MB?
@rajasdataengineering7585 1 year ago
I will post another video on this concept that explains it in detail.
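One possible explanation for the 3 partitions is the way files are packed into read partitions. The sketch below is a simplified pure-Python model of that planning step, not Spark's own code. It assumes the default `spark.sql.files.maxPartitionBytes` of 128 MB, a per-file open cost of 4 MB (`spark.sql.files.openCostInBytes`) and a default parallelism of 8; the real logic depends on the Spark version, and the function name `plan_file_partitions` is hypothetical.

```python
MB = 1024 * 1024


def plan_file_partitions(file_sizes, max_partition_bytes=128 * MB,
                         open_cost_bytes=4 * MB, min_partition_num=8):
    """Group file chunks into read partitions (simplified model)."""
    # Every file is "charged" an extra open cost when sizing the splits.
    total = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total // min_partition_num
    max_split = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))

    # Assumption: all files are splittable, so cut them into chunks.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(max_split, size - offset))
            offset += max_split
    chunks.sort(reverse=True)

    # Pack chunks into partitions, adding the open cost for each chunk.
    partitions, current, current_size = [], [], 0
    for chunk in chunks:
        if current and current_size + chunk > max_split:
            partitions.append(current)
            current, current_size = [], 0
        current_size += chunk + open_cost_bytes
        current.append(chunk)
    if current:
        partitions.append(current)
    return partitions


three_small_files = plan_file_partitions([1024, 1024, 1024])
one_large_file = plan_file_partitions([1024 * MB])

print("3 files of 1 KB ->", len(three_small_files), "partitions")
print("1 file of 1 GB  ->", len(one_large_file), "partitions")
```

Under these assumptions the open cost charged per file is already as large as the split size, so each small file lands in its own partition, while the 1 GB file is cut into 128 MB chunks.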
@raghavendarsaikumar 1 year ago
I see executors 1 and 2 in the picture before coalesce or repartition, but after the operation both are shown as executor 1. Is the picture wrong, or does this operation reduce the number of executors as well?
@rajasdataengineering7585 1 year ago
Good catch. It's a mistake in the picture. Repartition and coalesce have nothing to do with the number of executors.
@suresh.suthar.24 1 year ago
Hello Raja sir, a few days ago I gave an interview where they asked: if we want to create 1 partition from multiple partitions, which method would you choose, coalesce or repartition? I answered coalesce, but they said we should use repartition. Is that correct?
@rajasdataengineering7585 1 year ago
Hi Suresh, in this case the number of partitions needs to be reduced. Both coalesce and repartition can reduce the number of partitions, but choosing between them depends heavily on the use case, so you should have asked the interviewer for more input to understand the use case better. If many transformations will be applied after resizing the partitions, repartition is the better choice. Otherwise coalesce is the better choice.
@suresh.suthar.24 1 year ago
@@rajasdataengineering7585 thanks 🙏
@AIFashionistaGuide 1 year ago
****************************** 1. Performance Tuning ******************************
1. Performance Optimization | Repartition vs Coalesce
-- Spark is known for its speed. The speed comes from parallel computing, and parallel computing comes from partitioning.
-- Partitioning is the key to parallel processing.
-- If we design the partitions well, performance improves automatically.
-- Hence partitioning plays an important role in error handling, debugging and performance.
-- While partitioning we must get two things right:
1. Right size of partitions
-- Scenario: 2 partitions of 1000 MB and 10 MB. The 10 MB one finishes faster and its core then sits idle, which is not good.
2. Right number of partitions
-- Scenario: the executors have 16 cores but only 10 partitions are created.
-- Each partition is picked up by one core and a partition cannot be shared among cores, so 10 cores work and 6 cores stay idle.
-- Choose 16 partitions, or at least a multiple of the available cores. With 32 partitions, the 16 cores pick 16 partitions in the 1st iteration and the next 16 in the 2nd iteration, so no core is idle.
spark.default.parallelism
-- Introduced with RDD, so this property applies only to RDDs.
-- The default value is the total number of cores on all nodes in the cluster; in local mode it is the number of cores on your system (for example 8 on an 8-core machine, which creates 8 partitions by default).
-- For RDDs, wide transformations like reduceByKey(), groupByKey() and join() trigger data shuffling.
spark.sql.files.maxPartitionBytes
-- When data is read from external files, partitions are created based on this parameter.
-- It is the maximum number of bytes to pack into a single partition when reading files.
-- It is effective only for file-based sources such as Parquet, JSON and ORC.
-- Default size is 128 MB.
The above 2 parameters are configurable depending on your need.
DataFrame.repartition()
-- pyspark.sql.DataFrame.repartition() is used to increase or decrease the number of RDD/DataFrame partitions, by a number of partitions, by a single column name, or by multiple column names.
-- It takes 2 parameters, numPartitions and *cols; when one is specified the other is optional.
-- repartition() is a wide transformation that shuffles the data, so it is considered an expensive operation.
Key Points
• repartition() is used to increase or decrease the number of partitions.
• repartition() creates more even partitions than coalesce().
• It is a wide transformation.
• It is an expensive operation, as it involves a data shuffle and consumes more resources.
• repartition() can take an int or column names as parameters to define how to partition.
• If only columns are given, the number of partitions comes from spark.sql.shuffle.partitions (200 by default).
• For performance, avoid this function unless it is really needed.
coalesce()
-- Spark DataFrame coalesce() is used only to decrease the number of partitions.
-- It is an optimized version of repartition() in which less data moves across partitions.
-- coalesce() does not require a full shuffle: it combines a few partitions, or moves data only from a few partitions, thus avoiding a full shuffle.
-- Because partitions are merged, it produces partitions of uneven size.
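The "right number of partitions" point in these notes can be checked with a little arithmetic. This is a minimal sketch that assumes every task takes the same time and one core processes one partition at a time; `core_utilization` is a hypothetical helper, not a Spark function.

```python
import math


def core_utilization(num_partitions, num_cores):
    """Return (waves, idle cores in the last wave) for a simple model."""
    if num_partitions < 1 or num_cores < 1:
        raise ValueError("num_partitions and num_cores must be at least 1")
    # A "wave" is one round in which every core picks up at most one partition.
    waves = math.ceil(num_partitions / num_cores)
    busy_in_last_wave = num_partitions - (waves - 1) * num_cores
    return waves, num_cores - busy_in_last_wave


for partitions in (10, 16, 20, 32):
    waves, idle = core_utilization(partitions, num_cores=16)
    print(f"{partitions} partitions on 16 cores: {waves} wave(s), "
          f"{idle} idle core(s) in the last wave")
```

With 10 partitions on 16 cores, 6 cores have nothing to do, as in the scenario above; with 16 or 32 partitions, no core is idle in any wave.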
@kalyanreddy496 1 year ago
Good afternoon sir. Requesting you to answer this question, which I recently faced in an interview. Consider that you have read a 1 GB file into a DataFrame, with maxPartitionBytes set to 128 MB. You then apply repartition(4) or coalesce(4) on the DataFrame. Either method decreases the number of partitions, so each partition grows beyond 128 MB even though maxPartitionBytes is configured as 128 MB. Does it throw an error or not? If it throws an error, what error do we get when we run the program? If not, how does Spark behave in this scenario? Could you tell me the answer, sir?
@rajasdataengineering7585 1 year ago
The 'maxPartitionBytes' configuration plays its role while ingesting data from an external system into Spark memory. Once the data is loaded, partition sizes can vary with the transformations applied and have nothing to do with maxPartitionBytes. So in this case it won't throw any error. Coalesce would produce unevenly distributed partitions, whereas repartition would create evenly distributed partitions. Hope this clarifies your doubt. Thanks for sharing your interview experience; others in this community can benefit from it.
@kalyanreddy496 1 year ago
@@rajasdataengineering7585 Thank you very much sir, I understand. If possible, please do a video on this question so that we can understand it better visually 🙏
@rajasdataengineering7585 1 year ago
Sure Kalyan, will create a video on this requirement
@da8233 1 year ago
Thank you so much, it's a wonderful explanation.
@rajasdataengineering7585 1 year ago
Thank you
@vamsi.reddy1100 1 year ago
One doubt! When we used repartition(2), we got unevenly distributed partitions, i.e. 8 in the 1st partition and 2 in the other. But repartition should give us evenly distributed partitions, right? Please help me understand.
@rajasdataengineering7585 1 year ago
Hi Vamsi, good question. Repartition does distribute data evenly. Here we see some difference because of the small dataset. From Spark's point of view, 2 rows and 8 rows are almost the same. We can see the difference between repartition and coalesce when dealing with huge amounts of data, like millions or billions of rows.
@vamsi.reddy1100 1 year ago
@@rajasdataengineering7585 thank you for the clarification.
@vamsi.reddy1100 1 year ago
@@rajasdataengineering7585 your videos are so good...
@rajasdataengineering7585 1 year ago
Thank you
@amiyaroy6789 3 months ago
@@rajasdataengineering7585 I had the same question, thank you for explaining!
@CoopmanGreg 1 year ago
👍
@rajasdataengineering7585 1 year ago
👍🏻
@a2zhi976 1 year ago
In the code I see sc.parallelize(range(100), 1). Where does sc come from?
@rajasdataengineering7585 1 year ago
In Databricks, the Spark context is available implicitly as sc; there is no need to define it separately.
@shreyanvinjamuri 1 year ago
Is sc.defaultParallelism for RDDs, and does it only work with RDDs? Was spark.sql.shuffle.partitions introduced with DataFrame, and does it only work with DataFrames?