Raja, please make this clear: the default number of partitions for an RDD/Dataset is 8 and the default partition size is 128 MB, while the default number of shuffle partitions is 200 (with a 128 MB size as well). Does that mean shuffle partitioning is applied on the worker nodes while RDD/Dataset partitioning is set on the driver node? Please share your inputs on this.
@antonyvinothans6735 24 days ago
05:18 , so does that mean cores and partitions are the same?
@rajasdataengineering7585 24 days ago
A core is a computing unit, whereas a partition is a split unit of data.
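Raja's distinction can be sketched in plain Python (toy numbers, not Spark defaults): a partition is a chunk of data, a core is a compute slot. Each partition becomes one task, and only as many tasks as there are cores run at once, so partitions are processed in waves.

```python
import math

# Illustrative sketch of cores vs. partitions (toy numbers, not a Spark API):
# each partition becomes one task, and at most `num_cores` tasks run at a time.
num_partitions = 8   # the data is split into 8 chunks
num_cores = 4        # the cluster can execute 4 tasks concurrently

waves = math.ceil(num_partitions / num_cores)
print(f"{num_partitions} partitions on {num_cores} cores -> {waves} waves of tasks")
```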
@antonyvinothans6735 24 days ago
@@rajasdataengineering7585 Thanks
@gustavorocha9774 2 years ago
Thank you! Excellent content!
@rajasdataengineering7585 2 years ago
Thank you Gustavo!
@dhananjaymali6242 2 years ago
Your content is really good 👍😇
@rajasdataengineering7585 2 years ago
Thank you Dhananjay
@ajaykiranchundi9979 2 years ago
Excellent Raja
@abhaybisht101 2 years ago
Great content, Raja. Please make one detailed video on Spark performance and optimizations.
@rajasdataengineering7585 2 years ago
Thanks Abhay. Sure, I will make a series of videos on performance optimization.
@abhaybisht101 2 years ago
Thanks Raja 🤟
@PavanKumar-tt8mm 2 years ago
@@rajasdataengineering7585 Waiting for the series, Raja.
@PavanKumar-tt8mm 2 years ago
Expecting some more concepts on PySpark, Raja. Good effort!
@rajasdataengineering7585 2 years ago
Sure Pavan, will do
@omprakashreddy4230 2 years ago
So, if I understand correctly: after you set some value for shuffle partitions, and you still don't get the expected performance after shuffling, then we go with repartition or coalesce, right?
@rajasdataengineering7585 2 years ago
Hi Omprakash, good question. Partitioning plays an important role in three phases of any Spark application:
1. While reading data into the Spark environment (from ADLS, S3, etc.)
2. While writing data to the target storage location
3. While performing certain transformations
Repartition (coalesce, maxBytesPerFile and maxRecordsPerFile are other options) is mainly used while reading and writing, not for transformations. So when you don't get the desired performance boost from the shuffle parameter, you still need to find a better shuffle partition number. There are also numerous performance optimization techniques apart from the shuffle parameter, so based on the use case we can apply other applicable methods as well. Hope it helps.
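The read phase in point 1 can be illustrated with a rough calculation (plain Python, assuming only the documented 128 MB default of `spark.sql.files.maxPartitionBytes`; real query planning also weighs file open cost and default parallelism, so treat this as a lower-bound sketch):

```python
import math

# Rough estimate of read partitions: Spark splits input files into chunks
# of about spark.sql.files.maxPartitionBytes (128 MB by default).
MAX_PARTITION_BYTES = 128 * 1024 * 1024

def estimated_read_partitions(file_size_bytes):
    # Lower-bound estimate; actual planning also considers
    # spark.sql.files.openCostInBytes and the default parallelism.
    return max(1, math.ceil(file_size_bytes / MAX_PARTITION_BYTES))

print(estimated_read_partitions(1 * 1024**3))  # a 1 GB file -> about 8 partitions
```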
@omprakashreddy4230 2 years ago
Thanks, Raja, for such a detailed explanation.
@aayushisaxena4549 1 year ago
I have a 500 GB output dataframe with no aggregates or joins, and it needs to be written to a table. Will repartition or shuffle operations improve parallelism?
@rajasdataengineering7585 1 year ago
No, it will hurt performance. Repartitioning or shuffling the data is suitable only if we have many transformations after the shuffle, rather than just a write.
@aayushisaxena4549 1 year ago
Thank you.
@aayushisaxena4549 1 year ago
Thank you for the prompt reply. In my situation, how should I improve the write time to a table? Will increasing any parallelism help?
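One common back-of-envelope approach for a large plain write (a sketch with an assumed target file size, not a universal rule): derive the partition count from the data volume and a sensible output file size, then reach it with `coalesce` when reducing the partition count, since `coalesce` avoids a full shuffle.

```python
import math

# Back-of-envelope sizing for writing ~500 GB (illustrative numbers):
# choose the partition count so each output file lands near a target size.
# Then df.coalesce(target_partitions) if reducing the count (no shuffle),
# or df.repartition(target_partitions) if increasing it.
TOTAL_MB = 500 * 1024    # 500 GB of output data
TARGET_FILE_MB = 256     # assumed comfortable output file size

target_partitions = math.ceil(TOTAL_MB / TARGET_FILE_MB)
print(f"write with ~{target_partitions} partitions -> ~{TARGET_FILE_MB} MB files")
```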
@sravankumar1767 2 years ago
Nice explanation 👌
@rajasdataengineering7585 2 years ago
Thank you Sravan
@sumitkumarsingh6554 5 months ago
Well explained, kudos.
@rajasdataengineering7585 5 months ago
Thank you! Keep watching
@gurumoorthysivakolunthu9878 1 year ago
Hi Sir... the shuffling parameter is just a count, right? That is, the number of partitions that can be shuffled between executors or nodes between stages, not the size of each partition, right, Sir? Please help... Also, please make a video about executors, drivers, and tasks, i.e. the full life cycle or flow of how a Spark job is executed. Thank you, Sir.
@rajasdataengineering7585 1 year ago
The shuffle partition setting is the number of partitions that will be created when there is a shuffle process; it is not a partition size. Sure, I will post a video on the full life cycle.
@gurumoorthysivakolunthu9878 1 year ago
@@rajasdataengineering7585 Thank you, Sir. If we enable it, will Adaptive Query Execution take care of this feature/functionality?
@rajasdataengineering7585 1 year ago
Yes, AQE will apply coalesce wherever needed.
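For reference, the runtime coalescing described here is governed by these Spark 3.x settings, shown as a `spark-defaults.conf` fragment (the advisory size value is an illustrative choice; check the defaults for your Spark version):

```
spark.sql.adaptive.enabled                       true
spark.sql.adaptive.coalescePartitions.enabled    true
spark.sql.adaptive.advisoryPartitionSizeInBytes  64m
```

With these enabled, Spark can merge small shuffle partitions at runtime instead of always producing `spark.sql.shuffle.partitions` tasks.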
@gurumoorthysivakolunthu9878 1 year ago
@@rajasdataengineering7585 Great, Sir... Thank you for your kind heart in helping with our doubts. Thank you, Sir.
@realMujeeb 11 months ago
Awesome
@rajasdataengineering7585 11 months ago
Thanks
@BlingKing321 1 year ago
Why will there be disk and network overhead for small files? Even for big files, disk and network overhead will be there.
@shwetankagrawal4253 2 years ago
How does spark.sql.shuffle.partitions, which is 200 by default, work here, when the number of partitions should be the same as the number of unique values in a key?
@shubhamyadav-vd9gv 2 years ago
If we do a repartition based on the key, then it will give us partitions based on the unique values in the key.
@sotos47 6 months ago
Nice question, I was wondering too: since shuffling puts the data for each unique key in a partition, the number of partitions after shuffling should equal the number of unique key values. How do 200 partitions work?
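On the question above: a shuffle does not create one partition per key. With hash partitioning, each key is routed to `hash(key) % numPartitions`, so any number of distinct keys is folded into the 200 shuffle partitions, with many keys sharing a partition. A plain-Python sketch of the idea (Spark uses its own hash function internally, not Python's):

```python
# Plain-Python sketch of hash partitioning: each key maps to
# hash(key) % numPartitions, so 1000 distinct keys still land
# in at most 200 shuffle partitions.
NUM_SHUFFLE_PARTITIONS = 200  # the spark.sql.shuffle.partitions default

def partition_for(key, num_partitions=NUM_SHUFFLE_PARTITIONS):
    """Return the shuffle partition this key is routed to."""
    return hash(key) % num_partitions

keys = [f"customer_{i}" for i in range(1000)]   # 1000 distinct keys
assignments = {k: partition_for(k) for k in keys}
used = set(assignments.values())
print(f"{len(keys)} keys spread over {len(used)} of {NUM_SHUFFLE_PARTITIONS} partitions")
```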