Raja, please make this clear: the default number of partitions for an RDD/Dataset is 8 and the default partition size is 128 MB, while the default number of shuffle partitions is 200 (with a 128 MB size as well). Does that mean shuffle partitioning is applied on the worker nodes while RDD/Dataset partitioning is set on the driver node? Please share your inputs on this.
@antonyvinothans6735 24 days ago
05:18 , so does that mean cores and partitions are the same?
@rajasdataengineering7585 24 days ago
A core is a computing unit, whereas a partition is a split unit of data.
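Raja's distinction can be sketched in plain Python (toy numbers, not Spark defaults): a partition is a chunk of data, a core is a compute slot. Each partition becomes one task, and only as many tasks as there are cores run at once, so partitions are processed in waves.

```python
import math

# Illustrative sketch of cores vs. partitions (toy numbers, not a Spark API):
# each partition becomes one task, and at most `num_cores` tasks run at a time.
num_partitions = 8   # the data is split into 8 chunks
num_cores = 4        # the cluster can execute 4 tasks concurrently

waves = math.ceil(num_partitions / num_cores)
print(f"{num_partitions} partitions on {num_cores} cores -> {waves} waves of tasks")
```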
@antonyvinothans6735 24 days ago
@@rajasdataengineering7585 Thanks
@gustavorocha9774 2 years ago
Thank you! Excellent content!
@rajasdataengineering7585 2 years ago
Thank you Gustavo!
@dhananjaymali6242 2 years ago
Your content is really good 👍😇
@rajasdataengineering7585 2 years ago
Thank you Dhananjay
@ajaykiranchundi9979 2 years ago
Excellent Raja
@abhaybisht101 2 years ago
Great content, Raja. Please make one detailed video on Spark performance and optimizations.
@rajasdataengineering7585 2 years ago
Thanks Abhay. Sure, I will make a series of videos on performance optimization.
@abhaybisht101 2 years ago
Thanks Raja 🤟
@PavanKumar-tt8mm 2 years ago
@@rajasdataengineering7585 Waiting for the series, Raja.
@PavanKumar-tt8mm 2 years ago
Expecting some more concepts on PySpark, Raja. Good effort!
@rajasdataengineering7585 2 years ago
Sure Pavan, will do
@omprakashreddy4230 2 years ago
So, if I understand correctly: after you set some value for shuffle partitions, and you still don't get the expected performance after shuffling, then we go with repartition or coalesce, right?
@rajasdataengineering7585 2 years ago
Hi Omprakash, good question. Partitioning plays an important role in three phases of any Spark application:
1. While reading data into the Spark environment (from ADLS, S3, etc.)
2. While writing data to the target storage location
3. While performing certain transformations
Repartition (coalesce, maxBytesPerFile and maxRecordsPerFile are other options) is mainly used while reading and writing, not for transformations. So when you don't get the desired performance boost from the shuffle parameter, you still need to find a better shuffle partition number. There are also numerous performance optimization techniques apart from the shuffle parameter, so based on the use case we can apply other applicable methods as well. Hope it helps.
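The read phase in point 1 can be illustrated with a rough calculation (plain Python, assuming only the documented 128 MB default of `spark.sql.files.maxPartitionBytes`; real query planning also weighs file open cost and default parallelism, so treat this as a lower-bound sketch):

```python
import math

# Rough estimate of read partitions: Spark splits input files into chunks
# of about spark.sql.files.maxPartitionBytes (128 MB by default).
MAX_PARTITION_BYTES = 128 * 1024 * 1024

def estimated_read_partitions(file_size_bytes):
    # Lower-bound estimate; actual planning also considers
    # spark.sql.files.openCostInBytes and the default parallelism.
    return max(1, math.ceil(file_size_bytes / MAX_PARTITION_BYTES))

print(estimated_read_partitions(1 * 1024**3))  # a 1 GB file -> about 8 partitions
```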
@omprakashreddy4230 2 years ago
Thanks, Raja, for such a detailed explanation.
@aayushisaxena4549 1 year ago
I have a 500 GB output dataframe with no aggregates or joins, and it needs to be written to a table. Will repartition or shuffle operations improve parallelism?
@rajasdataengineering7585 1 year ago
No, it will hurt performance. Repartitioning or shuffling the data is suitable only if we have many transformations after the shuffle, rather than just a write.
@aayushisaxena4549 1 year ago
Thank you.
@aayushisaxena4549 1 year ago
Thank you for the prompt reply. In my situation, how should I improve the write time to a table? Will increasing any parallelism help?
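One common back-of-envelope approach for a large plain write (a sketch with an assumed target file size, not a universal rule): derive the partition count from the data volume and a sensible output file size, then reach it with `coalesce` when reducing the partition count, since `coalesce` avoids a full shuffle.

```python
import math

# Back-of-envelope sizing for writing ~500 GB (illustrative numbers):
# choose the partition count so each output file lands near a target size.
# Then df.coalesce(target_partitions) if reducing the count (no shuffle),
# or df.repartition(target_partitions) if increasing it.
TOTAL_MB = 500 * 1024    # 500 GB of output data
TARGET_FILE_MB = 256     # assumed comfortable output file size

target_partitions = math.ceil(TOTAL_MB / TARGET_FILE_MB)
print(f"write with ~{target_partitions} partitions -> ~{TARGET_FILE_MB} MB files")
```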
@sravankumar1767 2 years ago
Nice explanation 👌
@rajasdataengineering7585 2 years ago
Thank you Sravan
@sumitkumarsingh6554 5 months ago
Well explained, kudos.
@rajasdataengineering7585 5 months ago
Thank you! Keep watching
@gurumoorthysivakolunthu9878 1 year ago
Hi Sir... the shuffling parameter is just a count, right? That is, the number of partitions that can be shuffled between executors or nodes between stages, not the size of each partition, right, Sir? Please help... Also, please make a video about executors, drivers, and tasks, i.e. the full life cycle or flow of how a Spark job is executed. Thank you, Sir.
@rajasdataengineering7585 1 year ago
The shuffle partition setting is the number of partitions that will be created when there is a shuffle process; it is not a partition size. Sure, I will post a video on the full life cycle.
@gurumoorthysivakolunthu9878 1 year ago
@@rajasdataengineering7585 Thank you, Sir. If we enable it, will Adaptive Query Execution take care of this feature/functionality?
@rajasdataengineering7585 1 year ago
Yes, AQE will apply coalesce wherever needed.
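For reference, the runtime coalescing described here is governed by these Spark 3.x settings, shown as a `spark-defaults.conf` fragment (the advisory size value is an illustrative choice; check the defaults for your Spark version):

```
spark.sql.adaptive.enabled                       true
spark.sql.adaptive.coalescePartitions.enabled    true
spark.sql.adaptive.advisoryPartitionSizeInBytes  64m
```

With these enabled, Spark can merge small shuffle partitions at runtime instead of always producing `spark.sql.shuffle.partitions` tasks.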
@gurumoorthysivakolunthu9878 1 year ago
@@rajasdataengineering7585 Great, Sir... Thank you for your kind heart in helping with our doubts. Thank you, Sir.
@realMujeeb 11 months ago
Awesome
@rajasdataengineering7585 11 months ago
Thanks
@BlingKing321 1 year ago
Why will there be disk and network overhead for small files? Even for big files, disk and network overhead will be there.
@shwetankagrawal4253 2 years ago
How does spark.sql.shuffle.partitions, which is 200 by default, work here, when the number of partitions should be the same as the number of unique values in a key?
@shubhamyadav-vd9gv 2 years ago
If we do a repartition based on the key, then it will give us partitions based on the unique values in the key.
@sotos47 6 months ago
Nice question, I was wondering too: since shuffling puts the data for each unique key in a partition, the number of partitions after shuffling should equal the number of unique key values. How do 200 partitions work?
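On the question above: a shuffle does not create one partition per key. With hash partitioning, each key is routed to `hash(key) % numPartitions`, so any number of distinct keys is folded into the 200 shuffle partitions, with many keys sharing a partition. A plain-Python sketch of the idea (Spark uses its own hash function internally, not Python's):

```python
# Plain-Python sketch of hash partitioning: each key maps to
# hash(key) % numPartitions, so 1000 distinct keys still land
# in at most 200 shuffle partitions.
NUM_SHUFFLE_PARTITIONS = 200  # the spark.sql.shuffle.partitions default

def partition_for(key, num_partitions=NUM_SHUFFLE_PARTITIONS):
    """Return the shuffle partition this key is routed to."""
    return hash(key) % num_partitions

keys = [f"customer_{i}" for i in range(1000)]   # 1000 distinct keys
assignments = {k: partition_for(k) for k in keys}
used = set(assignments.values())
print(f"{len(keys)} keys spread over {len(used)} of {NUM_SHUFFLE_PARTITIONS} partitions")
```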