Hash Partitioning vs Range Partitioning | Spark Interview Questions

21,576 views

Data Savvy


This video is part of the Spark learning series. Spark provides different methods to optimize query performance, and in this video we cover the following (a short code sketch contrasting hash and range partitioning follows the list):
What is partitioning
Hash partitioning
Range partitioning
Why choosing the right partitioning strategy matters
How Spark's performance is impacted by Dynamic Partition Pruning
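Here is a minimal PySpark sketch (the DataFrame and column name are invented for illustration) contrasting the two strategies: repartition hash-partitions rows by the given column, while repartitionByRange samples the column and splits it into ordered ranges:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

# Toy DataFrame; in practice this would be your real data.
df = spark.range(0, 1_000_000).withColumnRenamed("id", "user_id")

# Hash partitioning: each row goes to a partition chosen by a hash of
# user_id, so equal keys share a partition but partitions are unordered.
hash_df = df.repartition(8, "user_id")

# Range partitioning: Spark samples user_id, computes range boundaries,
# and places each row in the partition whose range covers its key.
# Useful before sorts and range filters because partitions are ordered.
range_df = df.repartitionByRange(8, "user_id")

print(hash_df.rdd.getNumPartitions())   # 8
print(range_df.rdd.getNumPartitions())  # 8
```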
Here are a few links useful for you:
Git repo: github.com/har...
Spark interview questions: • Spark Interview Questions
Spark performance tuning:
If you are interested in joining our community, please join one of the following groups:
Telegram: t.me/bigdata_hkr
WhatsApp: chat.whatsapp....
You can drop me an email with any queries at
aforalgo@gmail.com
#apachespark #sparktutorial #bigdata
#spark #hadoop #spark3

Comments: 48
@sarfarazhussain6883
@sarfarazhussain6883 4 years ago
Default behaviour: if we use YARN, the number of partitions equals the number of HDFS blocks. In local or standalone mode, the number of partitions is at most the number of cores available on the system.
@DataSavvy
@DataSavvy 4 years ago
Thanks :)
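A quick way to inspect these defaults on your own setup (a minimal sketch; the exact numbers depend on the master URL and cluster configuration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("defaults").getOrCreate()
sc = spark.sparkContext

# In local[4] this is 4 (the cores given to the local master); on YARN it is
# typically the total executor cores, or spark.default.parallelism if set.
print(sc.defaultParallelism)

# parallelize() without an explicit count uses defaultParallelism partitions.
rdd = sc.parallelize(range(100))
print(rdd.getNumPartitions())
```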
@kiranmudradi26
@kiranmudradi26 4 years ago
@DataSavvy and @Sarfaraz: if Spark is reading from a non-distributed file system other than HDFS, what would be the default/initial number of partitions and the partition size?
@DataSavvy
@DataSavvy 4 years ago
In Spark 2.x I think it is 4 partitions... in Spark 3.x it is 6.
@kiranmudradi26
@kiranmudradi26 4 years ago
@DataSavvy Thanks.
@amitprasad8114
@amitprasad8114 3 years ago
Your explanation creates a very good mental map. Thank you!
@saikumarmora6409
@saikumarmora6409 4 years ago
While reading the data, it depends on the file size. By default the partition size is 128 MB, so if we have an input file of 10 × 128 MB, it will be divided into 10 partitions. We can also use spark.sql.files.maxPartitionBytes to set the partition size. Please correct me if I'm wrong.
@DataSavvy
@DataSavvy 4 years ago
Your understanding is right.
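To try this out, a sketch along those lines (the file path is hypothetical; the resulting count depends on the file's size and format):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-partitions").getOrCreate()

# Cap each read partition at 128 MB (the default); lower it to get more,
# smaller partitions when reading a large file.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))

# Hypothetical ~1.3 GB CSV file: expect roughly file_size / 128 MB partitions.
df = spark.read.csv("/data/events.csv", header=True)
print(df.rdd.getNumPartitions())
```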
@big-bang-movies
@big-bang-movies 1 year ago
Only high-level theory behind partitioning is covered; was expecting some hands-on.
@yardy88
@yardy88 2 years ago
Very well explained! Thank you! 😊
@anubhavkarelia9585
@anubhavkarelia9585 3 years ago
By default the partition size is 128 MB, so when a file is read, Spark automatically computes file_size / 128 MB and divides it into partitions accordingly. We can also change the partition size in Spark through config.
@Laughrider
@Laughrider 4 years ago
Spark decides the number of partitions on the basis of block size. I am not sure, but please do answer this question; I was asked it in an interview.
@sachingajwa8839
@sachingajwa8839 2 years ago
Spark uses default partitioning when it reads data from a file. The default partitioning splits the data based on file size and creates a partition for each 128 MB of data.
@vijeandran
@vijeandran 3 years ago
Very informative...
@mohankrishna4593
@mohankrishna4593 3 years ago
Hi Sir, all your channel videos are very helpful for us. Thanks a lot for the amazing content. Could you please answer this question: how many initial partitions does Spark create when we read a table/view from a data source like Oracle, Snowflake, SAP, etc.?
@hishailesh77
@hishailesh77 3 years ago
Spark decides the number of partitions based on a combination of factors: the default parallelism (usually equal to the number of cores), the total number of files and the size of each file, and the max partition size (default 128 MB). Two scenarios:
a) 54 parquet files of 63 MB each, 10 cores, max partition size 128 MB: total partitions = 54, since the split size is 63 MB + 4 MB (openCostInBytes) = 67 MB, so only one split fits in one partition.
b) 54 parquet files of 38 MB each, 10 cores, max partition size 128 MB: total partitions = 18, since the split size is 38 MB + 4 MB (openCostInBytes) = 42 MB, so three splits fit in one partition (128/42).
Apart from this, if we set the Spark default parallelism very high, it also affects the number of partitions and we would get different numbers for the above scenarios (will do the math later). BTW, thanks for putting this series together; it's really helpful.
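For reference, the two file-read knobs mentioned above can be set like this (the path is hypothetical, and the arithmetic in the comments simply mirrors the commenter's scenarios; Spark's actual split-packing logic also factors in total bytes per core, so treat the counts as approximate):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("file-split-demo")
         # Max bytes packed into one read partition (default 128 MB).
         .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
         # Estimated cost of opening a file, added per file when splits are
         # packed together (default 4 MB).
         .config("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))
         .getOrCreate())

# Rough intuition from the comment above:
#   63 MB files -> 63 + 4 = 67 MB each; only one fits under 128 MB,
#   so 54 files give ~54 partitions.
#   38 MB files -> 38 + 4 = 42 MB each; three fit under 128 MB (128/42),
#   so 54 files give ~18 partitions.
df = spark.read.parquet("/data/sales/")   # hypothetical directory of parquet files
print(df.rdd.getNumPartitions())
```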
@aneksingh4496
@aneksingh4496 4 years ago
We have to provide the number of partitions manually, say via repartition(), and then invoke partitionBy; otherwise Spark will take the default from spark.sql.shuffle.partitions, which is 200.
@DataSavvy
@DataSavvy 4 years ago
Those are ways to enforce a number manually. Otherwise Spark will create one partition per core when it is writing a new file. When Spark is reading a new file, it will be based on HDFS blocks.
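A small sketch of that flow (input/output paths and the country column are made up): repartition controls the in-memory shuffle partitions, write.partitionBy controls the directory layout on disk, and spark.sql.shuffle.partitions (default 200) is what shuffles fall back to when no number is given:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-partitions").getOrCreate()

# Joins, aggregations, and other shuffles default to this many partitions.
spark.conf.set("spark.sql.shuffle.partitions", "200")

df = spark.read.parquet("/data/orders/")          # hypothetical input

# Explicitly hash-partition into 50 partitions on country, then write one
# directory per country value with partitionBy.
(df.repartition(50, "country")
   .write
   .mode("overwrite")
   .partitionBy("country")
   .parquet("/data/orders_by_country/"))          # hypothetical output
```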
@sadeeshkumar1484
@sadeeshkumar1484 3 years ago
If the file is from HDFS, then by default the 128 MB block size determines the number of partitions. If it is from the local file system, then by default a 64 MB block size is used and the number of partitions follows from that. Correct me if I'm wrong.
@veerap3878
@veerap3878 2 years ago
Is there a difference between reading data from Hive using the HiveContext and using the JDBC driver? When should we use the JDBC driver vs. the HiveContext?
@MrVivekc
@MrVivekc 3 years ago
Partitions while reading a file: (total file size) / (128 MB).
@srinivasasameer9615
@srinivasasameer9615 4 years ago
Spark chooses based on the HDFS block size by default, or on the number of cores we pass through spark-submit in local[]. spark.sql.shuffle.partitions is 200 by default; it is not that the default partition count is 200. Hope I am right; correct me if I am wrong.
@ahyanroboking9237
@ahyanroboking9237 2 years ago
In another session you mentioned that reading a large partition file can cause an OutOfMemory error in the executor, but in these discussions the 128 MB block size is taken as the partition size while Spark reads it. Then how can a large partition file be a reason for an executor OutOfMemory error?
@ammeejurinaveenkumar6874
@ammeejurinaveenkumar6874 5 months ago
If you have a very large file and you're not explicitly repartitioning it in Spark, Spark may create only a few partitions to process the data. If these partitions are too large, they might not fit into the memory of individual executor nodes, leading to OutOfMemory errors. For example, if you have a 10 GB file and Spark decides to create only 2 partitions, each partition would be approximately 5 GB in size. If your executor nodes have limited memory (often the case in distributed environments), trying to process a 5 GB partition might exceed the memory capacity of the executor, leading to an OutOfMemory error. In your case, if the large partition file holds 30 GB of data and you allocate only 10 cores/tasks to run the partitions, loading the file into a DataFrame can cause an OOM error. Hope your doubt is resolved now. :)
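A hedged illustration of the fix described above (the path and partition count are invented): explicitly repartition a large input so each task handles a smaller slice:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avoid-oom").getOrCreate()

# Hypothetical large input. If Spark creates too few read partitions, a
# single task may have to hold several GB and can OOM the executor.
df = spark.read.json("/data/big_events.json")
print(df.rdd.getNumPartitions())

# Spread the data over more, smaller partitions (here ~240, aiming for
# roughly 128 MB per partition on a ~30 GB input) before heavy processing.
df = df.repartition(240)
```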
@vamshi878
@vamshi878 4 years ago
Hi, partitionBy doesn't perform a shuffle. Will data move across nodes?
@DataSavvy
@DataSavvy 4 years ago
I meant when you repartition data.
@ishansharma4276
@ishansharma4276 3 years ago
Spark decides the number of partitions based on the key. So if there are 4 kinds of keys, let's say x_1, y_1, z_1, t_1, then there will be 4 partitions of the file.
@pardeep657
@pardeep657 3 years ago
Is this a memory partitioning or a disk partitioning technique? Isn't memory partitioning costly in itself?
@Capt_X
@Capt_X 3 years ago
Thank you for making it so simple to understand! How can we distribute 8 GB of records evenly after filtering and joining on another dataset? I see different numbers of partitions in different stages of the last job in the Application Master when I perform the action of saving the DataFrame to CSV. Will this problem be solved by increasing the number of executors / executor memory / driver memory?
@DataSavvy
@DataSavvy 3 years ago
Hi... I'm sorry, I did not understand your question properly.
@sreenivasmekala6198
@sreenivasmekala6198 3 years ago
Hi Sir, how is the hash code of a record decided in hash partitioning?
@bhanukumarsingh6272
@bhanukumarsingh6272 3 years ago
Spark decides the number of partitions based on the number of blocks of the files.
@rajdeepsinghborana2409
@rajdeepsinghborana2409 4 years ago
Sir, is there any free online lab (platform) for practicing big data / Hadoop? 👨🏻‍💻
@DataSavvy
@DataSavvy 4 years ago
You can use Databricks Community Edition for practice.
@vsandeep06
@vsandeep06 4 years ago
The number of partitions depends on the total number of cores in the worker nodes.
@DataSavvy
@DataSavvy 4 years ago
This statement is not always true... rather, that only determines how many tasks can execute in parallel.
@adityakvs3529
@adityakvs3529 12 days ago
How is the hash code decided?
@DataSavvy
@DataSavvy 11 days ago
The hash code is calculated using a hash algorithm.
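To make the reply concrete: for RDDs, Spark's HashPartitioner assigns a key to partition hash(key) mod numPartitions (DataFrame repartition uses a Murmur3 hash of the columns instead). A minimal PySpark sketch of the same idea, with the hash-and-modulo logic written out as an explicit partition function:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hash-demo").getOrCreate()
sc = spark.sparkContext

NUM_PARTITIONS = 4

def partition_of(key):
    # Same idea as Spark's HashPartitioner: hash the key, then take a
    # non-negative modulo so equal keys always land in the same partition.
    return hash(key) % NUM_PARTITIONS

pairs = sc.parallelize([(1, "a"), (2, "b"), (1, "c"), (3, "d")])
partitioned = pairs.partitionBy(NUM_PARTITIONS, partition_of)

# Each inner list is one partition; both records with key 1 share a partition.
print(partitioned.glom().collect())
```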
@2chandra
@2chandra 4 years ago
Spark partitioning depends on the number of cores.
@DataSavvy
@DataSavvy 4 years ago
Right... when Spark is writing data, it depends on the cores. What about when Spark is reading a new file?
@2chandra
@2chandra 4 years ago
@DataSavvy Spark normally sets the partitions automatically based on the cluster. However, we can set the partitions manually.
@MrManish389
@MrManish389 4 years ago
@DataSavvy While reading the data: (file size) / (block size, 128 MB). Kindly correct me if I am wrong.
@aneksingh4496
@aneksingh4496 4 years ago
But how will Spark decide which partitioner to choose?
@DataSavvy
@DataSavvy 4 years ago
That depends on the nature of the transformation... you can also force Spark to prefer a certain transformation.
@nandepusrinivas6746
@nandepusrinivas6746 3 years ago
@DataSavvy Can you elaborate on how to force Spark to prefer a transformation? Do we have any docs to dig deeper into that?
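As an illustration of "depends on the transformation" (a sketch, not official documentation): in the RDD API, key-based aggregations such as reduceByKey hash-partition their output by default, while sortByKey range-partitions it; on DataFrames you can force the choice with repartition versus repartitionByRange:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioner-choice").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([(4, 1), (1, 2), (3, 3), (2, 4), (1, 5)], 2)

# reduceByKey shuffles with hash partitioning: the same key always lands in
# the same partition, but keys are not ordered across partitions.
hashed = pairs.reduceByKey(lambda x, y: x + y, numPartitions=2)

# sortByKey samples the keys and range-partitions them, so partition 0 holds
# the smallest keys and partition 1 the larger ones.
ranged = pairs.sortByKey(numPartitions=2)

print(hashed.glom().collect())
print(ranged.glom().collect())
```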
@stevehe5713
@stevehe5713 3 years ago
You didn't explain the context correctly. I think you meant the shuffle partition strategy.
@arupdaw5193
@arupdaw5193 3 years ago
The WhatsApp group is full and it kicked me out of the group.