Partition vs bucketing | Spark and Hive Interview Question

  100,521 views

Data Savvy

Days ago

Comments: 94
@alibinmazi452 4 years ago
Small file problem in Hadoop? In my view, having lots of small files in the cluster increases the burden on the namenode, because the namenode stores the metadata for every file. With lots of small files it has to keep track of the address of each one, and if the master goes down, the cluster goes down with it.
@DataSavvy 4 years ago
That is right... In addition, Spark will also need to create more executor tasks. This creates unnecessary overhead and slows down your data processing.
@saurabhgulati2505 3 years ago
Also if these files are compressed, the executor core will get busy decompressing them.
@tanmaydash803 a year ago
name node ?
@-leaflet a year ago
@@tanmaydash803 otherwise called the Master
@Khang-lt4gk a month ago
Question 1 at 3:15: issues with many small files on Hadoop. (1) Resource utilization: each task processes the data of a single partition, so many small files -> many small partitions -> many tasks required -> many tasks in the queue -> frequent context switching -> heavy load on the driver node (allocating and orchestrating tasks among executors and cores) -> higher possibility of driver OOM. (2) Metadata: the metadata responsible for storing the addresses of the compressed partition files consists of a huge number of key mappings -> low shuffle efficiency for almost every transformation.
@cajaykiran 3 years ago
I must have watched this video at least 5 times between yesterday and today. Thank you very much.
@anujtirkey9867 4 months ago
Same 😂
@sashikiran9 3 years ago
Important point - Hive partitioning is not the same as Spark partitioning. 7:34-9:14
@r.kishorekumar1388 2 years ago
When there are lots of small files in Hadoop, namenode performance can suffer because it cannot process them quickly. Hadoop is built for handling big data, so creating too many small files can end up hurting namenode performance. I came across this problem in my own project.
@bharathraj4545 10 months ago
Hi bro, I am new to big data. Can you guide me further?
@DataSavvy 10 months ago
Hi Bharath, happy to guide you. Drop me an email on aforalgo@gmail.com
@sky-i8d a month ago
Hadoop is generally for big data, so for storage the minimum block size is 128 MB. Having such small files can significantly waste storage, since at least one block is assigned to each file. Please correct me if I am wrong here.
@ShashankGupta347 2 years ago
Crisp & clear, thanks!
@sumit_ks 3 years ago
Very well explained sir.
@DataSavvy 3 years ago
Thanks Sumit :)
@FaizanAli-we5wc a year ago
You are too good, sir. Thank you so much for clearing up our concepts ❤
@vutv5742 11 months ago
Nice explanation ❤ Completed ❤
@DataSavvy 11 months ago
Thanks
@tanushreenagar3116 a year ago
Best explanation
@DataSavvy a year ago
Thanks for liking
@vamshi878 4 years ago
@Data Savvy, I observed on my local system with multiple cores that neither partitionBy nor bucketBy performs any shuffle - there is no Exchange in the plan. Is that why both cases produce small files? Will a shuffle happen on a large cluster? I am just reading from a file and writing with partitionBy or bucketBy, no transformations - in this case, will there be no shuffle even at cluster level?
@khanmujahid4743 3 years ago
It uses the hash value of the search item and goes to the bucket that matches the hash value.
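The lookup described above can be sketched in plain Python. This is only an analogy: Python's built-in hash of small integers stands in for the real bucketing hash (Spark and Hive use their own hash functions), and the column names are invented for illustration:

```python
NUM_BUCKETS = 4

def bucket_for(key):
    # Deterministic bucket id: the same key always lands in the same bucket.
    return hash(key) % NUM_BUCKETS

# "Write" side: distribute rows into buckets by the bucketing column (age).
rows = [("a", 20), ("b", 35), ("c", 20), ("d", 41)]
buckets = {i: [] for i in range(NUM_BUCKETS)}
for name, age in rows:
    buckets[bucket_for(age)].append((name, age))

# "Read" side, with a filter age == 20: hash the literal, then scan
# only the one bucket it maps to, skipping all the others.
target = bucket_for(20)
hits = [r for r in buckets[target] if r[1] == 20]
print(hits)  # [('a', 20), ('c', 20)] -- found without scanning other buckets
```

This is what lets a query on a bucketed column prune its scan to a single bucket file instead of reading the whole table.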
@rakeshdey1702 4 years ago
This is a nice explanation, but you are considering physical partitioning for Hive while using memory-level partitioning for Spark to show the difference in the number of files generated.
@subhajitroy5850 4 years ago
Really appreciate @Data Savvy for the effort. I have a question: can we understand data searching/retrieval in a partitioned table (to create an analogy) the way element retrieval works in a binary tree, and in a partitioned, bucketed table the way search works in a nested binary tree? I am referring to the binary tree from data structures. Recently I followed one mock big data interview video on your channel and liked it a lot. If possible, please upload a few more such videos. Thanks :)
@DataSavvy 4 years ago
Hi Subhajit... Thanks. More mock interviews are planned over the next few weeks. Excuse me, but I did not get your question :(
@subhajitroy5850 4 years ago
@@DataSavvy The way data is retrieved/searched in a partitioned Hive table - can we think of it as, or correlate it with, element retrieval in a binary tree (the binary tree from data structures)? Not sure if this is a better version :)
@raviranjan217 3 years ago
The small file problem is a headache for the namenode, since it has to manage the metadata info. Spark also needs a larger number of executors, which is again an overhead.
@sanketkhandare6430 2 years ago
Excellent explanation, helped a lot.
@prosperakwo7563 4 years ago
Thanks for the great video, very clear explanation
@punpompur a month ago
Wouldn't it be possible for data in buckets to be skewed as well? Does the hash function ensure that each bucket will be the same size?
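Good question. Hashing guarantees the same value always lands in the same bucket, but it does not guarantee equal bucket sizes: if one value dominates the column, its bucket is just as skewed. A small sketch (pure Python, with `value % buckets` standing in for the real hash function and made-up data):

```python
from collections import Counter

NUM_BUCKETS = 4

# A skewed column: 90% of rows share a single value (32).
ages = [32] * 900 + [1, 2, 3] * 30

# Stand-in for hash(value) % NUM_BUCKETS; for small ints hash(n) == n.
sizes = Counter(age % NUM_BUCKETS for age in ages)

print(sorted(sizes.items()))  # bucket 0 holds 900 rows, the rest hold 30 each
```

So bucketing fixes the *number* of output files, not their balance; a heavily skewed bucketing column still produces one oversized bucket.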
@shikhargupta7552 2 years ago
Please keep making more such videos. It would also be great if you could make something on cloud-related big data technologies.
@DataSavvy 11 months ago
Thanks Shikhar, I will plan to create videos on cloud. Do you need videos on any specific cloud topic?
@anikethdeshpande8336 a year ago
Is bucketing not supported with the save() method? It works fine with saveAsTable(), but with save() I get this error: AnalysisException: 'save' does not support bucketBy and sortBy right now.
@ksktest187 3 years ago
Great efforts, keep it up.
@jonathasrocha6480 2 years ago
Is bucketing used when the column has high cardinality?
@saurabhgarud6690 4 years ago
Thanks for a very helpful video. My question: how can we optimise using bucketing? Since bucketed data is shuffled among different buckets rather than sorted, if I use a WHERE condition over a bucketed table, how do I avoid scanning irrelevant buckets, like I do with partitioning? In short, does a WHERE condition get optimised on a bucketed table, and if not, what other optimisations apply to bucketing?
@HemanthKumardigital 2 years ago
Thank you so much sir ☺️ .
@rajlakshmipatil4415 4 years ago
Number of buckets in Spark = size of data / 128 - am I correct? In that case, as above, can't we specify the number of buckets in Spark? In which cases should we go for bucketing and in which for partitioning - can you give some examples?
@DataSavvy 4 years ago
If you use partitioning and it creates small files, then you should consider using bucketing there...
@rajlakshmipatil4415 4 years ago
@@DataSavvy Thanks for answering
@kaladharnaidusompalyam851 4 years ago
I'll tell you one thing here. Partitioning is done based on a column and bucketing is done based on rows (i.e., both concepts split data into multiple pieces, but partitioning by column and bucketing by rows/records). Suppose we have data 1-100: we can bucket it as 1-25 in one bucket, 25-50 in a second bucket, then 50-75 and 75-100 respectively - based on rows. But partitioning is based on a column. For example, if you have a column (population year-wise from 2010-2020), we split the data year-wise into 10 partitions: 2010, 2011, 2012... 2020. If this is 100% correct, please someone comment. Don't feel bad - if I'm wrong, I'll correct it. Thank you.
@DataSavvy 4 years ago
Partitioning and bucketing are both done on a column... the only difference is how the records are grouped. I think your statement is right, but you are viewing these concepts in a more complex way than needed.
@DataSavvy 4 years ago
Thanks Rajlakshmi :)
@anurodhpatil4776 a year ago
excellent
@anandraj2558 4 years ago
Nice explanation. Can you please also cover Hive join examples - map-side join and all the other joins - and performance tuning?
@DataSavvy 4 years ago
Sure, I will create videos on that.
@ayushjain139 4 years ago
How can I find out whether my bucketing was really utilized by the query? Is it visible in the physical plan? Also, I believe that in the case of partitioning + bucketing, both the partition and bucket filters should be in my query?
@kumarsatyachaitanyayedida4717 2 years ago
How do we decide whether a particular column should be used for partitioning or for bucketing?
@vikramrajsahu1962 3 years ago
Can we increase the performance of a Hive query while fetching records, assuming the table is already partitioned?
@uditmittal3816 2 years ago
Thanks for the video. But I have one query: how do we insert data into a bucketed Hive table using Spark? I tried this, but it didn't give the correct output.
@bhavaniv1721 4 years ago
Hi, are you running Spark and Scala training classes?
@rajeshp3323 3 years ago
But what I heard is that in Spark, 1 partition = 1 block size... partitions are not created using a specific column name as in Hive. Now for bucketing in Spark: as you said, 1 bucket should be at least the block size... so does that mean 1 bucket = 1 partition? Then what is the need for bucketing in Spark? I'm confused.
@xxxxxxxxxxa232 2 years ago
Partitioning and bucketing are similar to GROUP BY ... and WHERE value in a range
@bhooshan25 a year ago
very useful
@kketanbhaalerao a year ago
Without partitioning, can we do bucketing directly in Spark?
@sambitkumardash9585 4 years ago
Sir, could you please give one syntactic example comparing Hive partitioning/bucketing with Spark partitioning/bucketing? Also, I couldn't understand the last point of your summary - could you please give some more clarity on it?
@DataSavvy 4 years ago
Let me look into that
@Apna_Banaras 3 years ago
Small file problem in Hadoop? It generates lots of metadata, which increases the burden on the namenode.
@kaladharnaidusompalyam851 4 years ago
Hi Harjeet, I came across a question in my latest interview: what packages do we need when we want to implement Spark?
@DataSavvy 4 years ago
Hi... It depends on what dependencies you are using in your project... Check your sbt file.
@sagarbalai1122 3 years ago
If you already have a project, check the sbt/pom file, but generally you need at least spark-core and spark-sql to start with basic operations.
@selvansenthil1 a year ago
How can we make the bucket size 128 MB, when the partition size would be 128 MB and is further divided into buckets?
@engineerbaaniya4846 a month ago
Is it correct to say that, in Spark, partitioning creates multiple folders while bucketing creates multiple files?
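That is a fair summary: partitioning produces one directory per partition value (Hive-style `column=value` paths), while bucketing produces a fixed number of files within each partition. Here is a minimal pure-Python simulation of the resulting on-disk layout (directory and file names only, with invented data and `age % buckets` standing in for the real hash; the actual writers obviously do much more):

```python
import os
import tempfile

NUM_BUCKETS = 2
rows = [("India", 20), ("India", 35), ("US", 20), ("US", 41)]

root = tempfile.mkdtemp()
for country, age in rows:
    # Partition column -> one folder per distinct value.
    part_dir = os.path.join(root, f"country={country}")
    os.makedirs(part_dir, exist_ok=True)
    # Bucket column -> a fixed number of files inside each folder.
    bucket_file = os.path.join(part_dir, f"part-bucket-{age % NUM_BUCKETS}.txt")
    with open(bucket_file, "a") as f:
        f.write(f"{country},{age}\n")

print(sorted(os.listdir(root)))  # ['country=India', 'country=US']
print(sorted(os.listdir(os.path.join(root, "country=India"))))
# ['part-bucket-0.txt', 'part-bucket-1.txt']
```

A filter on the partition column skips whole folders; a filter on the bucket column skips files within a folder.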
@kaladharnaidusompalyam851 4 years ago
What kinds of problems do we face when there are a lot of small files in Hadoop? My answer: Hadoop is meant for handling a small number of large files, i.e., it can handle big files at a low count. Hadoop won't give efficient results for lots of small files, because there is seek time when reading a record from disk; with lots of small files this adds up and increases system downtime, and moreover the metadata also grows.
@DataSavvy 4 years ago
That's right :) There will be a few more issues - please see the pinned message.
@likithaguntha8105 3 years ago
Can we partition after bucketing?
@routhmahesh9525 3 years ago
How can we decide the number of buckets when, after partitioning, one file is 128 MB, a second is 400 MB, and a third is 200 MB? Kindly answer. Thanks in advance.
@gyan_chakra 2 years ago
Sir, better quality is not available for this video. Please fix it.
@DataSavvy 2 years ago
Hi Bhumitra...I am working on fixing this
@nobinstren3798 4 years ago
Thanks man, it helps.
@DataSavvy 4 years ago
Thanks Nobin. Pleasure... :)
@sandipsawant7525 4 years ago
Thanks for this video. One question: in which kinds of cases do we need to use only bucketing, and how does the query search happen? Thanks again 🙏
@DataSavvy 4 years ago
When partitioning on a column would create small files, use bucketing without partitioning. Also, before doing a sort-merge join you can create bucketed tables and improve the performance of the join.
@sandipsawant7525 4 years ago
@@DataSavvy Thank you sir for the answer. If I use 4 buckets, when I run a SELECT query will it go to only one specific bucket, or will it search all buckets? In partitioning we have folders named by value; in the case of bucketing, how does the query know which bucket to search?
@AtifImamAatuif 4 years ago
@@sandipsawant7525 It will use the hash value of the search item and go to the bucket that matches that hash value.
@sandipsawant7525 4 years ago
@@AtifImamAatuif Thanks
@ayushjain139 4 years ago
@@DataSavvy "Before doing a sort-merge join you can create bucketed tables and improve the performance of the join" - kindly explain how and why?
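To expand on that point in plain Python: if both tables are bucketed on the join key into the same number of buckets, any rows that can ever match already sit in buckets with the same id, so the join can proceed bucket-by-bucket without moving data between buckets - which is the shuffle a sort-merge join would otherwise need. A conceptual sketch only (invented data, `key % buckets` standing in for the real hash; Spark's actual join machinery is far more involved):

```python
NUM_BUCKETS = 3

def bucketize(rows, key_idx):
    # Pre-bucket a table on its join key, as the table writer would have done.
    buckets = {i: [] for i in range(NUM_BUCKETS)}
    for row in rows:
        buckets[row[key_idx] % NUM_BUCKETS].append(row)  # stand-in hash
    return buckets

users  = [(1, "amy"), (2, "bob"), (4, "cal")]
orders = [(1, "pen"), (4, "ink"), (7, "cup")]

ub, ob = bucketize(users, 0), bucketize(orders, 0)

# Join bucket i only with bucket i -- no cross-bucket data movement needed,
# because equal keys always hash to the same bucket id in both tables.
joined = []
for i in range(NUM_BUCKETS):
    for uid, name in ub[i]:
        for oid, item in ob[i]:
            if uid == oid:
                joined.append((uid, name, item))

print(sorted(joined))  # [(1, 'amy', 'pen'), (4, 'cal', 'ink')]
```

When both sides are additionally sorted within each bucket, the per-bucket join can be a cheap merge, which is why pre-bucketed (and sorted) tables speed up repeated sort-merge joins.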
@ramchundi2816 3 years ago
Thanks, Harjeet. It was a great explanation. Quick question for you: what will happen if we remove a partition key after loading the data (in managed and external tables)?
@nikhithapolanki 3 years ago
How can you remove a partition key once the table is created? If you drop and recreate the table without the partition, the data present at the table's physical location cannot be read by the table - it will give a parsing exception.
@dheemanjain8205 11 months ago
Partitioning is the same as GROUP BY, and bucketing is the same as a range.
@DataSavvy 11 months ago
Hi, it's actually different...
@krunalgoswami4654 3 years ago
I like it
@alokdaipuriya4607 3 years ago
Hi Harjeet... Thanks for such an informative video. One quick question: you chose the country column for partitioning, that's fine, and the age column for buckets. Why did you choose the age column for bucketing and not the name column? Can we choose either name or age, or is there some technical reason behind choosing the bucketing column? If yes, please do comment.
@saketmulay8353 2 years ago
It depends on the filter you want to apply. If you want to filter on age but you bucket by name, the problem remains as it is and bucketing won't make any sense.
@GreatIndia1729 2 years ago
If we have a large number of small files, the number of I/O operations - like opening and closing files - increases. This is a performance issue.
@mohitmehta3788 4 years ago
If we want to query the table for country = India and age = 20, now that we have created the new bucketed table, do we query the bucketed table or the initial table? A little lost here.
@DataSavvy 4 years ago
You will query the bucketed table :)
@NN-sw4io 3 years ago
Sir, what if we filter only by age? What happens with the partition and the bucket then?
@Ady_Sr a year ago
The volume of data would increase if we have small files. Volume can mean a lot of small files or a few large files - both are a no-no.
@sivakrishna3413 4 years ago
I want to learn Spark and PySpark. Are you providing any training?
@DataSavvy 4 years ago
Hi Siva... I am not currently pursuing any online training... Let me look into this prospect.