Partitioning and bucketing in Spark | Lec-9 | Practical video

28,749 views

MANISH KUMAR


1 day ago

In this video I have talked about how you can partition or bucket your transformed dataframe onto disk in Spark (a minimal code sketch follows the sample data below). Please do ask your doubts in the comment section.
Directly connect with me on:- topmate.io/man...
Data used in this tutorial:-
id,name,age,salary,address,gender
1,Manish,26,75000,INDIA,m
2,Nikita,23,100000,USA,f
3,Pritam,22,150000,INDIA,m
4,Prantosh,17,200000,JAPAN,m
5,Vikash,31,300000,USA,m
6,Rahul,55,300000,INDIA,m
7,Raju,67,540000,USA,m
8,Praveen,28,70000,JAPAN,m
9,Dev,32,150000,JAPAN,m
10,Sherin,16,25000,RUSSIA,f
11,Ragu,12,35000,INDIA,f
12,Sweta,43,200000,INDIA,f
13,Raushan,48,650000,USA,m
14,Mukesh,36,95000,RUSSIA,m
15,Prakash,52,750000,INDIA,m
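A minimal sketch of the two write paths covered in the video, assuming this sample is saved as /FileStore/tables/employee.csv (paths and table name are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition_bucket_demo").getOrCreate()

emp_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/FileStore/tables/employee.csv")

# partitionBy: one sub-directory per distinct address value on disk
emp_df.write.format("csv") \
    .option("header", "true") \
    .mode("overwrite") \
    .partitionBy("address") \
    .save("/FileStore/tables/partition_by_address/")

# bucketBy: a fixed number of hash buckets on id; requires saveAsTable, not save
emp_df.write.format("csv") \
    .mode("overwrite") \
    .bucketBy(3, "id") \
    .saveAsTable("bucket_by_id")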
For more queries, reach out to me on my social media handles below.
Follow me on LinkedIn:- / manish-kumar-373b86176
Follow Me On Instagram:- / competitive_gyan1
Follow me on Facebook:- / manish12340
My Second Channel -- / @competitivegyan1
Interview series Playlist:- • Interview Questions an...
My Gear:-
Rode Mic:-- amzn.to/3RekC7a
Boya M1 Mic-- amzn.to/3uW0nnn
Wireless Mic:-- amzn.to/3TqLRhE
Tripod1 -- amzn.to/4avjyF4
Tripod2:-- amzn.to/46Y3QPu
camera1:-- amzn.to/3GIQlsE
camera2:-- amzn.to/46X190P
Pentab (Medium size):-- amzn.to/3RgMszQ (Recommended)
Pentab (Small size):-- amzn.to/3RpmIS0
Mobile:-- amzn.to/47Y8oa4 (You should absolutely not buy this one)
Laptop -- amzn.to/3Ns5Okj
Mouse+keyboard combo -- amzn.to/3Ro6GYl
21 inch Monitor-- amzn.to/3TvCE7E
27 inch Monitor-- amzn.to/47QzXlA
iPad Pencil:-- amzn.to/4aiJxiG
iPad 9th Generation:-- amzn.to/470I11X
Boom Arm/Swing Arm:-- amzn.to/48eH2we
My PC Components:-
intel i7 Processor:-- amzn.to/47Svdfe
G.Skill RAM:-- amzn.to/47VFffI
Samsung SSD:-- amzn.to/3uVSE8W
WD blue HDD:-- amzn.to/47Y91QY
RTX 3060Ti Graphic card:- amzn.to/3tdLDjn
Gigabyte Motherboard:-- amzn.to/3RFUTGl
O11 Dynamic Cabinet:-- amzn.to/4avkgSK
Liquid cooler:-- amzn.to/472S8mS
Antec Prizm FAN:-- amzn.to/48ey4Pj

Comments: 104
@MrAnshrockers 1 year ago
Haven't seen better videos for Databricks on YouTube. Your dedication to teaching each topic in depth is commendable, brother. God bless.
@manish_kumar_1 1 year ago
Directly connect with me on:- topmate.io/manish_kumar25
@rajum9478 4 days ago
Something good I have seen this year, thanks bro.
@ChandanKumar-xj3md 1 year ago
Hey Manish, thanks a lot for these tutorials. I have seen almost every data engineering playlist, but the way you explain each topic in such depth is something I really appreciate. Because of your videos I was able to understand the concepts and ultimately cracked my interviews. Thanks a lot, and looking forward to more tutorials like this.
@manish_kumar_1 1 year ago
Congratulations. BTW which company did you crack?
@ChandanKumar-xj3md 1 year ago
@@manish_kumar_1 In the medical domain.
@SHUBHAMKUMAR-zv8ru 1 year ago
@@ChandanKumar-xj3md Congratulations Chandan. By the way, can you please tell the package, and how many years of experience do you have in the data engineering domain?
@shreemanthmamatavbal7468 1 year ago
I have been following your course for the last 10 days. It is awesome.
@PradyutJoshi 9 months ago
This DE playlist is one of the best I have ever seen. Keep up the good work, Manish. Thank you for this 🙌
@ANILKUMAR-in3dp 20 days ago
Bhai, you are an awesome teacher. Keep it up.
@bangalibangalore2404 1 year ago
Please finish the whole Spark series, and then give a short overview of design interviews with some tips and tricks. Or consider the managerial/design round at Walmart or other product companies: some questions, resources, and tips and tricks for that would be a big help. And thank you so much for the Spark series; you are doing great work. Heartfelt thanks.
@manish_kumar_1 1 year ago
Sure
@sandeepnarwal8782 11 months ago
Best video on partitioning on YouTube.
@yogeshsangwan8343 1 year ago
Great.. please continue
@RajvirKumar-n1p 1 year ago
Great bro, you are explaining it well. Keep it up.
@sudipmukherjee6878 1 year ago
Excellent..... please continue...
@vaidhyanathan07 10 months ago
Good job, buddy. You even cover the minute performance-impact details, which is really great. If you could also do this series in English, that would be wonderful.
@rampal4570 1 year ago
Very nice explanation, sir, and thank you so much for taking out your precious time to make this video for us :)
@mohitaggarwal_65 9 days ago
17:47 BUCKETING
@satyamrai2577 1 year ago
Very nice. I am using your playlist to prepare for my next interview.
@khadarvalli3805 1 year ago
Thank you so much Manish ❤, clear-cut explanation.
@anveshkonda8334 4 months ago
Thanks a lot Manish.. 🙇‍♂🙇‍♂
@aadil8409 9 months ago
Bhai, you covered this Aadhaar-card data example in two videos, but in both videos the concept gets cut off midway. Please explain why we divide the Aadhaar data by 10,000, how much data each bucket will hold, and how many buckets we should take when bucketing on the Aadhaar data.
@younevano 3 months ago
Do you understand this better now? If yes, can you please explain?
@HimanshuSingh-yj2wh 1 month ago
Two tables with columns A.ID and B.FID: table A is bucketed on A.ID and table B on B.FID. If we perform a join, the column names are different, so do we still get the bucketing advantage of avoiding the shuffle or not?
@shreemanthmamatavbal7468 1 year ago
Sir, I am confused between partition pruning and bucket pruning. Kindly clarify.
@sahillohiya7658 1 year ago
22:50 What if we repartition it to 5 and we are not able to store all the data on those executors?
@KhaderAliAfghan 1 year ago
You said we should use coalesce to decrease the number of partitions, but while explaining you used repartition? Also, you repartitioned to 5 with 5 buckets. Is it good practice to have the number of partitions equal to the number of buckets?
@manish_kumar_1 1 year ago
Repartition can increase or decrease the number of partitions, but coalesce can only decrease it. Also, coalesce doesn't guarantee evenly distributed data.
@khaderafghan1085 1 year ago
@@manish_kumar_1 And if the number of partitions != the number of buckets, is that fine?
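A small sketch of the repartition/coalesce difference, reusing the emp_df from the sketch under the sample data (partition counts are illustrative):

df20 = emp_df.repartition(20)          # full shuffle; can increase or decrease partitions
print(df20.rdd.getNumPartitions())     # 20, roughly even sizes

df5 = df20.coalesce(5)                 # merges existing partitions; can only decrease
print(df5.rdd.getNumPartitions())      # 5, sizes may be uneven

# Note: df20.coalesce(25) would still leave 20 partitions; use repartition to grow.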
@mnshrm378 1 year ago
Hi Manish, could you please make a video on an introduction speech for someone with 3.5 years of data engineering experience on AWS cloud technology, give one or two examples of a data quality pipeline from IoT to S3, and also include day-to-day activities? Thanks.
@pravinyadav8372 1 year ago
Manish Bhaiya, you did repartition(5).bucketBy(5, id); will this create only 5 buckets? I know repartition is runtime repartitioning, but if we store something after repartitioning, will it affect anything? Please respond. If 200 * 5 = 1000, then 5 * 5 = 25 buckets should be created, right?
@younevano 3 months ago
Do you understand this better now? If yes, can you please explain?
@piyushjain5852 1 year ago
Hi Manish, I couldn't understand how bucketing avoided the shuffling here?
@manish_kumar_1 1 year ago
If the same ids are in the same bucket in both dataframes, then you won't need to shuffle the data.
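A hedged sketch of that idea, assuming both sides are written with the same bucket count and column (dept_df and the table names are hypothetical):

emp_df.write.bucketBy(5, "id").sortBy("id").mode("overwrite").saveAsTable("emp_b")
dept_df.write.bucketBy(5, "id").sortBy("id").mode("overwrite").saveAsTable("dept_b")

t1 = spark.table("emp_b")
t2 = spark.table("dept_b")
t1.join(t2, "id").explain()   # with matching buckets, no Exchange should appear under the join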
@mayanktripathi4u 1 year ago
Thanks for this awesome explanation; it makes sense to me now. However, I am still figuring out whether we can use partitioning and bucketing on the same dataframe. My dataframe has a low-cardinality column, so I used partitioning, but for one key value the record count is too high. For example, using the same employee data set from the video: the partition is on address/country, and for INDIA the record count is in the millions whereas for other countries it is in the thousands, so one partition's data is skewed. Should I use bucketing, or what approach should I use? Please suggest. This question was asked to me in an interview.
@soumyaranjanrout2843 1 year ago
@mayanktripathi4u I think bucketing might be the better approach, because if a specific partition becomes extremely large it will reduce query performance and make operations expensive. Otherwise, if we know the data well, we can look for a different column (or combination of columns) with low cardinality and apply partitionBy on that. We could also apply both partitionBy on "address" and bucketBy on "gender", which may reduce the data size of each partition (but I am not sure whether this approach is better). Correct me if I am wrong, because I am a beginner.
@younevano 3 months ago
@@soumyaranjanrout2843 Did you get more clarity on this now?
@younevano 3 months ago
Did you get more clarity on this now?
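For reference, a sketch of the combination discussed in this thread: partition on the low-cardinality column, then bucket inside each partition so a skewed country is broken up (the bucket count of 8 is illustrative):

emp_df.write.format("parquet") \
    .mode("overwrite") \
    .partitionBy("address") \
    .bucketBy(8, "id") \
    .saveAsTable("employee_part_bucket")
# A skewed partition such as address=INDIA is now split across 8 buckets on id
# instead of accumulating in one oversized set of files.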
@madhubanti123 9 months ago
Hi Manish. How will partitioning and bucketing work for an incremental load? Will it recreate the partitions or buckets depending on the incoming data? For example, the partition address=INDIA is already created, and now new data for INDIA comes in incrementally; what happens to the partition?
@rohitchakravarthi94 3 months ago
I think instead of 'overwrite' mode we are supposed to use 'append'; then the incremental data should get written correctly.
@younevano 3 months ago
@@rohitchakravarthi94 Thanks for this! Where do you study from?
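A sketch of that append path for an incremental load (new_india_df is a hypothetical dataframe holding only the new rows; the path matches the earlier sketch):

new_india_df.write.format("csv") \
    .option("header", "true") \
    .mode("append") \
    .partitionBy("address") \
    .save("/FileStore/tables/partition_by_address/")
# append adds new part-files under the existing address=INDIA directory;
# overwrite would have replaced the data instead.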
@anoopkaur6119 6 months ago
At 22:11, what does '200 tasks' refer to? Does it mean the number of rows or something else?
@vipulbornare34 6 months ago
You can consider it as rows/record also
@younevano 3 months ago
Got clarity on this?
@sanjeev_kumar14 2 months ago
When you run a Spark job, the work is divided into several tasks; check the basics of Spark to understand this.
@nikhilhimanshu9758 1 year ago
Is this what is called the small-file issue? The 200 * 5 = 1000 files that get created?
@akashprajapati6635 1 year ago
Sir, one doubt: where is this data actually written, on the driver or on the executor? 😢
@manish_kumar_1 1 year ago
🤔 Spark only processes the data. The write goes to some storage system like S3, HDFS, local disk, a server, etc.
@akashprajapati6635 1 year ago
@@manish_kumar_1 Sir, which Jio branch are you working at now? 🙄
@manish_kumar_1 1 year ago
@@akashprajapati6635 Gurgaon
@praveenkumarrai101 1 year ago
Bhai, if we partition an original file in HDFS, then we have the 1 full file and 3 other files due to the partitioning, meaning the same data exists twice in HDFS. Is that the concept?
@younevano 3 months ago
This is what I've found: when we repartition data in HDFS, the original data remains intact while a new partitioned dataset is created. This means that, temporarily, the data does exist in two places: the original files and the newly repartitioned files. Once we confirm that the new partitioned data is correct, we can delete the original data if it is no longer needed; otherwise, Spark clears the original files from memory once they are no longer used or referenced. You can then write whichever version you want to disk.
@jay_rana 10 months ago
Hi Manish, I did not understand the point where you said repartition(5) and bucketBy(5) will produce 5 files; in the previous examples you said it would generate 200 * 5 files. Can you please explain this?
@younevano 3 months ago
He said the 200 * 5 = 1000 files can be avoided by calling repartition(5) first and bucketBy(5) next, which will produce only 5 files/buckets.
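A sketch of that file-count arithmetic. Repartitioning on the bucket column itself aligns write tasks with buckets, which is what keeps the count at 5 (values are illustrative):

# Without repartitioning first: up to (number of write tasks) x (number of buckets)
# files, e.g. 200 shuffle tasks x 5 buckets = 1000 small files.
emp_df.repartition(5, "id").write \
    .bucketBy(5, "id") \
    .mode("overwrite") \
    .saveAsTable("bucket_by_id_5")
# 5 tasks, each holding exactly one bucket's ids, write 5 files in total.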
@shaasif 11 months ago
Hi Manish, how can we parse a source file that contains two delimiters, like | (pipe) and \t (tab)? Please explain this topic as well, thanks.
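Not covered in the video, but one hedged way to handle this: read each line as raw text and split on a regex that matches either delimiter (the path and column names are illustrative):

from pyspark.sql.functions import split, col

raw = spark.read.text("/FileStore/tables/mixed_delims.txt")     # single 'value' column per line
fields = raw.select(split(col("value"), r"[|\t]").alias("f"))   # split on | or tab
parsed = fields.select(
    col("f")[0].alias("id"),
    col("f")[1].alias("name"),
    col("f")[2].alias("salary"),
)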
@trex2498 2 months ago
26:48 I think it should be 1 crore?
@hankeepankee5361 8 months ago
The dataframe read back from a partition doesn't have the column that the data was partitioned on. Is this normal?
@lucky_raiser 1 year ago
Bro, I am confused about 200 tasks and 5 buckets causing 1000 buckets. What do you mean by 200 tasks here?
@manish_kumar_1 1 year ago
Watch the stages and tasks video.
@chandanbhardwaj6723 1 year ago
One interview question: demonstrate how we can perform a Spark broadcast join? Not sure how to do this.
@manish_kumar_1 1 year ago
I will teach it while covering joins.
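Until then, a minimal sketch: a broadcast join wraps the small side in broadcast() so it is shipped to every executor and the large side is never shuffled (the table names are hypothetical):

from pyspark.sql.functions import broadcast

big_df = spark.table("transactions")       # hypothetical large table
lookup_df = spark.table("country_codes")   # hypothetical small lookup table

joined = big_df.join(broadcast(lookup_df), "country_id")
joined.explain()   # plan should show BroadcastHashJoin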
@xx-pn7it 2 months ago
What is bucketing in Hive?
@navjotsingh-hl1jg 1 year ago
Sir, I understood that there were no females for JAPAN, but why are the JAPAN males not showing?
@vikeshdas9630 1 year ago
Sir, please hold a live doubt-clearing session.
@manish_kumar_1 1 year ago
Absolutely.
@lifewithkunbh 1 month ago
The .partitionBy("column_name") method in Spark does not create logical (in-memory) partitions; it lays the data out as physical directories on disk.
@divyanshusingh3966 4 months ago
Are those 1000 files or tasks?
@villageviseshalu988 1 year ago
Bhayya, you said that with 200 tasks, if we want 5 buckets it will be 200 × 5 = 1000. My doubt here is: won't it take 40 records in each bucket?
@manish_kumar_1 1 year ago
The number of records may or may not be the same in each bucket. Based on the pmod hash, each record is sent to its respective bucket.
@rameshbayanavenkata1305 1 year ago
@@manish_kumar_1 Sir, please explain with an example where the number of records is not the same in each bucket.
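A sketch that makes the imbalance visible: compute each row's bucket with essentially the same pmod(hash(...)) expression Spark's bucketing uses, then count per bucket (using the 15-row sample from the description):

from pyspark.sql.functions import expr

emp_df.withColumn("bucket", expr("pmod(hash(id), 3)")) \
      .groupBy("bucket").count() \
      .show()
# Nothing forces the hashes to spread evenly, so with 15 ids the counts
# typically come out unequal, e.g. 6/5/4 rather than exactly 5/5/5.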
@rp-zf3ci 1 year ago
After repartition(5), 5 * 5 = 25 buckets should get created, right?
@manish_kumar_1 1 year ago
Do you have 5 executors?
@praveenkumarrai101 1 year ago
Bhai, I didn't understand this 200-task concept. What does 200 tasks mean, in what sense?
@manish_kumar_1 1 year ago
By default, Spark creates 200 partitions whenever data shuffling is involved, such as a join, repartitioning, or a group by. Each row is moved into one of those partitions based on pmod and murmur3 hashing.
@praveenkumarrai101 1 year ago
@@manish_kumar_1 OK bro, thanks.
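The 200 here is just the default value of one config, which can be checked or changed:

print(spark.conf.get("spark.sql.shuffle.partitions"))   # '200' by default
spark.conf.set("spark.sql.shuffle.partitions", "50")    # later shuffles now create 50 tasks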
@Rakesh-if2tx 1 year ago
Bhai, how many topics/videos are remaining in this Spark series before it is complete?
@manish_kumar_1 1 year ago
20
@narag9802 1 year ago
Do you have an English version of the lessons?
@vikeshdas9630 1 year ago
Sir, what is the difference between PySpark and Pandas in Apache Spark?
@manish_kumar_1 1 year ago
Pandas doesn't work in a distributed manner, whereas PySpark does.
@SHUBHAMKUMAR-zv8ru 1 year ago
If you are dealing with a large dataset, PySpark is better.
@saumyasingh9620 1 year ago
What is repartition? I have seen you using repartition(3).
@manish_kumar_1 1 year ago
Repartition(3) means your data will be divided into 3 parts. Earlier it may have had 20 parts, but repartition(3) makes sure that you have 3 partitions of almost the same size, with no skewness in the data.
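A quick sketch of that, reusing the employee dataframe (the partition count before the call depends on the input):

print(emp_df.rdd.getNumPartitions())    # whatever the read produced
df3 = emp_df.repartition(3)             # full shuffle into 3 roughly equal parts
print(df3.rdd.getNumPartitions())       # 3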
@bangalibangalore2404 1 year ago
Why does it throw an error when we write the partitioned files to the same location?
@manish_kumar_1 1 year ago
Does it still happen even with overwrite mode?
@Mdkaleem__ 1 year ago
Bhai, I'm getting the error "AnalysisException: Partition column `address` not found in schema struct" even though I tried loading the file again... bucketBy simply works, but partitionBy does not.
@manish_kumar_1 1 year ago
Share the code and schema in the comment section or on LinkedIn.
@vikeshdas9630 1 year ago
Sir, if a dataframe is only for structured and semi-structured data, then what should we use in the case of unstructured data?
@manish_kumar_1 1 year ago
Either go through an RDD, or convert your unstructured data into semi-structured or structured form by finding some pattern in it.
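A hedged sketch of the second option: read unstructured lines as plain text and pull a pattern out into columns (the log path and regexes are illustrative):

from pyspark.sql.functions import regexp_extract, col

logs = spark.read.text("/FileStore/tables/app.log")
structured = logs.select(
    regexp_extract(col("value"), r"^(\S+)", 1).alias("level"),
    regexp_extract(col("value"), r"(\d{4}-\d{2}-\d{2})", 1).alias("event_date"),
)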
@nikhilrokade6025 4 days ago
Bhai, every time you forget to mention some step or the other. For the last 2 hours I have been trying to fix that piece of the df and it still isn't fixed. Please structure it properly, yaar.
@sauravojha6345 1 year ago
Bhai, where in Bihar are you from?
@manish_kumar_1 1 year ago
Patna
@dipakchavan4659 9 months ago
AttributeError: 'NoneType' object has no attribute 'Write'
How can I solve this error?
df.write.format("csv")\
.option("header","true")\
.option("mode","overwrite")\
.option("path","/FileStore/tables/partition_by_address/")\
.partitionBy("address")\
.save()
@manish_kumar_1 9 months ago
Because you have called .show() on your df and kept the result. After .show(), your df becomes None, and that is why you get the NoneType error.
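A sketch of the fix: keep the dataframe and call .show() separately instead of assigning its return value (the path matches the earlier sketches):

# Wrong: .show() returns None, so df becomes NoneType
df = spark.read.format("csv").option("header", "true") \
    .load("/FileStore/tables/employee.csv").show()

# Right: keep the dataframe; show() is just a side effect
df = spark.read.format("csv").option("header", "true") \
    .load("/FileStore/tables/employee.csv")
df.show()
df.write.format("csv").mode("overwrite") \
    .partitionBy("address") \
    .save("/FileStore/tables/partition_by_address/")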
@abinashpandit5297 1 year ago
Why do I keep getting "object has no attribute 'write'" every time I write data?
@manish_kumar_1 1 year ago
Send some code snippet. I have never gotten this error.
@Watson22j 1 year ago
Bhaiya, when will the next video come?
@manish_kumar_1 1 year ago
It will take some time.
@Watson22j 1 year ago
@@manish_kumar_1 By when do you plan to finish this Spark series?
@chandanpatra1053 9 months ago
Bhai, explain in more detail; it is not getting through to me. You did explain by writing code, but I still could not follow. First, you never explained why we need optimization techniques at all. When we are talking about terabytes/petabytes of data, why should we optimize at that point? Databricks is providing the computing resources, and Spark was built to handle large data in the first place, so why optimization? Nowhere in the video did you show what problem arises if we read x/y terabytes of data without partitioning/bucketing, and what changes if we use the partitioning/bucketing concept. That is my feedback. I have read articles on Medium and LinkedIn and also the Guide to Spark book, but I still did not understand. When I searched YouTube for transformations & actions in Spark, many channels came up, including yours, and I watched them one by one, but nobody has explained this tricky Spark concept in a simple, detailed way, you included. Please make one more video if possible; let it take 1 hr/2 hr/3 hr, but cover actions/transformations in such detail that no YouTuber has ever matched it.
@biswabrataguharoy8578 3 months ago
Arrey, what are you even trying to say, bhai??? Finish the theory and the practicals completely and then everything will make sense... However much processing power Spark has, why would you spend it unnecessarily if you can save it by optimizing????? Ultimately, every company's goal is to save money and reduce overhead so that profit goes up... so what kind of question is "why optimization"??? If someone has billions, does that mean they should waste it for no reason??? Don't billionaires do financial planning? 🤣🤣
@rahulmadan9805 10 months ago
Hi Manish Ji, I executed the code below:
reading_file_for_write_df.repartition(3).write.format("csv").mode("overwrite").option("header","true").option("path","/FileStore/tables/bucketby_id/").bucketBy(3,"id").saveAsTable("bucket_by_id")
I expected it to split the data into 3 part files, but I get 7 part files. You said that when we have multiple task ids it creates multiple part files in that case. Without giving a repartition value it gives 3 bucket files, but with repartition(3) it gives 7. Why is that, sir? Could you please explain more on this? Also, is .mode("overwrite") the correct way to pass it? I think it was given wrong by mistake in the video; could you please confirm? Thanks.
Output: 7 part files under dbfs:/FileStore/tables/bucketby_id/, all sharing tid-2977452082670177534-05219f91-66cc-4958-bbcc-b89d4d251389:
part-00000-...-52-1_00000.c000.csv (89 bytes)
part-00000-...-52-2_00001.c000.csv (87 bytes)
part-00000-...-52-3_00002.c000.csv (59 bytes)
part-00001-...-53-1_00000.c000.csv (106 bytes)
part-00001-...-53-2_00002.c000.csv (90 bytes)
part-00002-...-54-1_00000.c000.csv (143 bytes)
part-00002-...-54-2_00001.c000.csv (60 bytes)