Spark Interview Question | Bucketing | Spark SQL

14,353 views

TechWithViresh


1 day ago

Comments: 29
@vishalaaa1 · 1 year ago
nice
@eknathsatish7502 · 3 years ago
Excellent..
@TechWithViresh · 3 years ago
Thanks :)
@SpiritOfIndiaaa · 2 years ago
Can you please share the notebook URL? Thanks a lot, really great learnings.
@bhushanmayank · 3 years ago
How does Spark know that the other table in the join is bucketed identically on the join column?
@gauravbhartia7543 · 4 years ago
Nicely Explained.
@TechWithViresh · 4 years ago
Thanks :)
@aashishraina2831 · 3 years ago
Excellent
@TechWithViresh · 3 years ago
Thanks :)
@RAVIC3200 · 4 years ago
Great content again, Viresh. Can you make a video on the scenarios interviewers usually ask? For example: 1) If you have a 1 TB file, how much time does it take to process (use any standard cluster configuration to explain), and how does that change if the file is 500 GB? 2) DAG-related scenario questions. 3) If a Spark job fails in the middle, will it start from the beginning when you re-trigger it? If not, why not? 4) Checkpoint-related questions. Covering these scenarios in one video would be really helpful. Thanks again for such videos.
@TechWithViresh · 4 years ago
Thanks, don’t forget to subscribe.
@RAVIC3200 · 4 years ago
@@TechWithViresh I'm your permanent viewer 🙏🙏
@cajaykiran · 3 years ago
Is there any way I can reach out to you to discuss something important?
@TechWithViresh · 3 years ago
Send the details at techwithviresh@gmail.com.
@dipanjansaha6824 · 4 years ago
1. When we write files directly to ADLS, how does bucketing help? 2. Also, is it a correct understanding that bucketing is good when we use a DataFrame for read purposes only? As I understand it, if there is a use case where a write happens on every build, bucketing would not be the best approach.
@TechWithViresh · 4 years ago
Yes, bucketing is more effective for reusable tables involved in heavier joins.
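To illustrate the reply above, a minimal sketch of bucketing two tables for a reusable join; the DataFrames, table names, bucket count, and the `customer_id` column are all hypothetical, not from the video:

```scala
// Write both sides of the join bucketed on the join key, with the
// SAME bucket count and column, so Spark can co-locate matching keys
// and avoid the shuffle (Exchange) at join time.
ordersDf.write
  .bucketBy(4, "customer_id")     // 4 buckets, hashed on the join column
  .sortBy("customer_id")
  .saveAsTable("orders_bucketed")

customersDf.write
  .bucketBy(4, "customer_id")     // must match the other table's bucketing
  .sortBy("customer_id")
  .saveAsTable("customers_bucketed")

// A later join of the two bucketed tables on customer_id can then
// skip the shuffle step, which is where the reuse pays off.
val joined = spark.table("orders_bucketed")
  .join(spark.table("customers_bucketed"), "customer_id")
```

Note that `bucketBy` only works with `saveAsTable` (a metastore table), not with a plain path-based `save`, which also relates to the ADLS question above.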
@cajaykiran · 3 years ago
Thank you
@gunishjha4030 · 3 years ago
Great content! You used bucketBy in the Scala code; can you tell us how to handle the same in Spark SQL? Is there a clause or function we can use in Spark SQL for this?
@gunishjha4030 · 3 years ago
Found it, thanks anyway: PARTITIONED BY (favorite_color) CLUSTERED BY (name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;
@mdfurqan · 1 year ago
@@gunishjha4030 But are you able to insert data into the bucketed table using Spark SQL when the underlying storage is Hive?
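For context on the thread above: the CLUSTERED BY fragment the commenter quotes belongs inside a full CREATE TABLE statement. A sketch of the complete DDL, using the same illustrative column names (the table name is made up):

```sql
-- Spark SQL DDL equivalent of DataFrameWriter.bucketBy/sortBy.
-- With the USING (datasource) syntax, the partition column appears
-- in the column list and is referenced by PARTITIONED BY.
CREATE TABLE users_bucketed_and_partitioned (
  name             STRING,
  favorite_color   STRING,
  favorite_numbers ARRAY<INT>
) USING parquet
PARTITIONED BY (favorite_color)
CLUSTERED BY (name) SORTED BY (favorite_numbers) INTO 42 BUCKETS;
```

Note this creates a Spark-native bucketed table; Spark's bucketing layout is not the same as Hive's bucketing, which bears on the Hive-storage question above.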
@sachink.gorade8209 · 4 years ago
Hello Viresh sir, nice explanation. One thing I did not understand: where are the 8 partitions for these two tables created? I could not find any code for it in the video. Could you please explain?
@TechWithViresh · 4 years ago
8 is the default number of partitions (round robin) created for the cluster used here, which has 8 nodes.
@mateen161 · 4 years ago
Nice explanation! Just wondering how the number of buckets should be decided. In this example you used 4 buckets; couldn't we use 6, 8, or 10? Is there a specific reason for using 4 buckets?
@TechWithViresh · 4 years ago
It can be any number, depending on your data and the bucket column.
@himanshusekharpaul476 · 4 years ago
Hey, nice explanation, but I have one doubt: in the video you set the number of buckets to 4. What criteria should we keep in mind while deciding the number of buckets in a real project? Is there a formula or a bucket size constraint? Could you please help?
@TechWithViresh · 4 years ago
The idea behind both data distribution techniques, partitioning and bucketing, is to distribute the data evenly and in an optimum size that can be processed effectively by a single task.
@himanshusekharpaul476 · 4 years ago
Ok. What is the optimum bucket size that can be processed by a single task?
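There is no official Spark-enforced answer to the question above; a common rule of thumb (an assumption, not a hard limit) is to size buckets at roughly one task's comfortable input, on the order of 128 MB to a few hundred MB. The arithmetic behind that heuristic can be sketched as:

```python
import math

def estimate_bucket_count(total_size_bytes, target_bucket_bytes=128 * 1024 * 1024):
    """Rough heuristic: one bucket per ~128 MB of table data.

    Both the 128 MB default target and rounding up to a whole bucket
    are assumptions (a common rule of thumb), not a Spark rule.
    """
    return max(1, math.ceil(total_size_bytes / target_bucket_bytes))

# A hypothetical 10 GB table with ~128 MB per bucket:
print(estimate_bucket_count(10 * 1024**3))  # 80
```

In practice you would also round toward a bucket count that keeps files from being too small and, for joins, match the count on both tables.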
@aashishraina2831 · 3 years ago
I think this video is repeated above; it can be deleted.