Advancing Spark - Data Lakehouse Star Schemas with Dynamic Partition Pruning!

10,456 views

Advancing Analytics

4 years ago

Hot on the heels of last week's Spark & AI Summit announcement, Simon is digging into the new features of Spark 3.0. In this episode, we're looking at Dynamic Partition Pruning, which should dramatically speed up queries over partitioned data!
Not sure about partitioning? Don't know why you should care? Watch now!
Don't forget to like & subscribe for more sparky goodness, and check out the Advancing Analytics blog for more content! www.advancinganalytics.co.uk/...
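For a feel for the kind of query DPP targets, here's a minimal, self-contained sketch with synthetic tables (not the notebook from the video; writing Delta assumes Databricks or the delta-spark package):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A small date dimension and a large fact table, both synthetic.
    dim = spark.range(365).selectExpr(
        "id AS date_key",
        "date_add('2020-01-01', CAST(id AS int)) AS calendar_date",
    )
    fact = spark.range(1000000).selectExpr(
        "id % 365 AS date_key",
        "rand() * 100 AS amount",
    )

    # Partition the fact table on the join key - DPP only has something to
    # prune when the partition column is the key you join on.
    fact.write.format("delta").partitionBy("date_key").mode("overwrite").saveAsTable("sales_fact")
    dim.write.format("delta").mode("overwrite").saveAsTable("date_dim")

    # The filter sits on the dimension; at runtime Spark 3.0 broadcasts the
    # surviving date_key values and skips fact partitions that can't match.
    q = spark.sql("""
        SELECT f.amount, d.calendar_date
        FROM sales_fact f
        JOIN date_dim d ON f.date_key = d.date_key
        WHERE d.calendar_date BETWEEN '2020-06-01' AND '2020-06-07'
    """)
    q.explain()  # look for 'dynamicpruning' in the sales_fact PartitionFilters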

Comments: 19
@krishnakomanduri9432 3 years ago
Hey! I have been watching YouTube videos for ages, but this is the first time I'm commenting on one. Your content is awesome and I mean it! Just buy a professional mic and an HD camera, and never stop making videos like this one. I'd love to see more practical demonstrations on your channel. Good job!
@ConnorRoss311 4 years ago
Great video! Can't wait for the git project either!
@loganboyd 3 years ago
Really like your videos. YouTube is NOT the best source for good, detailed Spark content, but watching videos is better than reading :) We are moving from an HDP platform to Cloudera CDP in the next couple of months, and Spark 3.0 and its new features look cool and should be very helpful. Am I understanding the DPP feature correctly if I say it only provides partition pruning when these two things are true:
1. you have a predicate on a column of a smaller dimension table that is joined to a larger fact table, and
2. the join key on the fact table side is an existing partition column?
@AdvancingAnalytics 3 years ago
Hey Logan - yep, I believe that's correct. This means you'll need to have tied your partitioning strategy to a foreign key of some sort to get maximum benefit from this approach, otherwise you'll never be hitting it... that said, I'm now questioning myself - I'll have a quick play over the next couple of days and confirm that it's only when the join key is your partition column. Lemme get back to you with a definitive!
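In the meantime, here's a rough way to check it yourself (a sketch reusing illustrative table names - a fact partitioned on date_key joined to a date dimension):

    # DPP is on by default in Spark 3.0; setting it explicitly makes the test clear.
    spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

    plan = spark.sql("""
        SELECT f.*
        FROM sales_fact f                           -- partitioned on date_key
        JOIN date_dim d ON f.date_key = d.date_key
        WHERE d.calendar_date = '2020-06-01'
    """)
    plan.explain(True)
    # If DPP fired, the sales_fact scan shows a 'dynamicpruning#...' expression
    # in its PartitionFilters; join on a non-partition column instead and it
    # disappears, leaving a full scan of the fact table.

Simon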
@karol2614 2 years ago
@@AdvancingAnalytics Do you have an answer to Logan's question?
@divyanshjain6679 3 years ago
Hi! I have gone through the AQE video and found it very interesting. Coming to DPP, I'm totally new to Delta Lake and don't know much about the concept. Can you please share the block of code you used to load data into the Delta table? Also, which Databricks dataset did you load, as I can see multiple folders inside the "nyctaxi" dataset? Thanks!
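For anyone after the same thing, something along these lines works as a sketch (not the exact notebook code - the path is the standard Databricks sample dataset, and column names vary between the nyctaxi files):

    from pyspark.sql import functions as F

    # Yellow cab trip data from the built-in Databricks samples.
    df = (spark.read.format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/databricks-datasets/nyctaxi/tripdata/yellow/"))

    # Derive a date column and use it as the Delta partition key.
    (df.withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))
       .write.format("delta")
       .partitionBy("pickup_date")
       .mode("overwrite")
       .saveAsTable("nyc_taxi_trips"))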
@karol2614 2 years ago
What is the best partitioning strategy for a star schema warehouse? The facts in this structure are big and joined to a large number of dimensions - partition for one join path, and queries that come in via another key will be suboptimal.
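One common compromise on Databricks is to partition the fact on the single most-filtered key (almost always date) and Z-order the remaining join keys, so Delta's data skipping still helps when queries come in via the other dimensions. A sketch, with illustrative key names:

    # Partition on the dominant access path (e.g. a date key), then cluster
    # the file layout by the other join keys. OPTIMIZE/ZORDER is Databricks
    # Delta functionality; customer_key/product_key are made-up examples.
    spark.sql("OPTIMIZE sales_fact ZORDER BY (customer_key, product_key)")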
@gardnmi 2 years ago
I'm not sure if there have been updates to how Spark handles data partitioning since this video, but when I tried your example on a Delta table, it actually managed to filter the date partition using the calculated date field within the fact table (see below). However, when I tested it with a non-date partitioned dimension, such as organization_id, filtering on organizational_name was not able to prune the partitions, so the dynamic partition pruning join against the organizational_dim table outperformed the filter on the fact table.

PartitionFilters: [isnotnull(service_from_date#299881), (date_format(cast(service_from_date#299881 as timestamp), y...,
@mohitsanghai5455 3 years ago
Great video... just a few questions: you applied a filter on the Date dimension table, and Spark filters the data, converts it into a hash table and broadcasts it. At the same time it applies partition pruning on the Sales fact table and only picks up the required records. Does it broadcast those records as well? Is that what the subquery broadcast means? And what if the filtered data is also huge - will Spark still broadcast it, or use a SortMerge join in that case?
@AdvancingAnalytics 3 years ago
Broadcast join just means that one of the two joining tables is small enough to be broadcast. So if one side of the join is huge, each worker will only have the RDD blocks it needs, but it will pull a whole copy of the smaller table onto each worker so that it can satisfy all joins. If both sides of the query are huge then yeah, it'll revert to a SortMerge etc, but at least it will still have pushed the partition filter back down to the file system.
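The size cut-off and the join strategy are both easy to poke at - a sketch, using the same illustrative tables as above:

    from pyspark.sql.functions import broadcast

    # Spark auto-broadcasts below this threshold (default ~10MB);
    # set it to -1 to disable auto-broadcast entirely.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

    fact = spark.table("sales_fact")
    dim = spark.table("date_dim")

    # Hint it explicitly and compare the plans.
    fact.join(broadcast(dim), "date_key").explain()   # BroadcastHashJoin
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    fact.join(dim, "date_key").explain()              # SortMergeJoin

Simon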
@mohitsanghai5455 3 years ago
@@AdvancingAnalytics Thanks for clearing up some doubts... but what is the subquery broadcast?
@EvgenAnufriev A year ago
Could you share your opinion on whether the Data Vault methodology is a good fit for implementation with Databricks Spark and/or Spark Streaming (Azure cloud) and Delta tables? Data size is in the tens of GBs/TBs.
@ravisamal3533 3 years ago
Hey, can you index your Spark videos playlist?
@adrestrada1 3 years ago
Hi Simon, do you have a GitHub so we can start following you?
@AdvancingAnalytics 3 years ago
Not really! I have a git account for slides/demos from conference talks, but the examples on YouTube are all very quick & hardcoded to my env. We're looking at ways of sharing the notebooks in a more sustainable way!
@flixgpt 3 years ago
Your accent is irritating and even the subtitles can't pick it up... it's hard to follow.
@Advancing_Terry 3 years ago
If you press ALT+F4, YouTube will change the accent. It's a cool feature.
@bittu007ize 2 years ago
@@Advancing_Terry awesome feature
@curiouslycally A month ago
your comment is irritating
Advancing Spark - Bloom Filter Indexes in Databricks Delta
24:41
Advancing Analytics
8K views
Behind the Hype - The Medallion Architecture Doesn't Work
21:51
Advancing Analytics
26K views
Dynamic Partition Pruning in Apache Spark
9:32
Learning Journal
13K views
What is Dynamic Partition Pruning in Spark
13:18
BigData Thoughts
4K views
Advancing Spark - Give your Delta Lake a boost with Z-Ordering
20:31
Advancing Analytics
27K views
Advancing Spark - Rethinking ETL with Databricks Autoloader
21:09
Advancing Analytics
26K views
Advancing Spark - Understanding the Spark UI
30:19
Advancing Analytics
50K views
Optimising Geospatial Queries with Dynamic File Pruning
24:59
Advancing Spark - Databricks Delta Streaming
20:07
Advancing Analytics
28K views
Databricks, Delta Lake and You
48:02
SQLBits
19K views