Advancing Spark - Data Lakehouse Star Schemas with Dynamic Partition Pruning!

Рет қаралды 10,737

Күн бұрын

Пікірлер: 19

@krishnakomanduri9432 4 жыл бұрын

Hey! I have been watching KZbin videos since ages but this is the first time I am commenting on a KZbin video. Your content is awesome and I mean it! Just buy a professional mic and a HD camera and never stop making videos like this one. I'd love to see more practical demonstrations on your channel. Good job!

@ConnorRoss311 4 жыл бұрын

Great video! Can’t wait either for git project

@gardnmi 3 жыл бұрын

I'm not sure if there have been updates to how spark handles data partitioning since this video but when I tried out your example on a delta table it actually managed to filter the date partition using the calculated date field within the fact table (See below). However, when I tested it with a non date dimension that was partitioned such as organization_id and filtering for organizational_name it was not able to filter the partitions so the dynamic partitioning join with a organizational_dim table outperformed the filter in the fact table. PartitionFilters: [isnotnull(service_from_date#299881), (date_format(cast(service_from_date#299881 as timestamp), y...,

@divyanshjain6679 3 жыл бұрын

Hi! I have gone through with AQE video & found it very interesting. Coming to DPP, I'm totally new to Delta Lake n don't know much about the concept. Can u please share the block of code you have used to load data to Delta Table. Also, which databricks datset have u loaded as I can see multiple folders inside "nyctaxi" dataset. Thanks

@loganboyd 4 жыл бұрын

Really like your videos. KZbin is NOT the best source for good detailed Spark content but watching videos is better than reading :) We are moving to Cloudera CDP from a HDP platform in the next couple of months. Spark 3.0 and it's new features look cool and should be very helpful. Am I understanding the DPP feature correctly if I said, it's only going to provide partition pruning when these two things are true: 1. you have a predicate on a column on a smaller dimension table that is joined to a larger fact table 2. the join key from the fact table side is an existing partitioned column

@AdvancingAnalytics 4 жыл бұрын

Hey Logan - yep, I believe that's correct. This means you'll need to have tied your partitioning strategy to a foreign key of some sort to get maximum benefit from this approach, otherwise you'll never be hitting it... that said, I'm now questioning myself, I'll have a quick play over the next couple of days and confirm that it's only when the join key is your partition column. Lemme get back to you with a definitive! Simon

@karol2614 2 жыл бұрын

@@AdvancingAnalytics Do you have any answer to question Logan ?

@karol2614 2 жыл бұрын

what is the best partitioning strategy for star schema warehouse? There are big facts in this structure that are related to a large number of dimensions - partitioning after one connection, queries will be suboptimal when using another key.

@EvgenAnufriev 2 жыл бұрын

Could you share your opinion on if the Data Vault methodology is good for implementing it using Databricks Spark and/or. Spark Streaming (Azure Cloud), Delta tables? Data size is in tens of GB/ TBs

@mohitsanghai5455 3 жыл бұрын

Great Video...Just have few questions - u applied filter on dimension table Date and spark filters the data, converts it into hash table and broadcast it. At the same time it applied partition pruning on Fact table Sales and only pick up the required records. Does it broadcast those records as well? Does subquery broadcast means those records? What if the filtered data is also huge? Will Spark still broacast it? or use SortMerge Join in that case.

@AdvancingAnalytics 3 жыл бұрын

Broadcast join just means that one of the two joining tables is small enough to be broadcast. So if one side of the join is huge, each worker will only have the rdd blocks it needs, but it will pull a whole copy of the smaller table onto each worker so that it can satisfy all joins. If both sides of the query are huge, then yeah it'll revert to a SortMerge etc, but at least it will still have pushed the partition filter back down to the file system Simon

@mohitsanghai5455 3 жыл бұрын

@@AdvancingAnalytics Thanks for clearing some doubts...But what was subquery broadcast ?

@ravisamal3533 3 жыл бұрын

Hey can you index your spark videos playlist

@adrestrada1 3 жыл бұрын

Hi Simon, Do you ve Github to start following you! ?

@AdvancingAnalytics 3 жыл бұрын

Not really! I have a git account for slides/demos from conference talks, but the examples on KZbin are all very quick & hardcoded to my env. We're looking at ways of sharing the notebooks in a more sustainable way!