Hey! I have been watching YouTube videos for ages, but this is the first time I am commenting on one. Your content is awesome and I mean it! Just buy a professional mic and an HD camera and never stop making videos like this one. I'd love to see more practical demonstrations on your channel. Good job!
@ConnorRoss311 4 years ago
Great video! Can't wait for the Git project either.
@gardnmi 3 years ago
I'm not sure if there have been updates to how Spark handles data partitioning since this video, but when I tried your example on a Delta table it actually managed to filter the date partition using the calculated date field within the fact table (see below). However, when I tested it with a non-date partitioned dimension such as organization_id, filtering on organizational_name, it was not able to prune the partitions, so the dynamic partition pruning join with the organizational_dim table outperformed the filter on the fact table. PartitionFilters: [isnotnull(service_from_date#299881), (date_format(cast(service_from_date#299881 as timestamp), y...,
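For anyone who wants to reproduce this, here is a minimal PySpark sketch of the kind of query that surfaces those PartitionFilters. It is not the commenter's actual code: the table paths and column names (fact_claims, dim_date, calendar_date, fiscal_month) are assumptions made for illustration.

```python
# Hedged sketch: join a Delta fact table partitioned on service_from_date to a
# date dimension, filter on the dimension side only, then inspect the fact
# scan for PartitionFilters / dynamic pruning. Paths and columns are assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

fact = spark.read.format("delta").load("/mnt/lake/fact_claims")   # partitioned by service_from_date
dates = spark.read.format("delta").load("/mnt/lake/dim_date")

joined = (
    fact.join(dates, fact["service_from_date"] == dates["calendar_date"])
        .where(dates["fiscal_month"] == "2020-06")                 # predicate on the dimension only
)

# The fact-table scan in the physical plan should list the pushed-down
# PartitionFilters (and a dynamicpruning subquery when DPP kicks in).
joined.explain(True)
```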
@divyanshjain6679 3 years ago
Hi! I have gone through the AQE video and found it very interesting. Coming to DPP, I'm totally new to Delta Lake and don't know much about the concept. Could you please share the block of code you used to load data into the Delta table? Also, which Databricks dataset did you load? I can see multiple folders inside the "nyctaxi" dataset. Thanks!
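Not the video's code, but a hedged sketch of one common way to land a Databricks sample dataset as a Delta table. The source sub-folder, target path, and schema options below are assumptions; browse /databricks-datasets/nyctaxi/ in your workspace and pick the folder you actually want.

```python
# Hedged sketch (assumes the notebook's built-in `spark` session): read one of
# the "nyctaxi" sample folders and write it out as a Delta table. The source
# and target paths are assumptions, not the exact code from the video.
df = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("/databricks-datasets/nyctaxi/tripdata/yellow/")   # assumed sub-folder of CSVs
)

(
    df.write
      .format("delta")
      .mode("overwrite")
      .save("/mnt/lake/nyctaxi_yellow")                          # assumed target location
)

# Optionally register it so it can be queried by name.
spark.sql(
    "CREATE TABLE IF NOT EXISTS nyctaxi_yellow "
    "USING DELTA LOCATION '/mnt/lake/nyctaxi_yellow'"
)
```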
@loganboyd 4 years ago
Really like your videos. YouTube is NOT the best source for good, detailed Spark content, but watching videos is better than reading :) We are moving from an HDP platform to Cloudera CDP in the next couple of months. Spark 3.0 and its new features look cool and should be very helpful. Am I understanding the DPP feature correctly if I say it only provides partition pruning when these two things are true: 1. you have a predicate on a column of a smaller dimension table that is joined to a larger fact table, and 2. the join key on the fact table side is an existing partition column?
@AdvancingAnalytics 4 years ago
Hey Logan - yep, I believe that's correct. This means you'll need to have tied your partitioning strategy to a foreign key of some sort to get the maximum benefit from this approach; otherwise you'll never hit it... That said, I'm now questioning myself, so I'll have a quick play over the next couple of days and confirm that it only applies when the join key is your partition column. Lemme get back to you with a definitive answer! Simon
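To make those two conditions concrete, here is a small hedged PySpark sketch. The table names, paths, and columns (fact_sales, dim_date, date_key, calendar_year) are invented for illustration, not taken from the video.

```python
# Condition 1: the predicate sits on the small dimension table.
# Condition 2: the fact table is partitioned on the join key.
fact_sales = spark.read.format("delta").load("/mnt/lake/fact_sales")  # partitioned by date_key
dim_date = spark.read.format("delta").load("/mnt/lake/dim_date")

pruned = (
    fact_sales
        .join(dim_date, "date_key")                  # join key == fact partition column
        .where(dim_date["calendar_year"] == 2020)    # predicate on the dimension only
)

# DPP is on by default in Spark 3.0 (spark.sql.optimizer.dynamicPartitionPruning.enabled);
# the plan should show a dynamicpruning subquery feeding the fact table's PartitionFilters.
pruned.explain(True)
```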
@karol2614 2 years ago
@@AdvancingAnalytics Do you have an answer to Logan's question?
@karol2614 2 years ago
What is the best partitioning strategy for a star schema warehouse? In this structure the big fact tables are related to a large number of dimensions - if you partition on one join key, queries that use a different key will be suboptimal.
@EvgenAnufriev 2 years ago
Could you share your opinion on whether the Data Vault methodology is a good fit for implementation with Databricks Spark and/or Spark Streaming (Azure cloud) and Delta tables? Data size is in the tens of GBs/TBs.
@mohitsanghai5455 3 years ago
Great video... Just a few questions - you applied a filter on the Date dimension table, and Spark filters the data, converts it into a hash table and broadcasts it. At the same time it applies partition pruning on the Sales fact table and only picks up the required records. Does it broadcast those records as well? Is that what the subquery broadcast means? What if the filtered data is also huge? Will Spark still broadcast it, or use a SortMergeJoin in that case?
@AdvancingAnalytics 3 years ago
A broadcast join just means that one of the two joining tables is small enough to be broadcast. So if one side of the join is huge, each worker will only have the RDD blocks it needs, but it will pull a whole copy of the smaller table onto each worker so that it can satisfy all joins. If both sides of the query are huge, then yeah, it'll revert to a SortMerge etc., but at least it will still have pushed the partition filter down to the file system. Simon
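Roughly what that looks like in PySpark - a hedged sketch with invented table names and paths, just to illustrate the broadcast vs sort-merge behaviour described above:

```python
from pyspark.sql.functions import broadcast

fact_sales = spark.read.format("delta").load("/mnt/lake/fact_sales")  # assumed paths
dim_date = spark.read.format("delta").load("/mnt/lake/dim_date")

# Dimensions estimated below this threshold (10 MB by default) get broadcast:
# every worker receives a full copy of the small table, built into a hash table.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

broadcast_join = fact_sales.join(broadcast(dim_date), "date_key")

# If both sides are too large to broadcast, the planner falls back to a
# sort-merge join, but a partition filter on fact_sales can still be pushed
# down to the file scan.
sort_merge_join = fact_sales.join(dim_date.hint("merge"), "date_key")
sort_merge_join.explain(True)
```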
@mohitsanghai5455 3 years ago
@@AdvancingAnalytics Thanks for clearing up some doubts... But what was the subquery broadcast?
@ravisamal3533 3 years ago
Hey, can you index your Spark videos playlist?
@adrestrada1 3 years ago
Hi Simon, do you have a GitHub account so I can start following you?
@AdvancingAnalytics 3 years ago
Not really! I have a Git account for slides/demos from conference talks, but the examples on YouTube are all very quick and hardcoded to my environment. We're looking at ways of sharing the notebooks in a more sustainable way!
@flixgpt 3 years ago
Your accent is irritating and even the subtitles can't pick it up... it's hard to follow.
@Advancing_Terry 3 years ago
If you press ALT+F4, YouTube will change the accent. It's a cool feature.