Behind the Hype - The Medallion Architecture Doesn't Work

29,640 views

Advancing Analytics


1 day ago

Since the dawn of the Data Lake, we've been arguing about what to call the different layers that we move data through. The Medallion Architecture has become supremely popular in recent years, and with Microsoft Fabric embracing the Lakehouse approach, it's never had more attention than it does currently!
But there's also a huge amount of confusion - what do the different layers mean? What processes do you do where? How do you translate the entirety of your data curation into just three steps... and does it always make sense?
In this video, Simon pulls apart the expected journey of data through a lakehouse architecture and maps the medallion architecture to the deeper, richer approach that has become the Advancing Analytics standard.
If you're setting out on a journey to build out a Medallion Architecture, whether in Databricks or Microsoft Fabric - get in touch with the experts from Advancing Analytics

Comments: 108
@ParkerWoodson-ow3ol · 7 months ago
Thank you!!! Finally, someone talks about the Medallion Architecture as a framework and not the rule that you need a bronze, silver, and gold layer in your data lakehouse. For years I received that basic explanation as the rule and "best practice", even from all of the so-called experts. After learning from experience with implementations, it's nice to hear affirmation - it's what you decide is appropriate to the context of your team and data product that determines the various landing zones of your data. You provide a good way to communicate that without just saying "It depends" - which it does, and nobody likes that answer 😂
@matthewbradley2873 · 10 months ago
A key point omitted here is that the medallion architecture is also referred to as a 'multi-hop' architecture. I always talk about a multi-hop design approach with medallion as a starting point. In data warehousing we used to talk about 'staging'. This was never a single hop transformation to the dimensional model, but many smaller hops - in a similar way to medallion zones.
@MDevion · 11 months ago
Finally somebody talking common sense. It's just the layer structure that has always existed. Whether you call it bronze/silver/gold or stage, help, dim, fact, etc., it's a potato, potahto discussion. In other words: semantics.
@Milhouse77BS · 11 months ago
I call it ETL: Extract (Bronze), Transform (Silver), Load (Gold)
@FR-099 · 11 months ago
@Milhouse77BS what if you ELT? The point here is that BSG covers several transformation types and, in some cases, several copies of the data.
@Milhouse77BS · 11 months ago
@FR-099 BSG - I think of "Battlestar Galactica" first :)
@david0aloha · 9 months ago
@Milhouse77BS Why? ETL (and ELT) are processes. ETL can take raw data from other systems and load it into a Bronze layer. ETL can take raw Bronze data, make cleaned Silver data out of it, then load it into your Silver layer. ETL can take cleaned Silver data, join it with other Silver datasets to make a Gold dataset, then load it into your Gold layer. ETL can also exist entirely outside of medallion architecture and take whatever data you want from a bunch of source systems to produce custom reports for a business unit, director, C-suite, etc. Think of it like this: the layers are the nouns, and ETL (or ELT) is a verb that can turn parts of one layer into another - or be used entirely outside of a medallion architecture.
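The nouns-and-verbs framing above can be sketched in plain Python. This is a hypothetical illustration, not real pipeline code - the record shapes and helper names are made up - but it shows the idea: layers are data, and ETL is a function applied between them.

```python
# Hypothetical sketch: layers as nouns (data), ETL as the verb between them.
def etl(source, transform):
    """Extract rows from one layer, Transform them, ready to Load into the next."""
    out = []
    for row in source:
        t = transform(row)
        if t is not None:  # transforms may drop bad records
            out.append(t)
    return out

# Bronze: raw data as extracted from a source system
bronze = [
    {"id": 1, "amount": "10.5", "country": " uk "},
    {"id": 2, "amount": None,   "country": "US"},   # bad record
    {"id": 3, "amount": "7.0",  "country": "uk"},
]

# Bronze -> Silver: clean and conform types
def clean(row):
    if row["amount"] is None:
        return None  # drop records that cannot be parsed
    return {"id": row["id"],
            "amount": float(row["amount"]),
            "country": row["country"].strip().upper()}

silver = etl(bronze, clean)

# Silver -> Gold: aggregate for consumption
gold = {}
for row in silver:
    gold[row["country"]] = gold.get(row["country"], 0.0) + row["amount"]

print(gold)  # {'UK': 17.5}
```

The same `etl` verb is reused for every hop; only the transform changes - which is why the layers themselves say nothing about *how many* hops you need.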
@VinayakKommana · 4 months ago
@Milhouse77BS 😂😂😂, nice sense of humor
@kaurivneet1 · 3 months ago
Simon! I am so glad you made this video. It's been a constant challenge at different clients explaining that medallion isn't strictly 3 layers. The Databricks blogs and documentation get vague and prescriptive at the same time.
@LogicallySoundMM · 6 months ago
This is the kind of content that deserves a love, insightful and celebrate button.
@pauloneufneufneuf · 8 months ago
Thanks for this. Having worked on DW ETLs for decades, I wondered what was new about 'Medallion architecture'... answer: Not a lot. Same series of steps, with a different name. The new bit - albeit not that new these days - is the reporting model sitting downstream of the DW / Analytical Model.
@andrewinkler9538 · 3 months ago
Some 12 years ago, in the far-forgotten times of PowerCenter and Teradata, we had a Data Warehouse divided into three parts: staging area, work area and DWH. One for raw, flat files, another for curated data, and the last one for aggregated data. I have realized that, in this market, we recycle old concepts with new names, so we keep the environment fresh.
@panostanilian · 11 months ago
Such a nice way of explaining this. I always wondered why everyone nowadays is pushing for this "architecture", which can make sense in a lot of scenarios - but I think not as many as people tend to think.
@Simondoubt4446 · 10 months ago
Thoughtful, articulate, and impactful - as usual! Thank you for this.
@geehaf · 11 months ago
Wise Words! I think "Make it your own" is the message. Great video.
@IanSmith22101980 · 11 months ago
Thank you for sharing. The message I hear and champion is: governance 😊. Make sure it is clear what can be found where 😊
@gtoms · 10 months ago
While I agree in principle I think the value from using Bronze, Silver and Gold is that users understand generally what to expect of the data in each layer. I talk to users about Bronze data so often as a kind of warning - it's a mirror of source so don't expect it to be clean or conformed. So often we forget to define what is meant by various terms and a new person joining an organization from outside where they may have worked with these terms can immediately understand if we propagate their usage. So yeah, it's nothing new, but using the terms consistently has value in itself. Thanks for the video, as usual, your channel always has quality content - it's the Gold layer of analytics videos :-)
@paul_devos · 10 months ago
I have definitely built "data solutions" with this sort of thinking. The one other thing I'd include here, which wasn't covered, is heavy PII and especially PHI/HIPAA data, as you often need a governance structure limiting access to certain users and only certain environments/tools. This is where you may have STG/DEV/TEST/PROD environments. It gets complex; the key is making sure it's clear however and wherever the data sits, and that it can be found by users.
@andrekamman1971 · 11 months ago
Regarding the names of the layers, I'm still on the fence between the airlock code of the planet Druidia or the album titles of Led Zeppelin. Joking aside, love this video - I've recommended it a bunch already!
@RaiOkami · 11 months ago
Awesome breakdown! I'm glad I understood the concept very early on and didn't end up stuck with that misconception. The way the medallion architecture has been presented is oversimplified. It's okay to present it as-is as a start, BUT it being more of a logical layer should be emphasized. That, and business logic and needs are also a major consideration.
@liyashuklinejg · 3 months ago
Thank you! Appreciate the effort. It actually makes way more sense
@julius8183 · 5 months ago
What a great video - again! You're a legend. Well spoken and explained.
@christophergray3518 · 1 month ago
Rewatching this: Simon missed a pun with the "layers of the lake." He could have said different depths, or depths of the data, to lean into the lake analogies. :)
@BergHageman-ry1xr · 1 month ago
Thank you! At which stage do we transform data into 1) Parquet files (schema validation after converting raw JSON to Parquet), and then 2) Delta tables in your schema? (In the Silver layer, at the cleansed-and-validated stage, when data is transformed to Delta tables?)
@repsdorph01 · 10 months ago
Brilliant! Thanks for making a simple and clear video about the Medallion Architecture. 🙂
@jeanchindeko5477 · 11 months ago
The misconception here is that too many people think of Bronze, Silver, Gold as physical layers, when they should be seen as logical layers! And too many think about it a bit too literally. I'm not sure Databricks ever mentioned you'll have only one step of refinement in your Silver layer or Gold layer. The number of hops should be tied to business cases and not predefined in advance. All that can now be managed with tags on your tables or databases in most modern data warehouses.
@DomGidziuSzymciu · 11 months ago
A very interesting perspective on consolidated analytics and ETL, although I may not completely agree with it. I must emphasize that the medallion architecture's contribution to data processing is invaluable, offering numerous benefits across the enterprise. It is simple, comprehensible, and, like any best practice, should be encouraged to be followed.
@nickmcdermaid3015 · 10 months ago
It offers nothing new. It's just layers - a concept that has been around for decades.
@JulioAcunaMoreno · 7 months ago
Hi Simon. I found your video very insightful. I have a question, though. How do you model the Silver layer? Would you use a set of flat tables? If not Data Vault, what would you suggest for the conformed data?
@JonWrye · 1 month ago
Simon, when you say that your raw layer (schema validated) is APPEND ONLY, do you mean that you're not checking for garbage and duplicate records? I'm assuming that you're still delta merging to avoid unnecessary duplication from inserting the same record over and over again, correct? Thanks for the great content!
@AdvancingAnalytics · 1 month ago
Hey - nope, with RAW we literally do a straight append, so raw will absolutely have duplicates if we are sent the same records repeatedly (ideally we avoid this during source data extraction). We merge into the BASE layer and deal with deduplication etc there
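The append-vs-merge split described above can be illustrated with a plain-Python sketch. This is hypothetical code standing in for Delta operations (the real thing would be an append write to RAW and a `MERGE INTO` on BASE): RAW is a straight append that tolerates duplicates, while BASE keeps one row per business key.

```python
raw = []   # RAW layer: append-only, duplicates allowed
base = {}  # BASE layer: one row per business key, deduplicated on merge

def append_to_raw(batch):
    raw.extend(batch)  # straight append - no checks at all

def merge_into_base(batch):
    for row in batch:
        base[row["id"]] = row  # upsert: the latest version of a key wins

batch = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
append_to_raw(batch)
merge_into_base(batch)

# The source re-sends the same records, one of them updated
resend = [{"id": 1, "value": "a"}, {"id": 2, "value": "b2"}]
append_to_raw(resend)
merge_into_base(resend)

print(len(raw))          # 4 - RAW keeps the duplicates
print(len(base))         # 2 - BASE is deduplicated
print(base[2]["value"])  # b2 - the merge applied the update
```

RAW stays a faithful record of what was received; BASE is where the "current state" lives.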
@Juan-PaulHynek · 11 months ago
Brilliantly said Simon! Thanks for this, keen for more videos like this in the future.
@richardcroft3676 · 2 months ago
This video makes me so so happy.
@sivakumarkanagaraj · 5 months ago
Thank you. You solved a million-dollar question I had.
@rankena · 10 months ago
In many cases, a file comes with perfect data. Would you still move it through and store it in all layers, or skip a couple of them and keep that file in, for example, Landing, Raw and Semantic? If you have terabytes of data, it would not be efficient to move and store data through all layers. And if you decide to skip some layers, you can end up in a situation where you need to "join" two tables from different layers, because one of the tables does not exist in one of the layers. How do you deal with such an anomaly?
@xkhokokox · 7 months ago
Great great video. Every minute was to the point
@DAVIDODONNELL-t9w · 2 months ago
Brilliant rant - loved it
@zycbrasil2618 · 11 months ago
It's just about logical concepts that require abstractions based on your business cases. Period!
@stephanjoubert8568 · 11 months ago
Hi Advancing Analytics team, at what layer do you start to use the Delta format?
@AdvancingAnalytics · 11 months ago
As soon as we can get the data into a coherent structure - so for us it's Delta from RAW onwards!
@johnnywinter861 · 11 months ago
Oomph... love me a metric layer
@sajeershahul8361 · 1 month ago
A BIG thank you!
@caffin999 · 1 month ago
Great video, thanks. I had a question about the base layer, please. What use cases do you typically see for storing the base layer? I've typically seen the base layer in the form of views that allow the flow from raw into enriched. I assume you store the base layer because you see use cases around downstream users/consumers wanting a place to source clean, untransformed data? From a wider enterprise perspective, would it not be better for downstream consumers in this scenario to take data from enriched, which is often closer to the company's logical data model? This promotes a more uniform understanding enterprise-wide and can make integration easier. Thanks.
@Anbu_Sampath · 11 months ago
Initially this sounded a lot like an early-days data warehouse with landing/staging tables and main tables. With the lakehouse, we can create multiple tables/views from the same sources without much risk - they need not be bronze, silver and gold.
@jonlunn4114 · 11 months ago
Great overview. For something like Power BI, which prefers a star schema, would there be some overlap between the curated and semantic layers? Or would some items in the curated layer be duplicated into the semantic layer, with other objects extended by the metrics added where needed?
@Milhouse77BS · 9 months ago
Yes, for example your Date dimension in curated layer may be copied as is into the semantic layer. In Microsoft Fabric it would be a link or shortcut, so data not actually copied.
@jonlunn4114 · 9 months ago
Ah yes shortcuts in Fabric! That’s a nice idea!
@davidroberts6774 · 11 months ago
Excellent synthesis, thank you
@ledinhanhtan · 3 months ago
Neat and concrete! 🎉
@NeumsFor9 · 10 months ago
Anyone doing data management for a while knows there are stages to any data management solution. It's a marketing tool. The key is defining all the sublayers and data management tasks that have to happen, such as reference data management, ingestion, intermediate reshaping, etc. The biggest issue I have with the medallion architecture? Most people who know metallurgy would more than slightly shudder if data engineering were ever explained to them 😂😂 At the same time, one should not just dismiss someone suggesting the architecture. It may just be the way they learned. My suggestion is to ask where folks think all the subtasks should happen.
@NeumsFor9 · 10 months ago
To me, even the gold layer could be the Kimball base star schema for ad hoc reporting. The platinum layer can be fed by a semantic model. Then the advanced Kimball concepts, such as second-level consolidated fact tables or other materialisations of drill-across results, can feed dashboards... if indeed that is part of the requirements. The reality is that speed to market and the skill sets of a team will short-circuit this approach. That's when you have big trouble and an evolving mess...
@mimmakutu · 11 months ago
I have a question: where data is really large, like truly billions of records, how many of these layers can be materialised vs how many should be logical?
@SeanRiordon · 10 months ago
At which point are you doing any required de-duplication? Base?
@Iangatley · 10 months ago
This aligns pretty well with how I have been designing lakes. I have had internal debates about whether structured data coming into the lake should be "unstructured", but I see your logic. My question relates to importing large quantities of data from an external OLTP source. Do you ETL only daily changes, or is Raw a full copy (which doesn't work when receiving a few million records a day)? Main question: where do you first filter to stop inserting a few terabytes each day?
@AdvancingAnalytics · 9 months ago
We try and ensure all loads from source systems are incremental, but this relies on us having the option - not all sources have change data capture / timestamps / change feeds etc. So we parameterise & config up the data extracts, so we can do incremental where possible, then date partition/full snapshot where we have no other choice!
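That parameterised approach might look something like the following sketch. The config shape, source names and function are all made up for illustration: each source declares whether it supports incremental extraction, and the loader falls back to a date-partitioned full snapshot when it doesn't.

```python
from datetime import date

# Hypothetical per-source config: which extraction pattern each source supports
sources = {
    "orders":    {"incremental": True,  "watermark_column": "updated_at"},
    "customers": {"incremental": False},  # no CDC / timestamps / change feed
}

def build_extract_plan(name, last_watermark=None, run_date=date(2024, 1, 15)):
    cfg = sources[name]
    if cfg["incremental"] and last_watermark is not None:
        # Incremental where possible: only pull rows past the last watermark
        return {"mode": "incremental",
                "filter": f"{cfg['watermark_column']} > '{last_watermark}'"}
    # No other choice: full snapshot, partitioned by load date
    return {"mode": "full_snapshot",
            "partition": f"load_date={run_date.isoformat()}"}

print(build_extract_plan("orders", last_watermark="2024-01-14"))
print(build_extract_plan("customers"))
```

The point of driving this from config is that adding a new source is a data change, not a code change.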
@JulesTuffrey · 11 months ago
Excellent vid! Thank you.
@riedwaanbassadien · 3 months ago
Preach it! Sober up everyone, this is real world.
@mooxxmusic · 11 months ago
This is really helpful - thanks!
@atulbansal8041 · 11 months ago
@simon does this also mean that we have 6 layers of the same data stored in the data lake (some raw, some cleaned, some aggregated)?
@AdvancingAnalytics · 11 months ago
Yep, potentially. If you require it and there are business use cases for each materialised layer.
@DollyBastard · 11 months ago
In my projects I usually have Input/Raw, Enriched+Merged (merge small files into a bigger daily file), and any Gold layer is already in a use-case optimized data store (DWH, SQL DB, ElasticSearch, and so on). The concept of processing everything in object storage and then mirroring to data stores seems a bit surprising.
@simonkarumbi · 13 days ago
What about zipped/compressed files? Land as .zip or their uncompressed version?
@AdvancingAnalytics · 11 days ago
Compression in a lake can force reading to be single-threaded, so we tend to avoid traditional compression (i.e. .zip) as it'll end up going slower! However, other types of compression (such as Parquet with Snappy) use what's known as splittable compression, which CAN be read in parallel while still giving you the benefit of smaller reads.
@mladum · 11 months ago
So, when using UC in Databricks, do you create a catalog per layer, or do you divide it using schemas? I'm always in doubt about the best approach, especially with DLT, where only one destination catalog is possible.
@jimthorstad · 11 months ago
I generally see catalogs named dev, stg, prod with similar schemas and cloud storage folder structures in each. This makes it easier to embrace CICD processes since the catalog name can be set as a default in the workspace admin settings and the code can run unchanged as it moves across the environments. Sometimes people want to organize bronze tables in a special schema (iot_bronze) and the silver and gold objects in another (iot). This makes it easier to hide the bronze tables from analysts and dsml engineers if desired. There are also line of business strategies for catalogs. I think more can be written about the patterns and pros/cons to help architects setting these things up.
@AdvancingAnalytics · 11 months ago
This usually comes down to a size thing - if you get to a certain size of business, having a single "prod" catalog can be too limiting trying to do all security at the schema level. We've seen dev/test/prod or dev_raw/test_raw/prod_raw where you then want to break raw down into further data product oriented schemas, rather than have a single catalog that encapsulates all layers for all data products. You could well extend that to the product level if you had large, org-wide security that needed to be secured at the workspace level using catalog-workspace bindings. Essentially - design it to the granularity of security you require, balanced with usability for your data consumers
@manishkumar-hp1hw · 9 months ago
I have seen some designs where the gold layer is kept in a separate database rather than in Databricks. What can be the thinking behind this?
@AdvancingAnalytics · 9 months ago
That's from the olden days when Spark wasn't great at joins & star-schema queries. These days I don't see a driver for adding another tech into the architecture so we keep everything lakehouse-based!
@CetilLee · 18 days ago
@@AdvancingAnalytics Saw a different video about this subject before seeing yours, good video by the way, you confirmed some of my thoughts about medallion. What he said is that if you need granular security on your tables and columns (RLS supported in both lake and DW) then DW is where you need to put your data facing the consumer. I didn't catch any other advantages other than perhaps personas working with dimensional modelling being more accustomed to T-SQL, which may or may not be a real case.
@nickmcdermaid3015 · 10 months ago
YES! It's not an architecture! I see so many old ideas rehashed with new names for marketing purposes. Layer approach, columnstores. These things have been around for decades
@carlmerritt4355 · 5 months ago
Amazing video - subbed
@ra2i3 · 7 months ago
Spot on buddy!
@jeanchindeko5477 · 11 months ago
Simon, are all those hops really necessary? How do you manage your PII data with all those hops?
@AdvancingAnalytics · 11 months ago
Nope - these are all of the hops a customer might go to, they are absolutely not ALL necessary to ALL organisations. Design the layers to your required processes. PII - depends on your use case. If you don't need it in your lake, strip it out when first performing structural validation. If you need it but need to obfuscate it, then do that during data cleansing and lock down your RAW layer accordingly
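The obfuscate-during-cleansing option can be sketched with the standard library. This is a hypothetical example - in practice it would be a Spark transformation, and the salt would come from a secret store rather than a literal: PII columns are replaced with salted hashes, so rows stay joinable without exposing the values.

```python
import hashlib

PII_COLUMNS = {"email", "name"}
SALT = "replace-with-a-secret"  # assumption: fetched from a secret store in real use

def pseudonymise(row):
    """Replace PII values with stable salted-hash tokens; pass other columns through."""
    out = {}
    for col, val in row.items():
        if col in PII_COLUMNS and val is not None:
            digest = hashlib.sha256((SALT + str(val)).encode()).hexdigest()
            out[col] = digest[:16]  # stable token: joinable, not readable
        else:
            out[col] = val
    return out

row = {"id": 42, "email": "jane@example.com", "name": "Jane", "spend": 99.0}
clean = pseudonymise(row)
print(clean["id"], clean["spend"])     # non-PII passes through unchanged
print(clean["email"] != row["email"])  # True - PII is tokenised

# Determinism is the point: the same value always maps to the same token,
# so joins across tables on the tokenised column still work.
assert pseudonymise(row) == clean
```

Salted hashing is pseudonymisation, not anonymisation - the RAW layer holding the originals still has to be locked down, as the reply above notes.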
@vjraitila · 11 months ago
@@AdvancingAnalytics Somebody already asked when you typically transition to Delta. However, there is also a third angle regarding PII specifically. One might need - and have a justification - for using the actual data, unobfuscated. However, there should still be a mechanism for "surgically" removing it when a legal basis for processing no longer exists (e.g. a person is exercising a right to erasure / right to be forgotten). This becomes potentially problematic when persisting data across ALL the stages. How do you typically handle this? Transient landing, but persistent RAW - the latter in Delta to enable more targeted deletions / anonymization?
@jimthorstad6226 · 11 months ago
If you use Databricks, Unity Catalog offers capabilities to help manage PII data when you need to retain it for authorized people/purposes: Volumes let you control who can see raw data files containing PII before they're ingested into Delta, you can create reusable functions in Unity Catalog to obfuscate and filter PII data from unauthorized people, and Delta with deletion vectors helps you implement the GDPR "right to be forgotten" without introducing performance issues in your tables.
@vjraitila · 11 months ago
@@jimthorstad6226 Those are all good points and advertisement of features within Databricks, but Simon used the age-old argument of "retaining all data (in all stages), because storage is cheap in the cloud". However, cost is not the only factor when deciding what data to retain - particularly in its raw form. Data engineers are not some special people who are privileged to handle PII, when the company as a whole does not have the right to process it (anymore).
@jeanchindeko5477 · 11 months ago
@@vjraitila on top of that you need to remember information kept in previous versions of your delta table used for time travel.
@samulimustonen2047 · 4 months ago
I was asked about this medallion architecture in a job interview and I didn't know it, except that I'd heard the name. The interviewer explained it enthusiastically and I just commented that it's the same as any other process; he didn't seem to like my response. I didn't get picked for the role - and also didn't want it, as the interviewer seemed arrogant. It's not rocket science.
@derkachm · 10 months ago
Regarding the semantic layer: how do you usually implement it with a lakehouse? Is it purely Power BI-managed, or do you use Azure Analysis Services?
@nickmcdermaid3015 · 10 months ago
Power BI data model has pretty much replaced Azure Analysis Services. You shouldn't use AAS for anything new
@derkachm · 10 months ago
@nickmcdermaid3015 thanks for the answer!
@patronspatron7681 · 11 months ago
Simon, is there a set of file formats that you conform to in the raw layer? For example, if the data source is CSV or Excel, do you transform that into Delta in the pipeline from data source to the raw layer? Also, you store REST-related data as JSON in raw - do you transform it to Delta at a later stage?
@AdvancingAnalytics · 11 months ago
Once you've picked it up from Landing, it's in a spark dataframe. When it's in a spark dataframe, land it as Delta, always (unless exporting data for a very specific integration)
@patronspatron7681 · 11 months ago
@AdvancingAnalytics so if I understand correctly, you land the data in the data lake in its original file format, and thereafter all data will be conformed to Delta in the next layer?
@ManCar1608 · 4 months ago
@patronspatron7681 Land the data in the landing zone in whatever format - JSON, Parquet, CSV. When writing your notebook, create a Spark dataframe on the landing data and then save it as Delta in the raw or bronze zone.
@goldwhispers · 11 months ago
There is so much confusion around the definition of what goes in the medallion layers, not least because people don't necessarily distinguish the different types of modelling involved, whether it's third normal form or dimensional models. The aggregation aspect is confusing to folks; these days many data engineers don't really understand what a dimensional model is. In the old days a BI person was end to end, but now the raw-to-clean-conformed work is done by data engineers - that's my recent experience - and people think you need to aggregate your facts in gold, when actually that's not true, because you would mostly always have detailed facts at the lowest grain possible, and those should sit in gold. Or is this just splitting hairs? At the end of the day, like you say, as long as we're all on the same page it doesn't matter, but as teams and companies get bigger I can see a lot of mismatches happening, just like I've seen so many data warehouses built incorrectly. The definition of what you're doing needs to be clearly understood and communicated. Your layers make more sense to me as a long-time end-to-end BI person, because it's quite clear what they mean.
@nickmcdermaid3015 · 10 months ago
Agree. All this mention of "aggregates" to me is a waste of breath. Capture facts at lowest grain and optionally add aggregates if you want. But you shouldn't add aggregates for performance. Performance should be solved in the platform. IMHO the best case for aggregates is combining two facts of different grains to make analysis simpler.
@ManCar1608 · 4 months ago
Agree 100%
@fishsauce7497 · 6 months ago
Spot on
@MelvinAbedini · 11 months ago
One request: can you please use a better mic? Your content is so good, but it needs better sound quality.
@howardh6857 · 1 month ago
You need it because without the terminology pipelines behave differently
@fpa89 · 9 months ago
this is gold
@AdvancingAnalytics · 9 months ago
*Curated ;)
@mredmister3014 · 3 months ago
Show some examples. 16 minutes in and still talking hypotheticals.
@BarneyBarney-y5l · 1 month ago
Is this guy actually advocating for copying data 6 separate times? Should rename this video "How to Bankrupt Your Company Through ETL".
@AdvancingAnalytics · 1 month ago
Hah, not at all - advocating for "design the layers as appropriate for your company, not accept the 3-layer medallion default". Sure, I've built many 6-layer, rigorous architectures for global FinServ, but if you don't need that, don't build that? Wanna go for 2 layers with a virtual logic layer? Go nuts, as long as you're making that choice that's right for the org
@DollyBastard · 11 months ago
Is the Semantic layer based off of the Curated or the Enriched layer?
@calmhorizons · 11 months ago
100% Agreed - I think the medallion "architecture" is really just an extremely poor analogy because it has caused confusion rather than simplifying a concept. Bronze, Silver, Gold makes non-technical users think of Good, Great, Best - but data is not an Olympic event. Bronze, Silver, Gold is telling me what exactly? Is the Bronze data 3rd best? Is the Silver data only slightly less good than Gold? Raw, Prep, Land, Stage, Star, Semantic, BI - words like this have semantic meanings that relate to the intended purpose and integrity of those layers. Thanks for illuminating the source of this abomination. 😁
@FR-099 · 11 months ago
yeah when I first heard a management consultant use those colors I wept a tear for the death of meaning and nuance I knew would follow
@MonsterManStan · 7 months ago
Typically I just manually read the entire database and aggregate the results in my mind.
@CetilLee · 18 days ago
Care to outline the major differences between the two gold layers in this example? Are you building all the different dimensional models your business needs in curated, using Spark and the lake, and then another layer with basically the same model, but with the addition of a semantic model on top with measures and such?
@AdvancingAnalytics · 17 days ago
Yep, exactly. Just wanted to capture that there is the logic stored in the lake via materialised delta tables, then additional context-aware measures in a BI tool such as Power BI semantic models!
@SPGremlin · 1 month ago
Thanks for the video. I have pretty much the same vision (and apply it with other clients), and this is very well structured. Yes, the oversimplified concept of this "medallion" architecture does indeed greatly confuse clients for the sake of catchy marketing. I have two remarks where I see things slightly differently: 1) The "base" tables have schema enforced and changes applied; each is a 1:1 replica of the source system's table - essentially the lowest "queryable" dataset for any consumption, but before any logical conformations and transformations. In my view, this should be part of the Bronze "grade" (let's not call it a layer), while you depict it as Silver. With any kind of incremental data extraction pattern, the first two layers ("landing" and "raw" in your slides) are append-only and contain change-sets, which makes them practically unqueryable. We use the last data portion as part of an ETL job to properly apply changes to the "base" table, but any other user or downstream job can't consume data from these layers, as it would require enormous complexity and compute to replay the changeset from the beginning and reconstruct the current state of the data. In many cases it's not even possible. If we say that the "base" table (the first actually queryable one) is Silver, then our Bronze is nothing more than an unqueryable data dumpster. This is very misleading to the client: while they are stuck with the Medallion misnomer, it leaves them with only two "broadly recognized" metals for their data layers, which is not enough. My take is that "base" == Bronze, and what's below it (any kind of stacked append-only changeset datasets, whether unstructured or structured), if retained at all, is pre-Bronze. Don't you see it this way? 2) The layer naming you suggest is consistent and makes sense, but I see a lot of confusion surrounding the term "raw". The data engineering world understands it as closer to landing (raw landed data), as you do. But business data consumers use "bronze"/"base" and "raw" interchangeably - to them it's all "raw" (as in untransformed) source system data, the same as in the source system itself. At least I see this perception a LOT in the field. If I were starting a new data platform from scratch today, I would reserve the word "raw" for the same thing as the "base" tables (and also map it to Bronze in the medallion paradigm), while calling the two leftmost layers (landed unstructured files and schema-validated) landing and staging accordingly - of which, depending on the ETL pattern, only one, both, or neither may exist.
@nneomanwamaghinna4948 · 1 month ago
Thank you. I've been struggling internally with this and have more clarity and assurance now.
@spillanemike · 11 months ago
Thoroughly enjoyed that Simon, thanks!
@allthingsdata · 10 months ago
So what do I call the delta between my Delta tables now 🙂
@Marcus-sh8sq · 9 months ago
Alpha Delta Zeta. Pretend it's a fraternity.