Advancing Spark - Automated Data Quality with Lakehouse Monitoring

7,456 views

Advancing Analytics

Days ago

Comments
@saugatmukherjee8119 1 year ago
Your videos are really helpful, Simon! I read about this last week, and these videos are so helpful before I've actually tried it out myself. Thanks! Right now I have a custom framework where people can check in YAML configs describing a query and an alert definition based on that query; on check-in to the repo, the pipeline creates a saved query, an alert, and, if the alert has a schedule, a job with a sql_task.
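For readers curious what that kind of YAML-driven setup might look like, here is a minimal, hypothetical sketch. The config fields and the stub functions are invented for illustration; a real framework would replace the stubs with calls to the Databricks queries/alerts/jobs APIs (for example via the databricks-sdk).

```python
# Hypothetical sketch of a YAML config driving saved-query / alert / job creation.
# The schema and helper functions below are illustrative stand-ins, not the real
# framework or the exact Databricks SDK calls.
import yaml  # pip install pyyaml

CONFIG = """
query_name: orders_missing_ids
sql: SELECT count(*) AS bad_rows FROM main.sales.orders WHERE order_id IS NULL
alert:
  condition: bad_rows > 0
  schedule: "0 0 6 * * ?"   # optional quartz cron; omit for an on-demand alert
"""

def create_saved_query(name: str, sql: str) -> str:
    print(f"would create saved query '{name}'")            # stand-in for the API call
    return "query-123"

def create_alert(query_id: str, condition: str) -> str:
    print(f"would create alert on {query_id}: {condition}")
    return "alert-456"

def create_job_with_sql_task(alert_id: str, cron: str) -> None:
    print(f"would create job with a sql_task for {alert_id}, schedule {cron}")

cfg = yaml.safe_load(CONFIG)
query_id = create_saved_query(cfg["query_name"], cfg["sql"])
alert_id = create_alert(query_id, cfg["alert"]["condition"])
if cfg["alert"].get("schedule"):                            # only scheduled alerts get a job
    create_job_with_sql_task(alert_id, cfg["alert"]["schedule"])
```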
@ParkerWoodson-ow3ol 11 months ago
This is fantastic stuff that, like you said, should be done as a practice as part of the lifecycle management of your data. It could be especially helpful if you don't know where to start with implementing data profiling and testing, and it definitely helps with determining the more specific what, where, and when of data testing and monitoring. Generically turning the Databricks quality monitoring switch on is only going to get you so far: it'll be excessive in some areas and not enough in others. To make it really useful, and not unnecessarily blow out your costs, fine-tuning this process is necessary IMO. I'm sure the feature will mature and hopefully allow finer control and extensibility, so I'll be watching. Thanks for always keeping us up to date and covering really useful topics in a "What does this mean to my everyday data job?" context.
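For anyone wanting the targeted version rather than the blanket switch, a minimal sketch of creating a monitor on a single table via the Python SDK follows. It assumes a recent databricks-sdk where the API is exposed as quality_monitors (older releases called it lakehouse_monitors), so treat the names as version-dependent rather than definitive.

```python
# Minimal sketch: monitor one specific table instead of everything, assuming the
# quality_monitors API in a recent databricks-sdk (earlier versions named it
# lakehouse_monitors). Table and path names are examples only.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorSnapshot

w = WorkspaceClient()  # auth picked up from the environment / .databrickscfg

w.quality_monitors.create(
    table_name="main.sales.orders",                   # fully qualified UC table
    output_schema_name="main.monitoring",             # where the metric tables land
    assets_dir="/Workspace/Shared/monitors/orders",   # dashboard/assets location
    snapshot=MonitorSnapshot(),                       # full-table profile each refresh
)
```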
@yatharthm22 11 months ago
In my case, the two metric tables are not getting created automatically. Am I doing something wrong?
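One thing worth checking, as a hedged sketch assuming the databricks-sdk quality_monitors API (method names vary between SDK versions): the profile and drift tables only appear in the monitor's output schema after the first refresh succeeds, so inspecting refresh state is a reasonable first step.

```python
# Sketch: trigger a refresh and inspect its state; the two metric tables
# (typically <table>_profile_metrics and <table>_drift_metrics) are only created
# once a refresh completes. API/method names assume a recent databricks-sdk.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
table = "main.sales.orders"  # the monitored table in this example

w.quality_monitors.run_refresh(table_name=table)
for refresh in w.quality_monitors.list_refreshes(table_name=table).refreshes:
    print(refresh.refresh_id, refresh.state)
```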
@fb-gu2er 1 year ago
It would be good to mention the cost. Do we get charged, in DBUs, for the work being done under the hood?
@alexischicoine2072 11 months ago
I think it's important to have an action plan when setting this up. If you don't have a plan to either work with data producers or decide on cancelling a source, then I wouldn't do it. We had previous monitoring of null percentages that we retired because investigating the causes took too much time, and we have hundreds of data producers.
@NeumsFor9 1 year ago
It's an OK start. However, this would be way more useful if we could also monitor the data quality associated with each load batch, and if it utilized caching and optimization so that the profiling didn't take this long. We built a process to scan files, write all data profiling results to a metadata repo, integrate those results with the rest of the metadata, query those repos, metrics, and drifts as part of the ETL process, and take actions, all based on metadata. What you are showing isn't a bad complement to that, but I would prefer to see something more actionable. It is a good start though.
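For illustration, here is a rough, hypothetical sketch of that metadata-driven pattern: profiling results for a load batch are read back from a metadata store and the ETL step acts on them. The field names, thresholds, and the in-memory stand-ins are invented for the example; real implementations would read these from metadata tables.

```python
# Rough sketch of a metadata-driven quality check inside an ETL step. All names,
# thresholds, and the in-memory "repo" here are hypothetical stand-ins for real
# metadata tables and rules.
from dataclasses import dataclass

@dataclass
class BatchProfile:
    batch_id: str
    table: str
    row_count: int
    null_pct_order_id: float

# Pretend this came from the metadata repo for the batch we just loaded.
profile = BatchProfile("2024-06-01T06:00", "main.sales.orders", 60_000, 0.7)

# Rules would normally live in metadata too, not in code.
rules = {"max_null_pct_order_id": 0.5, "min_row_count": 1_000}

def decide(p: BatchProfile) -> str:
    if p.row_count < rules["min_row_count"]:
        return "fail"         # stop the pipeline, the batch looks truncated
    if p.null_pct_order_id > rules["max_null_pct_order_id"]:
        return "quarantine"   # route the batch aside and alert the producer
    return "continue"

print(decide(profile))  # -> "quarantine" for this example batch
```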
@alexischicoine2072 11 months ago
Like any background serverless task you get billed for, it sounds like this could get real expensive real fast if it's taking six minutes for 60k rows. Probably not something I would try on my personal account that I pay for :).
@alexischicoine2072 11 months ago
Lazy managed table? I also used to create tables as external, but now that Unity Catalog has UNDROP and brings extra functionality to managed tables, the decision changes.
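For context, a small sketch of the UNDROP behaviour mentioned here, assuming current Unity Catalog SQL syntax and a Databricks notebook where `spark` is already defined; the table names are examples and the recovery window is limited.

```python
# Example of recovering a dropped Unity Catalog managed table (run in a Databricks
# notebook where `spark` exists). UNDROP only works within a limited retention
# window after the drop; names below are illustrative.
spark.sql("DROP TABLE main.sales.orders")      # the accidental drop
spark.sql("UNDROP TABLE main.sales.orders")    # bring the managed table back

# If needed, list recently dropped tables in the schema (syntax per current docs):
spark.sql("SHOW TABLES DROPPED IN main.sales").show()
```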
Behind the Hype - The Medallion Architecture Doesn't Work
21:51
Advancing Analytics
34K views
Will AI Replace Data Engineering? - Advancing Spark
25:38
Advancing Analytics
3.3K views
Dynamic Databricks Workflows - Advancing Spark
21:56
Advancing Analytics
6K views
Advancing Spark - Give your Delta Lake a boost with Z-Ordering
20:31
Advancing Analytics
30K views
Databricks News Oct-Nov 2024 - Advancing Spark
33:19
Advancing Analytics
1.6K views
Databricks - Data Quality - PyDeequ - Introduction
34:49
Apostolos Athanasiou
797 views
Advancing Spark - Row-Level Security and Dynamic Masking with Unity Catalog
20:43
Building Production RAG Over Complex Documents
1:22:18
Databricks
23K views
Power BI on Databricks Best Practices
50:51
Databricks
19K views