Advancing Spark - Automated Data Quality with Lakehouse Monitoring

5,874 views

Advancing Analytics

1 day ago

As data engineers, we've built countless ETL scripts that tracked data quality, over and over again. Wouldn't it be lovely if our systems just regularly polled the data and checked the DQ for us? Wouldn't it be great if we could apply a whole set of quality metrics across our tables as standard? Well, that's exactly what Databricks Lakehouse Monitoring does!
In this video, Simon takes a quick look at Lakehouse Monitoring, enables it for a sample table, and runs through the quality metrics that are captured. If you're not already monitoring the quality of data in your Lakehouse... why not start now?
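
For a feel of what enabling it looks like outside the UI, here's a minimal sketch, assuming the Databricks Python SDK's quality monitors API and placeholder catalog/schema/table names (the exact classes and parameters may differ between SDK versions):

```python
# Minimal sketch: enable Lakehouse Monitoring on a table from code.
# Assumes the databricks-sdk package, an authenticated Unity Catalog workspace,
# and placeholder names throughout - check the SDK docs for your version.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorSnapshot

w = WorkspaceClient()

w.quality_monitors.create(
    table_name="main.sales.orders",                    # placeholder table
    assets_dir="/Workspace/Shared/monitoring/orders",  # where dashboard assets go
    output_schema_name="main.sales",                   # schema that receives the metric tables
    snapshot=MonitorSnapshot(),                        # simple snapshot profile (vs time series / inference)
)

# Once the first refresh completes, two Delta tables appear in the output schema:
#   orders_profile_metrics - summary statistics per column and window
#   orders_drift_metrics   - drift comparisons between windows
```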
For more details on Lakehouse Monitoring, check out: learn.microsoft.com/en-us/azu...
If you're after some deep-dive, hands-on Spark training for the festive period, why not check out: advancinganalytics.teachable.com
And if you're embarking on a Lakehouse journey, and want it to deliver serious value, why not give Advancing Analytics a call?

Comments: 8
@saugatmukherjee8119
@saugatmukherjee8119 5 months ago
Your videos are really helpful, Simon! I read about this last week, but these videos are so helpful before I've actually tried things out. Thanks! Right now I have a custom framework where people can put in some YAML configs describing a query and an alert definition based on that query. Upon checking it into the repo, the pipeline creates a saved query, an alert, and, if the alert has a schedule, a job with a sql_task.
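
For anyone curious what that kind of config-driven setup might look like, here's a rough, hypothetical sketch along the same lines. The YAML keys and the helper functions are made up for illustration; in a real pipeline the stubs would call the workspace's saved-query, alert, and jobs APIs rather than printing.

```python
# Hypothetical sketch of a YAML-driven quality-check framework like the one
# described above. Config keys and helpers are illustrative only.
import yaml

CONFIG = """
name: orders_null_check
query: >
  SELECT count(*) AS bad_rows FROM main.sales.orders WHERE customer_id IS NULL
alert:
  condition: bad_rows > 0
  schedule: "0 6 * * *"   # optional; if present, a scheduled job is created
"""

def create_saved_query(name: str, sql: str) -> str:
    print(f"would create saved query '{name}'")
    return "query-123"   # stand-in for the real query id

def create_alert(query_id: str, condition: str) -> str:
    print(f"would create alert on {query_id} where {condition}")
    return "alert-456"   # stand-in for the real alert id

def create_job_with_sql_task(alert_id: str, cron: str) -> None:
    print(f"would create job with a sql_task running alert {alert_id} on '{cron}'")

def deploy_check(cfg: dict) -> None:
    query_id = create_saved_query(cfg["name"], cfg["query"])
    alert_id = create_alert(query_id, cfg["alert"]["condition"])
    if "schedule" in cfg["alert"]:
        create_job_with_sql_task(alert_id, cfg["alert"]["schedule"])

deploy_check(yaml.safe_load(CONFIG))
```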
@ParkerWoodson-ow3ol
@ParkerWoodson-ow3ol 3 months ago
This is fantastic stuff that, like you said, should be done as a practice as part of the lifecycle management of your data. This could be especially helpful if you don't know where to start on implementing data profiling and testing. Definitely helpful for determining the more specific what, where, and when of data testing and monitoring. The generic "turn the Databricks quality monitoring switch on" approach is only going to get you so far: it'll be excessive in some areas and not enough in others. To make it really useful without unnecessarily blowing out your costs, fine-tuning this process is necessary IMO. I'm sure the feature will mature and hopefully allow finer control and extensibility, so I'll be watching. Thanks for always keeping us up to date and covering really useful topics in a "What does this mean to my everyday data job?" context.
@alexischicoine2072
@alexischicoine2072 3 months ago
I think it's important to have an action plan when setting this up. If you don't have a plan to either work with data producers or decide on cancelling a source, then I wouldn't do it. We retired our previous monitoring of null percentages because investigating the causes took too much time, and we have hundreds of data producers.
@yatharthm22
@yatharthm22 3 months ago
In my case, the two metric tables are not getting created automatically. Am I doing something wrong?
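
For what it's worth, a quick way to check (a hedged sketch; the schema and table names are placeholders, and the _profile_metrics / _drift_metrics suffixes follow the documented naming convention) is to list what has landed in the monitor's output schema, since the tables only appear once the first refresh has finished:

```python
# Sketch: check whether the monitor's metric tables exist yet. Assumes a
# Databricks notebook (spark in scope) and a monitor on main.sales.orders
# writing to the main.sales output schema - placeholder names.
existing = {row.tableName for row in spark.sql("SHOW TABLES IN main.sales").collect()}

for expected in ("orders_profile_metrics", "orders_drift_metrics"):
    if expected in existing:
        print(f"{expected}: found")
    else:
        print(f"{expected}: missing (the first refresh may still be running)")
```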
@fb-gu2er
@fb-gu2er 4 months ago
It would be good to mention the cost. Do we get charged in DBUs for the work being done under the hood?
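
The video doesn't dig into pricing, but one hedged way to keep an eye on it yourself (assuming billing system tables are enabled in your workspace) is to aggregate recent DBU usage by SKU and watch how the totals move once monitoring is switched on:

```python
# Sketch: summarise the last two weeks of DBU consumption by SKU using the
# system.billing.usage system table. Assumes a Databricks notebook with system
# tables enabled; compare totals before and after enabling monitoring.
spark.sql("""
    SELECT sku_name,
           SUM(usage_quantity) AS dbus,
           MIN(usage_date)     AS from_date,
           MAX(usage_date)     AS to_date
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 14)
    GROUP BY sku_name
    ORDER BY dbus DESC
""").show(truncate=False)
```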
@NeumsFor9
@NeumsFor9 5 months ago
It's an OK start. However, this would be way more useful if we could monitor the data quality associated with each load batch as well, and if it utilized caching and optimization so that the profiling didn't take this long. We built a process to scan files, write all data profiling results to a metadata repo, integrate those results back into that repo, and query those repos, metrics, and drifts as part of the ETL process, taking actions based on metadata. What you are showing isn't a bad complement to that... but I would prefer to see something more actionable. It is a good start though.
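
In the spirit of making the output more actionable per batch, one hedged pattern is to have the load job read the generated profile metrics and gate on them. This is only a sketch: the table name is a placeholder, and the window / column_name / num_nulls columns are assumptions taken from the documented profile metrics schema.

```python
# Sketch: an ETL-side guard that reads the monitor's profile metrics table and
# halts the load if the latest window shows nulls in a key column. Placeholder
# table name; column names assumed from the documented profile metrics schema.
latest = spark.sql("""
    SELECT window, num_nulls
    FROM main.sales.orders_profile_metrics
    WHERE column_name = 'customer_id'
    ORDER BY window.start DESC
    LIMIT 1
""").collect()

MAX_ALLOWED_NULLS = 0   # illustrative threshold for a key column

if latest and (latest[0].num_nulls or 0) > MAX_ALLOWED_NULLS:
    raise ValueError(
        f"customer_id has {latest[0].num_nulls} nulls in the latest window - halting this load"
    )
```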
@alexischicoine2072
@alexischicoine2072 3 months ago
Like any background serverless task you get billed for, it sounds like it could get real expensive real fast if it's taking six minutes for 60k rows. Probably not something I would try on my personal account that I pay for :).
@alexischicoine2072
@alexischicoine2072 3 months ago
Lazy managed table? I also used to create tables as external, but now that Unity Catalog has UNDROP and brings extra functionality to managed tables, the decision changes.
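
For reference, the UNDROP capability mentioned here looks roughly like this (a minimal sketch with placeholder names; dropped Unity Catalog managed tables are only recoverable for a limited retention window):

```python
# Sketch: recovering an accidentally dropped Unity Catalog managed table.
# Placeholder schema/table names; only works within the retention window.
spark.sql("SHOW TABLES DROPPED IN main.sales").show(truncate=False)  # list recoverable tables
spark.sql("UNDROP TABLE main.sales.orders")                          # restore the latest dropped version
```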