Advancing Spark - Databricks Delta Change Feed

14,348 views

Advancing Analytics

Days ago

The Databricks Change Feed enables CDC, or Change Data Capture, in the Spark environment - this is pretty huge. Keeping track of changed records can be a hugely inefficient exercise, comparing masses of records to determine which ones have been changed by upstream events. With the Delta table change feed, we can efficiently keep track of exactly which records have changed!
In this video, Simon looks at the initial release, exploring how the change feed works under the covers, what happens in the transaction log, and how you get started with CDC in your lake!
For more details on the new change feed functionality - check out the docs here: docs.microsoft.com/en-us/azur...
As always - don't forget to hit those like & subscribe buttons, and get in touch if we can help with your advanced analytics objectives!
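If you want to follow along, here's a minimal sketch of switching the feature on and pulling changes back out - the table name and version numbers are illustrative, not from the video. Note the feed only captures changes made after the property is set; you can't retrofit it onto historic versions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Switch the change feed on for an existing Delta table
spark.sql("""
    ALTER TABLE silver.customers
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Batch-read the changes captured between two table versions
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .option("endingVersion", 5)
    .table("silver.customers")
)

# Every row carries _change_type, _commit_version and _commit_timestamp
changes.select("_change_type", "_commit_version", "_commit_timestamp").show()
```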

Comments: 22
@Advancing_Terry · 3 years ago
Who doesn't like storage costs? Great video, Si.
@Sunkarakrwutarth9389 · 3 years ago
Very nice feature!
@ArcaLuiNeo · 3 years ago
Thanks for this excellent video. Would you have time to do a video on some of the better practices for doing "proper" engineering on your local machine: automated testing, CI, etc.? Especially since you mentioned the open-source Delta table...
@AdvancingAnalytics · 3 years ago
Our very excellent senior engineer Anna has been doing lots of work in this space. We've got some more DevOps/CI/CD/testing-focused topics on the roadmap, hopefully in the next month or two! Simon
@alexischicoine2072 · 2 years ago
Hi Simon. Love your videos. Do you know if there's a library or framework that exists to propagate changes from a table to an aggregated table or other downstream modifications? I started my own as I couldn't find anything, but I'd rather use something existing.
@baatchus1519 · 3 years ago
👍👍👍
@felipecastro3710 · 1 year ago
Hey Simon! I usually see people showing examples where the _commit_version is hardcoded. What would be a good way to keep track of the already-processed versions of a table during a silver -> silver flow, for example? It could be useful for incrementally loading data into a silver table composed of joins over multiple silver tables. Thanks!
@realjackofall · 1 year ago
Did you find a solution for where/how to store the last commit version?
@chinny4953 · 9 months ago
@realjackofall Good question. I'm also trying to figure out how to automatically read the last processed commit version in a pipeline. Hardcoding obviously won't work.
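One way to sidestep the hardcoding (a sketch, not from the video - the table and checkpoint paths are assumed) is to let Structured Streaming track the commit versions for you in its checkpoint, running as an incremental batch:

```python
# The checkpoint directory remembers the last commit version processed,
# so nothing needs to be hardcoded between runs.
(
    spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .table("silver.orders")
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders_cdf")
    .trigger(availableNow=True)  # process everything new, then stop
    .toTable("silver.orders_processed")
)
```

Alternatively, persist the last processed version yourself in a small control table and pass it as startingVersion on the next run.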
@kuldipjoshi1406 · 2 years ago
What about parallel upserts happening over partitioned tables? How do you find the changed data on each partition separately? Since versions will be created in parallel, we won't be able to determine which versions were created for which partition. How do you determine either the timestamp or the version numbers in that case? And what about multiple versions created by the same operation? E.g. upsert + optimize will create at least 2-3 versions in one go.
@AdvancingAnalytics · 2 years ago
Versioning of the Delta table works across partitions - you would normally update multiple partitions as a single batch, rather than running separate parallel jobs for each partition. So you see a single Delta table version that covers all partition updates, and can roll back using that logic. For upsert & optimize, absolutely, you'll see two different versions, one for each batch transaction. OPTIMIZE shouldn't affect any downstream tables, given there is no actual data change in there! Simon
@ascensiuneabm2683 · 1 year ago
What is the difference between Autoloader and CDF?
@pragattiwari5530 · 1 year ago
Hi! My question might be a little dumb, but please allow me to ask. Let's say we have a parquet file with 1,000 records and my table is CDF-enabled. If I update just one record out of those 1,000, the change log directory gets created for that specific record, with a parquet file holding two rows: the current and the previous version of the record. When I then just query my table, how does it know that for that one record it should also look at the other parquet file? And what happens to the old copy of the record, which is still present in the latest parquet file? Please answer ASAP!
@AdvancingAnalytics · 1 year ago
It's been a while since I've dug into it, but from memory the standard Delta parquet files separate inserts and updated records. If you query the table looking for table changes, it'll return records from the "inserted" parquet files and records from the change feed directory. If you query the table normally, it just takes all of the normal parquet. I'll double-check that's how it's working in the week! Simon
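To illustrate what Simon describes (table name and version numbers assumed, not from the video): on Databricks you can query the feed through the table_changes SQL function, and an update surfaces as a pre-image/post-image pair, while a normal read never touches the change directory:

```python
# Rows that changed in commit version 2 (illustrative)
spark.sql("""
    SELECT * FROM table_changes('silver.payments', 2, 2)
""").show()
# An UPDATE yields two rows per changed record:
#   update_preimage  -> the values before the update
#   update_postimage -> the values after the update

# A plain read only scans the current data files,
# never the _change_data directory
spark.table("silver.payments").show()
```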
@pragattiwari5530 · 1 year ago
@AdvancingAnalytics Thanks, that would be really helpful.
@kuldipjoshi1406 · 3 years ago
readStream upserts are going to drive data lake size crazy, I think... maybe I can sort of hijack the microbatch and do a VACUUM there (I see risk in doing that, though). Useful feature anyway. Looking forward to a video continuing the flights dataset example with this feature.
@AdvancingAnalytics · 3 years ago
Yep, there are definitely going to be some considerations if you're always upserting. Splitting it into insert and update transactions will make it more efficient, but that's harder to do in a stream! I've not yet looked at whether we can do anything with OPTIMIZE over the change feed... Doubt it, but worth a look! Simon
@kuldipjoshi1406 · 3 years ago
@AdvancingAnalytics I tried doing it in a stream; it worked for a while and then failed. I've set it to trigger every few minutes and then it works fine. It's always a trade-off, I think.
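For reference, the microbatch pattern being discussed usually looks something like this sketch (table names and the join key are assumed): stream the change feed and MERGE each microbatch into the target inside foreachBatch:

```python
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Keep only the post-change state of each row; in production you'd
    # also dedupe to the latest _commit_version per key
    updates = (
        batch_df.filter("_change_type != 'update_preimage'")
        .drop("_change_type", "_commit_version", "_commit_timestamp")
    )
    target = DeltaTable.forName(spark, "silver.customers_current")
    (
        target.alias("t")
        .merge(updates.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .table("silver.customers")
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/mnt/checkpoints/customers_merge")
    .trigger(processingTime="5 minutes")  # the few-minutes trigger mentioned above
    .start()
)
```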
@bigchallenge7987 · 2 years ago
I didn't understand the difference between time travel and CDF. Could you please explain?
@AdvancingAnalytics · 2 years ago
Time Travel shows you the state of the data at a single point in time. Change Feed shows you which records changed between two points in time. Change Feed requires storing additional data, but it makes propagating changes downstream more efficient!
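A quick sketch of the contrast (table name and version numbers assumed):

```python
# Time travel: the full snapshot of the table as it stood at version 5
snapshot = spark.sql("SELECT * FROM silver.customers VERSION AS OF 5")

# Change feed: only the rows that changed between versions 3 and 5,
# each tagged with _change_type (insert / update_preimage /
# update_postimage / delete)
changed = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 3)
    .option("endingVersion", 5)
    .table("silver.customers")
)
```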
@bigchallenge7987 · 2 years ago
@AdvancingAnalytics Thanks so much, it makes more sense to me now :) We should also not confuse this with change data capture from the source system - as seen in the docs, the change data feed is a simplification over CDC. I'd love to have your insights on this?
@Idea_Center · 2 years ago
Any comments about the impact of a VACUUM run on version history? I believe a VACUUM run updates the version history.