Advancing Spark - Databricks Delta Live Tables First Look

41,509 views

Advancing Analytics

3 years ago

Ever since the initial Spark Summit talks about "engineering pipelines", we've been super excited to see where Databricks would go with automated engineering. Earlier this year we saw Delta Live Tables announced... but what do they actually do?
In this first-look video, Simon digs into the DLT Quickstart, picking apart what the code is actually doing, highlighting a few misconceptions and getting you started with your first Delta Live Table Pipeline!
As a reminder, at this time it is still a public preview feature, so you might not have access just yet, but you can still explore the docs and read up about how it might help you here: docs.microsoft.com/en-us/azur...
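For a flavour of what the quickstart walks through, here's a minimal sketch of a two-table DLT pipeline in Python (paths, table names, and columns are illustrative, not the exact quickstart source):

```python
import dlt  # only resolvable inside a Delta Live Tables pipeline run
from pyspark.sql import functions as F

# Bronze: incrementally ingest raw files with Auto Loader ("cloudFiles").
@dlt.table(comment="Raw clickstream events landed from cloud storage")
def clickstream_raw():
    return (
        spark.readStream.format("cloudFiles")  # `spark` is provided by the DLT runtime
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/clickstream/")     # hypothetical landing path
    )

# Silver: read the table above by name, apply a data quality expectation, clean it up.
@dlt.table(comment="Cleaned clickstream events")
@dlt.expect_or_drop("valid_user", "user_id IS NOT NULL")
def clickstream_clean():
    return dlt.read_stream("clickstream_raw").withColumn("ingest_date", F.current_date())
```

The decorators are what build the dependency graph: DLT resolves dlt.read_stream("clickstream_raw") to the function above rather than to a pre-existing table.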
As always - don't forget to Like & Subscribe, and let us know what you think!

Comments: 55
@Sriramiyer1992 · 1 year ago
The way the code was explained was outstanding!
@MegaSb360 · 1 year ago
Wow!!! Thank you so much. For the last couple of months I have been struggling to understand DLT. I wish I had known sooner that a ~30-minute video would do the trick.
@ashrafrcet · 3 years ago
Thank you Simon for covering DLT in a few minutes... very helpful as always.
@Simondoubt4446 · 2 years ago
Love these videos. Thank you Simon!
@danielperico2806 · 3 years ago
Wow, that's amazing. Thank you Simon!
@user-ui1oh5zf6t · 4 months ago
Very good!! Explained perfectly!
@NeumsFor9 · 1 year ago
For anyone coming from visual ETL: think check constraints, plus the SSIS error path captured in metadata when check constraints (or other constraints) are violated, plus data quality output process metadata, plus the ability to define your own hardware... minus the overhead of the RDBMS transaction log.
@amateurvisser · 2 years ago
Great explanation. Thanks. Wondering how to do incremental loading, reprocessing, watermarks and all that good stuff.
@JianZhouVA · 3 years ago
Oh crap. I wrote my own Delta Live Table-like implementation (not as fancy, of course) early this year. Now I need to make a choice... Need to read the docs and get on a call with the Databricks folks. Got a lot of questions. Thanks for the video!
@AdvancingAnalytics · 3 years ago
I think there are a lot of people in the same place! Build and maintain your own framework with all the flexibility, or take the out-of-the-box one for simplicity. It will be interesting to see, as it matures, how feasible the latter is!
@bobj8690 · 2 years ago
Haha, me too! I have also been working on my own implementation that aims to populate a pipeline of Delta Lake tables. My biggest challenge is figuring out which part of a downstream table needs to be updated because the corresponding part of the upstream table was updated. Somehow I think the 'inode' concept from OS file systems might help... Would be interesting to see DLT's approach!
@deenquotes786 · 2 years ago
good work Man :)
@suresh.suthar.24 · 1 year ago
best explanation
@biancairis93 · 2 years ago
This looks really cool indeed. The expectation checks are neat; I wonder what else they will introduce to make DQ/testing of the pipelines easier.
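For context, the three expectation flavours in the Python API look roughly like this (the table, rule names, and conditions are illustrative):

```python
import dlt

@dlt.table
@dlt.expect("has_timestamp", "event_time IS NOT NULL")   # warn: record violations, keep the rows
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")        # drop: discard violating rows
@dlt.expect_or_fail("positive_amount", "amount > 0")     # fail: abort the update on violation
def orders_clean():
    return dlt.read("orders_raw")  # hypothetical upstream live table
```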
@dmitryanoshin8004 · 3 years ago
Nice work! Not sure where to use it now, but looks cool!
@krishnakoirala2088 · 1 year ago
Thanks as always for the awesome videos you create. A question: how can we make the three tables that get created appear in the Data tab under a schema (a.k.a. database)? And what happens to the three data folders created in storage if we don't specify the location/path while configuring? Where do they go?
@vikashmishra2759 · 2 years ago
Hi Simon, that was a really nice video and I love it; it cleared up all my doubts. Do you have a video on merging Delta tables using Z-ordering and multilevel partitioning to optimize incremental loads? If yes, please share the link.
@lackshubalasubramaniam7311 · 2 years ago
Delta Live Tables is an odd name. However, the workflow concept is really cool. I've been playing with it and like the expectations bit. The dependencies appearing as a diagram is cool too, somewhat of a lineage concept. I prefer Python over SQL; I find the SQL bit limiting. It probably works for the Data Analyst role, as you mentioned.
@devanshsharma7929 · 2 years ago
Hi, thanks for the awesome video. I would like to know whether DLT can read data from Kafka. At our company we want to read data from Kafka, transform it, and then load it into Cosmos DB. Is this possible using DLT?
@gardnmi · 2 years ago
It almost seems like they asked a Scala developer to write some Python code and he took some creative liberties. That is the craziest-looking code I've ever seen from a professional company trying to sell a product.
@nithishreddy752 · 1 year ago
Can we implement CDC capture and column name changes or transformations in a single layer with DLT?
@sankhachakraborty5801 · 2 years ago
Thanks for the video, Simon. I enjoyed watching it as always. A quick question: can we execute Delta Live Tables pipelines from orchestrators such as Data Factory or Apache Airflow?
@alexischicoine2072 · 2 years ago
I'm not Simon, but what I've been doing is creating a job that runs the pipeline and then starting the job using the REST API. If you need to wait for the job to finish, you can write a while loop that polls the job's status via the API. You can make the calls from your tool, or run them in a notebook on a tiny machine if that's easier.
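A sketch of that trigger-and-poll pattern against the Databricks Jobs API (workspace URL, token, and job ID are placeholders; the paths assume the 2.1 Jobs API):

```python
import time
import requests

HOST = "https://adb-1234567890.azuredatabricks.net"            # placeholder workspace URL
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # placeholder token

# Kick off the job that wraps the DLT pipeline
run = requests.post(f"{HOST}/api/2.1/jobs/run-now",
                    headers=HEADERS, json={"job_id": 123}).json()

# Poll until the run reaches a terminal state
while True:
    state = requests.get(f"{HOST}/api/2.1/jobs/runs/get",
                         headers=HEADERS,
                         params={"run_id": run["run_id"]}).json()["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        break
    time.sleep(30)

print(state.get("result_state"))  # e.g. SUCCESS or FAILED
```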
@AVGMachine · 2 years ago
Great video! Databricks keeps growing the number of services it provides over time. However, it's getting a bit confusing, since we now seem to have services competing within the same ecosystem. When would you recommend using ADF instead of DLT?
@limitlesslife7536 · 2 years ago
When you have data processing steps that are not encapsulated in the Databricks environment, use ADF. If all your ETL steps are in Databricks, use DLT.
@alexischicoine2072 · 2 years ago
Does anyone know how to start a full refresh without clicking the button in the UI? I've only been able to set up a regular refresh, not a full one. I have a use case where I run a streaming query, but periodically I save the output and restart with a new streaming input so it doesn't grow too large for the steps that run over the whole dataset and aren't streamed. Right now I send myself an email reminder to go and do it, which isn't ideal; otherwise the refresh throws an error, as expected, after the streaming input is modified.
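Not something covered in the video, but one avenue worth checking: the Pipelines REST API exposes an updates endpoint that accepts a full_refresh flag, so a scheduled script could trigger the full refresh instead of the UI button. A sketch, with the endpoint path and flag assumed from the 2.0 Pipelines API (verify against the current docs):

```python
import requests

HOST = "https://adb-1234567890.azuredatabricks.net"            # placeholder workspace URL
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # placeholder token

# Start a full-refresh update of the pipeline (pipeline ID is a placeholder)
resp = requests.post(f"{HOST}/api/2.0/pipelines/<pipeline-id>/updates",
                     headers=HEADERS, json={"full_refresh": True})
print(resp.json())  # should contain the new update_id
```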
@pranesh1213 · 1 year ago
Can we do model scoring within the Delta Live Table definition script? For example, pick a model from the registry, load it as a UDF, and apply it live?
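The video doesn't confirm whether this works inside a DLT definition, but the registry-model-as-UDF pattern the question describes usually looks like this with MLflow (model URI, upstream table, and feature columns are placeholders):

```python
import dlt
import mlflow.pyfunc

@dlt.table
def scored_events():
    # Load a registered model from the MLflow registry as a Spark UDF
    predict = mlflow.pyfunc.spark_udf(spark, "models:/my_model/Production")  # placeholder URI
    df = dlt.read("clickstream_clean")          # hypothetical upstream live table
    feature_cols = ["feature_1", "feature_2"]   # placeholder feature columns
    return df.withColumn("prediction", predict(*feature_cols))
```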
@kuldipjoshi1406 · 2 years ago
Is it an incremental load or a full load? What does the behind-the-scenes write statement look like? Can we partition and bucket while writing?
@jespermartinsson8331 · 1 year ago
How do you actually point the DLT pipeline's storage location at Azure Data Lake without mounting it to DBFS?
@ezequielchurches5916 · 2 months ago
Does clickstream_raw map to the bronze layer, and clickstream_cleaned to the silver layer? How could I map each Delta table to the medallion layers?
@neelbanerjee7875 · 1 year ago
Thanks for this awesome video. A quick question, however: using Python we can apply multiple additional capabilities to a dataframe inside the Delta Live Table function, like custom functions, UDFs, and multi-step programming (as you showed), but I don't think we can do all of that using SQL in a live table. Could you please correct me if I'm wrong?
@AdvancingAnalytics · 1 year ago
Nope, you don't have the same iterative power as you would in PySpark, but you can certainly achieve a lot. I've not tested whether Databricks SQL functions work inside DLT, but if they do, that covers most of the functionality you list!
@morrolan · 1 year ago
I always develop and test my notebook code locally, and only deploy to Databricks as a final step. With DLT, I feel the costs will skyrocket with those clusters needing to be running, and it is also very slow. I am really hesitant to use it in its current state.
@tiagorente2860 · 2 years ago
We write our notebooks in Scala, but from your video the supported languages are Python and SQL. Do you know if Scala will become a supported language in DLT?
@AdvancingAnalytics · 2 years ago
Honestly don't know! Some of the more abstracted Databricks elements (table access control, AAD passthrough, etc.) are Python/SQL only, so it may be a similar limitation? No idea what the future plan is inside Databricks! Simon
@advanceddataengineering3784 · 2 years ago
Is it possible to use some other source, like a JDBC database or Azure Event Hubs, instead of cloud files? BTW, I watch your videos regularly. Great work!! Thanks.
@AdvancingAnalytics · 2 years ago
Yep, anything that has a Spark dataframe reader! I'm sure there's a little bit of nuance with the weirder ones like Event Hubs, but it's just spinning up a Spark job, so it should be doable with most things Spark can read! Simon
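To illustrate the point, a DLT table wrapping a plain JDBC DataFrame reader might look like this (connection details are placeholders):

```python
import dlt

@dlt.table(comment="Customers snapshot pulled over JDBC")
def customers_jdbc():
    # Any Spark DataFrame reader can back a live table, not just cloudFiles
    return (
        spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://myserver:1433;database=sales")  # placeholder URL
        .option("dbtable", "dbo.customers")
        .option("user", "<user>")
        .option("password", "<password>")
        .load()
    )
```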
@alexischicoine2072 · 3 years ago
Love your channel. Regarding the import dlt, I found it annoying as well. I think it might be possible to hack together a fake import dlt so that the function definitions at least run and provide some autocomplete with docstrings; I'm going to look into it if I can grab some source code at runtime. Have you had a chance to look at the new multi-task jobs/orchestration? For example, I have a use case where I run a merge from a parquet source into a Delta table that I then use as the first source for streaming in my Delta Live Tables pipeline. With this new feature they can be run one after the other in the same job. Keep up the good work; your videos are really clear and you have an engaging presentation.
@AdvancingAnalytics · 3 years ago
I'm holding off on mocking up the dlt library for now - I'm hoping it'll be baked into future DBX runtimes (at least for autocomplete etc., as you say), and that it's just the preview nature that means it uses a custom runtime... but we'll see what it looks like as it moves towards general availability! Haven't looked at multi-task jobs yet - I checked yesterday and my workspace isn't enabled yet. I'll have a look early next week and put together a quick vid! Simon
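For anyone who does want to try the stub idea in the meantime, a minimal no-op shim might look like this; it only keeps notebooks importable and parseable outside a pipeline, and the real dlt decorators are richer than this:

```python
# dlt.py -- a local stand-in so "import dlt" resolves outside a pipeline (sketch only)

def _decorator(*dargs, **dkwargs):
    # Support both bare (@dlt.table) and parameterised (@dlt.table(name=...)) usage
    if len(dargs) == 1 and callable(dargs[0]) and not dkwargs:
        return dargs[0]
    def wrap(fn):
        return fn
    return wrap

table = view = expect = expect_or_drop = expect_or_fail = _decorator

def read(name):
    raise NotImplementedError("dlt.read only works inside a real DLT pipeline")

def read_stream(name):
    raise NotImplementedError("dlt.read_stream only works inside a real DLT pipeline")
```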
@joyyoung3288 · 1 year ago
Please can you help with the storage location? I've bumped into some problems.
@denisgodunov6157 · 5 months ago
The transformations are very similar to what we have in dbt.
@thomasadams6860 · 3 years ago
Thoughts on dbt compared to this? Seems very similar.
@AdvancingAnalytics · 3 years ago
Yeah, it seems to be aiming at a similar space but has a lot less polish so far. Honestly, I've only dabbled with dbt so can't comment much further!
@dmitryanoshin8004 · 3 years ago
Does it replace Azure Data Factory to some extent?
@AdvancingAnalytics · 3 years ago
Potentially - it certainly covers some of the orchestration elements, but it isn't as good at other workflow elements, copying data into the platform, etc.!
@prashantthakur4324 · 2 years ago
Would it be possible to use Delta Live Tables for temporary jobs, like giving a user decrypted data, where a decryption job runs and, once the use is over, we delete the data from the workspace?
@AdvancingAnalytics · 2 years ago
You certainly could, though it seems like a lot of setup if it's a throwaway bit of data. Probably easier to manage that via a custom notebook?
@prashantthakur4324 · 2 years ago
@AdvancingAnalytics We would like to expose this to other clients like Tableau. Within Databricks a custom notebook is fine, but for external clients we wanted to use this option.
@PersonOfBook · 2 years ago
How is this different from a SQL view? Also, can I do upserts and deletes on a Delta table using this?
@AdvancingAnalytics · 2 years ago
The view objects literally are just SQL views; the only difference is the extra wrapping that lets you materialise the data back to physical Delta tables. Updates and deletes aren't currently supported, but I'm hoping we'll see them in there eventually!
@AdvancingAnalytics · 2 years ago
Speak of the devil - just announced: databricks.com/blog/2022/02/10/databricks-delta-live-tables-announces-support-for-simplified-change-data-capture.html
@PersonOfBook · 2 years ago
@@AdvancingAnalytics That is amazing. Thanks for sharing the link :)
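From that announcement, the Python side of the CDC support looks roughly like this (source and column names are illustrative, and the helper for creating the target table has been renamed across releases, so treat the exact signatures as assumptions to check against the linked post):

```python
import dlt
from pyspark.sql.functions import col

# Target table for the change feed (later runtimes renamed this helper)
dlt.create_streaming_live_table("customers")

dlt.apply_changes(
    target="customers",
    source="customers_cdc_feed",            # hypothetical source of change records
    keys=["customer_id"],                   # primary key used to match rows
    sequence_by=col("change_timestamp"),    # ordering column for out-of-order events
    apply_as_deletes=col("operation") == "DELETE",
)
```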
@GuillaumeBerthier · 3 years ago
Thanks Simon for this new video, I love your channel! I'm not very familiar with Databricks, but I just did some practice on Azure Synapse (based on your video kzbin.info/www/bejne/oqGlfmePo5eeabc), and after watching this new video on Delta Live Tables I was wondering if the same "outcome" couldn't be achieved with a Synapse (scheduled) pipeline and Data Flows (with a Delta table sink)... Or did I completely miss the point (which is entirely possible :p)?
@AdvancingAnalytics · 3 years ago
Hey! Yep, you could build a pipeline loading a delta table using Synapse pipelines & data flows that would achieve the same loading for the quick batch example - the deeper point of DLT is around incremental loading, the data quality elements and (hopefully) building some reusable transformation functions, which you wouldn't be able to do to the same level in a Synapse data flow! Simon
@NeumsFor9 · 1 year ago
Expectations... assert transforms... check constraints... same stuff, different day.