Awesome video, really liked the templating example. As for making this more dynamic: you can dynamically set the DLT table name based on a variable. What you do is create a function, create_bronze_table(mytable: str, df: DataFrame), and inside it do @dlt.table(name=mytable) def create_bronze_table(): return df. Then you can parameterize the config to point at a specific source and loop through all the tables, calling create_bronze_table and passing in the table name and the dataframe. You can take this to the next level and set up Auto Loader to listen to S3 events and pass that to your function to load a whole source of tables. Having a generic pipeline like this can drastically accelerate your source ingestion.
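A minimal sketch of the table-factory pattern described above, for anyone wanting to try it - the table list, paths and file format are placeholder assumptions, not something shown in the video:

import dlt
from pyspark.sql import DataFrame

def create_bronze_table(mytable: str, df: DataFrame):
    # The decorator takes the table name from the variable, not from the function name
    @dlt.table(name=mytable)
    def bronze():
        return df

# Hypothetical source list - in practice this could come from the pipeline config
for source_table in ["Address", "Product"]:
    raw_df = (
        spark.readStream.format("cloudFiles")              # Auto Loader
        .option("cloudFiles.format", "parquet")            # assumed file format
        .load(f"/mnt/raw/adventureworks/{source_table}/")  # placeholder path
    )
    create_bronze_table(f"bronze_{source_table.lower()}", raw_df)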
@NeumsFor9 2 years ago
Took this and wrote to our custom metadata repo (data quality subsection), using the Kimball architecture for data quality as the metadata output and the modernized Marco metamodel for the input settings, including expected file formats and source-to-target mappings. Also integrated the audit keys in the target tables to both our audit dimensions and our data quality subsystem.
@briancuster7355 3 years ago
I tried Delta Live Tables on a project and it worked out pretty well. I didn't use them for my entire ETL, but I did use them to go from silver to other intermediate silver tables and to gold tables. I found it pretty practical and easy to use.
@AdvancingAnalytics 3 years ago
Nice! That's a great use case in my head - using it where we're likely to have business users expressing logic in SQL, but who still want the elements of engineering applied!
@briancuster7355 3 years ago
@@AdvancingAnalytics Yes, we did it expressly for that reason. We have also been trying to automate the whole process and ended up creating a notebook where a user can come in, build a pipeline, and save all of the metadata to a SQL database. Then, when it comes time to build the pipeline in Databricks, we have a method that translates the configuration from the database back out to a Databricks pipeline. It's pretty cool!
@maheshatutube 2 years ago
@@briancuster7355 Hi Brian, do you have any video or documentation around automating the whole process of building the pipeline in Databricks using a metadata approach? Any pointers would be highly appreciated.
@alexischicoine2072 3 years ago
I thought I'd go ahead and use Delta Live Tables for a project even though it's in preview. I didn't have many problems with the SQL endpoint, but this feature really isn't ready, as you warned us in another video. I had used it only for a few simple steps, so it was easy to redo using normal Spark streaming. A few times the pipeline seemed to be corrupt and wouldn't load properly; recreating it from the same notebooks and configuration fixed it. Another problem I had is that sometimes it took almost 10 minutes to start after the cluster initialized, which is just too long for something that took 1-2 minutes to run once started. As I explored it, I found there were too many issues to justify the two main benefits I saw: seeing the graph of my steps, and the data expectations. Looking forward to what they do with it in the future though - it could be amazing if done right.
@polakadi 3 years ago
Interesting feature indeed, thank you, Simon, for creating this video!
@WhyWouldYouDrawThat 3 years ago
All going well, we will absolutely be using this. We are looking at using Live Tables to build a data hub, which will primarily supply data to business apps and secondarily power analytics. Can you please help me out by doing a video on this? From everything I’ve read this is absolutely the best tool for the job. Essentially, the source of data for 95% of enterprise ETL jobs will be live tables. We like the idea that the data we are using is the same data that is being reported, and is also 100% up to date. I’m also interested in publishing changes from delta tables to Azure data hubs for ease of consumption. Very keen to hear your thoughts and comments.
@jimzhang2562 2 years ago
Great videos! Thank you.
@datoalavista581 2 years ago
Brilliant! Thank you for sharing.
@stvv5546 3 years ago
Hey, that was a great example of trying to get certain functionality in a maybe-not-so-standard way. But it also confirms how much stuff they (Databricks) still need to roll out in order to have fully functional generic/dynamic pipelines, right? While watching, I was really hoping that by the end we would see a convenient way to loop over those 'Address' and 'Product' tables without having to go into the pipeline JSON and manually change the pipeline parameter. Hopefully we get something like this in future releases of DLT. When we got to the point that we can't really parameterize the storage, well, I got a bit disappointed. I really hope they provide us with more control over that as well. Thanks Simon for these great insights into the Databricks world! Amazing!
@julsgranados6861 3 years ago
thank u Simon!! just great :)
@ferrerolounge1910 1 year ago
Wondering if there is a similar feature in ADF or Azure Synapse. Well explained as usual!
@mimmakutu 3 years ago
We did this using normal Delta tables with a JSON config and a generic merge, i.e. the insertAll and updateAll APIs. This gets data up to the raw or bronze zone in the data lake. For us this works for ~700 tables.
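A rough sketch of the kind of generic merge being described, using the Delta Lake Python API's whenMatchedUpdateAll/whenNotMatchedInsertAll - the key columns, paths and config shape here are placeholder assumptions:

from delta.tables import DeltaTable
from pyspark.sql import DataFrame

def upsert_to_bronze(df: DataFrame, target_path: str, keys: list):
    # Build the join condition from the business keys supplied in config
    condition = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    target = DeltaTable.forPath(spark, target_path)
    (target.alias("t")
           .merge(df.alias("s"), condition)
           .whenMatchedUpdateAll()      # update every column on match
           .whenNotMatchedInsertAll()   # insert every column otherwise
           .execute())

# Hypothetical JSON config entry driving the loop:
# {"table": "Address", "path": "/mnt/bronze/Address", "keys": ["AddressID"]}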
@AdvancingAnalytics 3 years ago
Yep, absolutely you can achieve this with straight delta and a bit of engineering. The point is that this aims to make doing that super easy for people who are not deep into the data engineering side of things - it can't do everything you could do manually of course, but it's a decent start in making these things more accessible. Simon
@abhradwipmukherjee3697 1 year ago
Excellent video & thanks for the valuable insight. But after creating the raw table from a dataframe, can we read the data from the newly created Delta table into another dataframe and create the silver table from that new dataframe?
@harisriniram 1 year ago
Can we configure the notebook path under the libraries section in the JSON, and the target? We want this to be populated by output from a previous task (notebook task type).
@saivama7816 2 years ago
Awesome, thanks a lot Simon. Only ingestion into Bronze can be generic - is there a way to generalize transformation into the Silver layer?
@AdvancingAnalytics 2 years ago
Depends how you define silver! For us, we perform cleaning, validation and audit stamping when going into Silver. We can provide the required transformations as metadata that is looked up by the process, so the template stays nice and generic - but you then need to build & manage a metadata framework ;)
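To illustrate the metadata-driven idea (this is not the actual Advancing Analytics framework - the rule format, table names and audit column are all assumptions), a generic silver template might look something like this:

import dlt
from pyspark.sql import functions as F

# Hypothetical metadata - in practice this would be looked up from a config store
silver_rules = {
    "address": {
        "source": "bronze_address",
        "transforms": {"City": "trim(City)", "CountryRegion": "upper(CountryRegion)"},
    },
}

def create_silver_table(name: str, cfg: dict):
    @dlt.table(name=f"silver_{name}")
    def silver():
        df = dlt.read_stream(cfg["source"])
        for col_name, expression in cfg["transforms"].items():
            df = df.withColumn(col_name, F.expr(expression))  # cleaning rules from metadata
        return df.withColumn("_audit_loaded_at", F.current_timestamp())  # audit stamp

for name, cfg in silver_rules.items():
    create_silver_table(name, cfg)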
@krishnag5624 3 years ago
Hi Simon, I need your help with Azure Databricks. You're excellent. Thanks, Krish
@AdvancingAnalytics 3 years ago
Thanks
@plamendimitrov9097 3 years ago
Did you try the SQL syntax - building those SQL statements dynamically and executing them with spark.sql('...') in a loop?
@AdvancingAnalytics 3 years ago
Not yet, haven't had a chance to dig further into it - easy to give it a quick try though! Hoping we won't need any workarounds once it matures.
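For reference, the general shape of that loop against plain (non-DLT) Delta tables would be something like the sketch below - whether the same trick works inside DLT's CREATE LIVE TABLE syntax is exactly the open question here, and the table list and paths are placeholders:

# Build and run CREATE TABLE statements dynamically for ordinary Delta tables
for source_table in ["Address", "Product"]:  # hypothetical list
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS bronze_{source_table.lower()}
        USING DELTA
        AS SELECT * FROM parquet.`/mnt/raw/adventureworks/{source_table}/`
    """)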
@joyyoung3288 2 years ago
How do you connect and load AdventureWorks into DBFS? Can you share more information? Thanks
@AdvancingAnalytics 2 years ago
In this case, I ran a quick Data Factory job to scrape the tables from an Azure SQL DB using a copy activity. There isn't a quick way to get it into DBFS - however there are a TON of datasets mounted under the /databricks-datasets/ mount that are just as good!
@NM-xg7kd 3 years ago
It will be interesting to see how Databricks develops this. It currently looks a bit unwieldy, and when you compare it to something like multi-task jobs, which appear organised and easy to follow, it begs the question: where are they going with this, centralised vs decentralised?
@AdvancingAnalytics 3 years ago
Like most things, Databricks tend to be code-first - get it working in a techy, codey way, worry about putting a UI over the top later (if at all). If this is going to be the "citizen engineering" approach they push going forwards, it'll need a bit more polish, for sure! If you look at architecture approaches like data mesh, a lot of it is supported by democratising the tools & making it easier for the domain owners to engineer... which this is certainly heading towards.
@NM-xg7kd 3 years ago
@@AdvancingAnalytics I saw the recent Databricks vid on data mesh and yes, it's an interesting decentralised approach. Re data ownership though, I can't see that flying in most organisations without some serious risk analysis, but like you say, this definitely lends itself to the methodology.
@alexischicoine2072 3 years ago
I like your generic approach. Using the Spark config might get a bit unwieldy once you have a lot of steps in your pipeline. I think you could write this generic functionality in a Python module you import into your notebooks, and have very simple notebooks that just contain the pipeline configuration and call the functions. Otherwise, to support complex pipelines you'll end up having to create a complete pipeline definition language based on text parameters supplied as config. I could see that being worth it if you're trying to integrate this with some other tool the config would come from, but if you're working directly in Databricks as a data engineer, I'm not sure what the advantage would be of defining your pipelines in this format instead of using the language framework DLT provides.
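A minimal sketch of that module-plus-thin-notebook split - the module name, paths and table list are assumptions:

# pipeline_lib.py - shared module installed on the cluster or kept in a Repo
import dlt
from pyspark.sql import SparkSession

def bronze_table(name: str, path: str, fmt: str = "parquet"):
    spark = SparkSession.getActiveSession()  # modules don't get the notebook's spark global
    @dlt.table(name=name)
    def bronze():
        return (
            spark.readStream.format("cloudFiles")  # Auto Loader
            .option("cloudFiles.format", fmt)
            .load(path)
        )

# DLT notebook - only the configuration lives here
from pipeline_lib import bronze_table

for cfg in [
    {"name": "bronze_address", "path": "/mnt/raw/Address"},
    {"name": "bronze_product", "path": "/mnt/raw/Product"},
]:
    bronze_table(cfg["name"], cfg["path"])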
@gamachu2000 3 years ago
You're absolutely right about creating a function. We just deploy two zones, bronze and silver, and we use a function with a for loop to iterate over the config that holds all the information about our tables. Works like a charm. I hope Simon can show that in a follow-up to this video. Delta Live doesn't support merge, so for our gold zone we went back to standard Spark. They're saying the merge feature is coming soon in DLT; once that happens, everything will be DLT. We are also looking into the CDC part of DLT. Great video for the community, Simon. Keep up the great work.
@alexischicoine2072 3 years ago
@@gamachu2000 Ah yes, I had the same issue with merge, and I don't think you can use foreachBatch either. Interesting to know that merge is coming.
@roozbehderakhshan2053 2 years ago
Interesting, just a quick question: can the source of data be a streaming source (i.e. Kinesis or Kafka) instead of cloud storage? So instead of:

CREATE INCREMENTAL LIVE TABLE customers
COMMENT "The customers buying finished products, ingested from /databricks-datasets."
TBLPROPERTIES ("myCompanyPipeline.quality" = "mapping")
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/customers/", "csv");

can we do:

CREATE INCREMENTAL LIVE TABLE customers
COMMENT "The customers buying finished products, ingested from /databricks-datasets."
TBLPROPERTIES ("myCompanyPipeline.quality" = "mapping")
AS SELECT * FROM ******Kinesis;
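For what it's worth, cloud_files() is specific to Auto Loader, so a message-bus source would more naturally go through the Python side of DLT. A rough sketch reading from Kafka, where the broker, topic and table name are placeholders and not from the video:

import dlt
from pyspark.sql.functions import col

@dlt.table(
    name="customers_stream",
    comment="Customers ingested from a Kafka topic instead of cloud storage"
)
def customers_stream():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
        .option("subscribe", "customers")                    # placeholder topic
        .load()
        .select(
            col("key").cast("string"),
            col("value").cast("string"),  # payload still needs parsing downstream
            "timestamp",
        )
    )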