Awesome video, really liked the templating example. As for making this more dynamic: you can dynamically set the DLT table name based on a variable. What you do is create a function, create_bronze_table(mytable: str, df: DataFrame), and inside it do @dlt.table(name=mytable) def create_bronze_table(): return df. Then you can parameterize the config to point at a specific source and loop through all the tables, calling create_bronze_table and passing in the table name and the dataframe. You can take this to the next level and set up Auto Loader to listen to S3 events and pass that to your function to load a whole source of tables. Having a generic pipeline like this can drastically accelerate your source ingestion.
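A minimal sketch of the table-factory pattern described above, for anyone wanting to try it - the table list, paths and file format are placeholder assumptions, not something shown in the video:

import dlt
from pyspark.sql import DataFrame

def create_bronze_table(mytable: str, df: DataFrame):
    # The decorator takes the table name from the variable, not from the function name
    @dlt.table(name=mytable)
    def bronze():
        return df

# Hypothetical source list - in practice this could come from the pipeline config
for source_table in ["Address", "Product"]:
    raw_df = (
        spark.readStream.format("cloudFiles")              # Auto Loader
        .option("cloudFiles.format", "parquet")            # assumed file format
        .load(f"/mnt/raw/adventureworks/{source_table}/")  # placeholder path
    )
    create_bronze_table(f"bronze_{source_table.lower()}", raw_df)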
@NeumsFor9 2 years ago
Took this and wrote to our custom metadata repo (data quality subsection), using the Kimball architecture for data quality as the metadata output and the modernized Marco metamodel for the input settings, including expected file formats and source-to-target mappings. Also integrated the audit keys in the target tables to both our audit dimensions and our data quality subsystem.
@briancuster7355 3 years ago
I tried Delta Live Tables on a project and it worked out pretty well. I didn't use them for my entire ETL, but I did use them to go from silver to other intermediate silver tables and to gold tables. I found it pretty practical and easy to use.
@AdvancingAnalytics 3 years ago
Nice! That's a great use case in my head - using it where we're likely to have business users expressing logic in SQL, but who still want the elements of engineering applied!
@briancuster7355 3 years ago
@@AdvancingAnalytics Yes, we did it expressly for that reason. We have also been trying to automate the whole process and ended up creating a notebook where a user can come in, build a pipeline, and save all of the metadata to a SQL database. Then, when it comes time to build the pipeline in Databricks, we have a method that translates the configuration from the database back out to a Databricks pipeline. It's pretty cool!
@maheshatutube 2 years ago
@@briancuster7355 Hi Brian, do you have any video or documentation around automating the whole process of building the pipeline in Databricks using a metadata approach? Any pointers would be highly appreciated.
@alexischicoine2072 3 years ago
I thought I'd go ahead and use Delta Live Tables for a project even though it's in preview. I didn't have many problems with the SQL endpoint, but this feature really isn't ready, as you warned us in another video. I had used it only for a few simple steps, so it was easy to redo using normal Spark streaming. A few times the pipeline seemed to be corrupt and wouldn't load properly; recreating it from the same notebooks and configuration fixed it. Another problem I had is that sometimes it took almost 10 minutes to start after the cluster initialized, which is just too long for something that took 1-2 minutes to run once started. As I explored it, I found there were too many issues to justify the two main benefits I saw: seeing the graph of my steps, and the data expectations. Looking forward to what they do with it in the future though - it could be amazing if done right.
@polakadi 3 years ago
Interesting feature indeed, thank you, Simon, for creating this video!
@WhyWouldYouDrawThat 3 years ago
All going well, we will absolutely be using this. We are looking at using Live Tables to build a data hub, which will primarily supply data to business apps and secondarily power analytics. Can you please help me out by doing a video on this? From everything I’ve read this is absolutely the best tool for the job. Essentially, the source of data for 95% of enterprise ETL jobs will be live tables. We like the idea that the data we are using is the same data that is being reported, and is also 100% up to date. I’m also interested in publishing changes from delta tables to Azure data hubs for ease of consumption. Very keen to hear your thoughts and comments.
@jimzhang2562 2 years ago
Great videos! Thank you.
@datoalavista581 2 years ago
Brilliant! Thank you for sharing.
@stvv5546 3 years ago
Hey, that was a great example of trying to get certain functionality in a maybe-not-so-standard way. But it also confirms how much stuff they (Databricks) still need to roll out in order to have fully functional generic/dynamic pipelines, right? While watching, I was really hoping that by the end we would see a convenient way to loop over those 'Address' and 'Product' tables without having to go into the pipeline JSON and manually change the pipeline parameter. Hopefully we get something like this in future releases of DLT. When we got to the point that we can't really parameterize the storage, well, I got a bit disappointed. I really hope they provide us with more control over that as well. Thanks Simon for these great insights into the Databricks world! Amazing!
@julsgranados6861 3 years ago
thank u Simon!! just great :)
@ferrerolounge1910 1 year ago
Wondering if there is a similar feature in ADF or Azure Synapse. Well explained as usual!
@mimmakutu 3 years ago
We did this using normal Delta tables with a JSON config and a generic merge, i.e. the insertAll and updateAll APIs. This gets data up to the raw or bronze zone in the data lake. For us this works for ~700 tables.
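A rough sketch of the kind of generic merge being described, using the Delta Lake Python API's whenMatchedUpdateAll/whenNotMatchedInsertAll - the key columns, paths and config shape here are placeholder assumptions:

from delta.tables import DeltaTable
from pyspark.sql import DataFrame

def upsert_to_bronze(df: DataFrame, target_path: str, keys: list):
    # Build the join condition from the business keys supplied in config
    condition = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    target = DeltaTable.forPath(spark, target_path)
    (target.alias("t")
           .merge(df.alias("s"), condition)
           .whenMatchedUpdateAll()      # update every column on match
           .whenNotMatchedInsertAll()   # insert every column otherwise
           .execute())

# Hypothetical JSON config entry driving the loop:
# {"table": "Address", "path": "/mnt/bronze/Address", "keys": ["AddressID"]}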
@AdvancingAnalytics 3 years ago
Yep, absolutely you can achieve this with straight delta and a bit of engineering. The point is that this aims to make doing that super easy for people who are not deep into the data engineering side of things - it can't do everything you could do manually of course, but it's a decent start in making these things more accessible. Simon
@abhradwipmukherjee3697 1 year ago
Excellent video & thanks for the valuable insight. But after creating the raw table from a dataframe, can we read the data from the newly created Delta table into another dataframe and create the silver table from that new dataframe?
@harisriniram 1 year ago
Can we configure the notebook path under the libraries section in the JSON, and the target? We want this to be populated by output from a previous task (notebook task type).
@saivama7816 2 years ago
Awesome, thanks a lot Simon. Only ingestion into Bronze can be generic - is there a way to generalize transformation into the Silver layer?
@AdvancingAnalytics 2 years ago
Depends how you define silver! For us, we perform cleaning, validation and audit stamping when going into Silver. We can provide the required transformations as metadata that is looked up by the process, so the template stays nice and generic - but you then need to build & manage a metadata framework ;)
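To illustrate the metadata-driven idea (this is not the actual Advancing Analytics framework - the rule format, table names and audit column are all assumptions), a generic silver template might look something like this:

import dlt
from pyspark.sql import functions as F

# Hypothetical metadata - in practice this would be looked up from a config store
silver_rules = {
    "address": {
        "source": "bronze_address",
        "transforms": {"City": "trim(City)", "CountryRegion": "upper(CountryRegion)"},
    },
}

def create_silver_table(name: str, cfg: dict):
    @dlt.table(name=f"silver_{name}")
    def silver():
        df = dlt.read_stream(cfg["source"])
        for col_name, expression in cfg["transforms"].items():
            df = df.withColumn(col_name, F.expr(expression))  # cleaning rules from metadata
        return df.withColumn("_audit_loaded_at", F.current_timestamp())  # audit stamp

for name, cfg in silver_rules.items():
    create_silver_table(name, cfg)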
@krishnag5624 3 years ago
Hi Simon, I need your help with Azure Databricks. You're excellent. Thanks, Krish
@AdvancingAnalytics 3 years ago
Thanks
@plamendimitrov9097 3 years ago
Did you try the SQL syntax - building those SQL statements dynamically and executing them with spark.sql('...') in a loop?
@AdvancingAnalytics 3 years ago
Not yet, haven't had a chance to dig further into it - easy to give it a quick try though! Hoping we won't need any workarounds once it matures.
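For reference, the general shape of that loop against plain (non-DLT) Delta tables would be something like the sketch below - whether the same trick works inside DLT's CREATE LIVE TABLE syntax is exactly the open question here, and the table list and paths are placeholders:

# Build and run CREATE TABLE statements dynamically for ordinary Delta tables
for source_table in ["Address", "Product"]:  # hypothetical list
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS bronze_{source_table.lower()}
        USING DELTA
        AS SELECT * FROM parquet.`/mnt/raw/adventureworks/{source_table}/`
    """)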
@joyyoung3288 2 years ago
How do you connect and load AdventureWorks into DBFS? Can you share more information? Thanks
@AdvancingAnalytics 2 years ago
In this case, I ran a quick Data Factory job to scrape the tables from an Azure SQL DB using a copy activity. There isn't a quick way to get it into DBFS - however there are a TON of datasets mounted under the /databricks-datasets/ mount that are just as good!
@NM-xg7kd 3 years ago
It will be interesting to see how Databricks develops this. It currently looks a bit unwieldy, and when you compare it to something like multi-task jobs, which appear organised and easy to follow, it begs the question: where are they going with this, centralised vs decentralised?
@AdvancingAnalytics 3 years ago
Like most things, Databricks tend to be code-first - get it working in a techy, codey way, worry about putting a UI over the top later (if at all). If this is going to be the "citizen engineering" approach they push going forwards, it'll need a bit more polish, for sure! If you look at architecture approaches like data mesh, a lot of it is supported by democratising the tools & making it easier for the domain owners to engineer... which this is certainly heading towards.
@NM-xg7kd 3 years ago
@@AdvancingAnalytics I saw the recent Databricks vid on data mesh and yes, it's an interesting decentralised approach. Re data ownership though, I can't see that flying in most organisations without some serious risk analysis, but like you say, this definitely lends itself to the methodology.
@alexischicoine2072 3 years ago
I like your generic approach. Using the Spark config might get a bit unwieldy once you have a lot of steps in your pipeline. I think you could write this generic functionality in a Python module you import into your notebooks, and have very simple notebooks that just contain the pipeline configuration and call the functions. Otherwise, to support complex pipelines you'll end up having to create a complete pipeline definition language based on text parameters supplied as config. I could see that being worth it if you're trying to integrate this with some other tool the config would come from, but if you're working directly in Databricks as a data engineer, I'm not sure what the advantage would be of defining your pipelines in this format instead of using the language framework DLT provides.
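A minimal sketch of that module-plus-thin-notebook split - the module name, paths and table list are assumptions:

# pipeline_lib.py - shared module installed on the cluster or kept in a Repo
import dlt
from pyspark.sql import SparkSession

def bronze_table(name: str, path: str, fmt: str = "parquet"):
    spark = SparkSession.getActiveSession()  # modules don't get the notebook's spark global
    @dlt.table(name=name)
    def bronze():
        return (
            spark.readStream.format("cloudFiles")  # Auto Loader
            .option("cloudFiles.format", fmt)
            .load(path)
        )

# DLT notebook - only the configuration lives here
from pipeline_lib import bronze_table

for cfg in [
    {"name": "bronze_address", "path": "/mnt/raw/Address"},
    {"name": "bronze_product", "path": "/mnt/raw/Product"},
]:
    bronze_table(cfg["name"], cfg["path"])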
@gamachu2000 3 years ago
You're absolutely right about creating a function. We just deploy two zones, bronze and silver, and we use a function with a for loop to iterate over the config that holds all the information about our tables. Works like a charm. I hope Simon can show that in a follow-up to this video. Delta Live doesn't support merge, so for our gold zone we went back to standard Spark. They're saying the merge feature is coming soon in DLT; once that happens, everything will be DLT. We are also looking into the CDC part of DLT. Great video for the community, Simon. Keep up the great work.
@alexischicoine2072 3 years ago
@@gamachu2000 Ah yes, I had the same issue with merge, and I don't think you can use foreachBatch either. Interesting to know that merge is coming.
@roozbehderakhshan2053 2 years ago
Interesting, just a quick question: can the source of data be a streaming source (i.e. Kinesis or Kafka) instead of cloud storage? So instead of:

CREATE INCREMENTAL LIVE TABLE customers
COMMENT "The customers buying finished products, ingested from /databricks-datasets."
TBLPROPERTIES ("myCompanyPipeline.quality" = "mapping")
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/customers/", "csv");

can we do:

CREATE INCREMENTAL LIVE TABLE customers
COMMENT "The customers buying finished products, ingested from /databricks-datasets."
TBLPROPERTIES ("myCompanyPipeline.quality" = "mapping")
AS SELECT * FROM ******Kinesis;
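For what it's worth, cloud_files() is specific to Auto Loader, so a message-bus source would more naturally go through the Python side of DLT. A rough sketch reading from Kafka, where the broker, topic and table name are placeholders and not from the video:

import dlt
from pyspark.sql.functions import col

@dlt.table(
    name="customers_stream",
    comment="Customers ingested from a Kafka topic instead of cloud storage"
)
def customers_stream():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
        .option("subscribe", "customers")                    # placeholder topic
        .load()
        .select(
            col("key").cast("string"),
            col("value").cast("string"),  # payload still needs parsing downstream
            "timestamp",
        )
    )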