Data Lake in a Day 2 - Lab 1 Data Ingest

926 views

Dave Does Demos

1 day ago

In part one of the Data Lake in a Day series I showed you how to set up the infrastructure for the whole workshop with a simple one-click deployment using an ARM template. If you've not already done so, go and run through that video at • Data Lake in a Day 1 -... .
In part two of the series we'll go through Lab 1, which is designed to show how easy it is to ingest data from a database into the data lake. Here we connect to a SQL Server inside the corporate network using an integration runtime as a proxy. We then copy three tables using a tumbling window trigger to take all data over the last year or so, dumping the data into one CSV file per day, ready for processing on the lake. This demonstrates an ELT process, where the load stage lands data on the lake ready to be processed by a massively parallel cluster solution such as Hadoop, Databricks, or Azure Synapse.
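The daily tumbling window trigger described above might look something like the following ADF trigger definition. This is a hedged sketch: the trigger name, pipeline name, start time, and parameter names are placeholders, not the ones used in the lab.

```json
{
  "name": "DailyIngestTrigger",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Hour",
      "interval": 24,
      "startTime": "2021-01-01T00:00:00Z",
      "maxConcurrency": 10
    },
    "pipeline": {
      "pipelineReference": {
        "referenceName": "CopyOrdersToLake",
        "type": "PipelineReference"
      },
      "parameters": {
        "windowStart": "@trigger().outputs.windowStartTime",
        "windowEnd": "@trigger().outputs.windowEndTime"
      }
    }
  }
}
```

The window start and end times flow into the pipeline as parameters, which the copy activity can use to filter the source query and to name the per-day output folder.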
0:00 - Introduction to the lab
2:34 - Setting up the containers in a storage account
3:12 - Install and configure the Integration Runtime
7:01 - Set up linked services in Azure Data Factory (ADF)
8:45 - Configure CSV datasets
11:17 - Configure SQL Server datasets
12:37 - Set up the copy activity and pipeline
19:01 - Set up the Tumbling Window trigger
21:08 - Results
22:49 - Recap on the architecture
23:34 - Wrap up
You can find the lab content for this video at github.com/davedoesdemos/Data...
You can find the whole one day workshop at github.com/davedoesdemos/Data... including all lab materials, data and instructions.
If you're new to data lakes please ask any questions you have below. Also please comment if you found this workshop series useful or if you'd like to see more of this kind of content.
For all of my other demos, go to davedoesdemos.com or go straight to the GitHub page at github.com/davedoesdemos/Demo.... Also please subscribe to the channel to make sure the latest demos show up in your playlist!

Comments: 9
@samuelrocha9079 3 years ago
Amazing, Dave! You rock
@rebeccaperkins7504 2 years ago
Go Dave
@2manjunath 3 years ago
Is traffic from the on-premises IR to Azure Data Factory routed over the internet or over a private connection?
@DaveDoesDemos 3 years ago
Hi Manjunath, thanks for the question. The connection is private since it uses TLS encryption and certificates in a similar way to a VPN. The traffic is routed to a public endpoint in Azure, although this routing can be made to traverse an ExpressRoute connection (similar to an MPLS link) if you so desire. In all of these configurations the traffic is completely private and secure. When using a self-hosted IR the connection is outgoing from your data centre on port 443, with no open incoming ports.
@rankena 2 years ago
Hi, I assume this works only if orders are not updated and no orders are created where [date] is already in the past (today is 2022.05.17, but an order is created for 2022.05.10). In such cases you will receive neither the new records nor any updates. Any suggestions on videos or links on how to manage the directory structure when data is updated and created "back in time"?
@DaveDoesDemos 2 years ago
Hi, what you're talking about is known as "restatement" in retail, and yes, this is designed for that scenario. We use tumbling windows precisely because we can re-run a given day/hour/month when a restatement is issued. You simply need to build the downstream pipelines in such a way as to allow that too. Essentially your pipelines should be able to rebuild the whole dataset from raw data at any time, or rebuild any part of it without affecting the whole. A lot of data folk go down a different path, using the tools as a scheduler and trying to do the processing manually - this ends up massively overcomplicated and time-consuming, as well as harder to change. Hope that helps.
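The idempotent re-run idea can be sketched in plain Python rather than ADF: each window writes to its own date-keyed folder, so re-running a restated day simply replaces that day's file. Paths and column names here are illustrative, not the lab's.

```python
from pathlib import Path
import csv

def write_daily_partition(base_dir, day, rows):
    """Write one CSV per day. Re-running the same window overwrites
    the same file, so a restated day replaces the old extract cleanly."""
    out_dir = Path(base_dir) / day  # e.g. raw/orders/2022-05-10/
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "orders.csv"
    with out_file.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["OrderId", "OrderDate", "Amount"])
        writer.writerows(rows)
    return out_file

# Initial load for 2022-05-10, then a restatement re-run for the same day:
p = write_daily_partition("raw/orders", "2022-05-10",
                          [[1, "2022-05-10", 9.99]])
p = write_daily_partition("raw/orders", "2022-05-10",
                          [[1, "2022-05-10", 9.99], [2, "2022-05-10", 4.50]])
```

Because the second run overwrites rather than appends, downstream pipelines can always rebuild from the lake without worrying about duplicate extracts for the same window.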
@rankena 2 years ago
@DaveDoesDemos Maybe retail is a bad example, but let's say you have a table that is constantly updated and you can rely only on a "LastUpdated" column, which indicates when the row was updated or created. Now if you run a tumbling window trigger on the LastUpdated column, that forces you to create files and a directory structure based on that column, which leads to a bad structure because nobody queries by "LastUpdated"; they query by "OrderDate", "PostingDate", etc. What could be done, I guess, is to reload every single day (for example by "OrderDate") which has at least one updated record. The question is: should I overwrite the existing files for those days in blob storage, or add them as new files, thus keeping some kind of history but also introducing duplicates?
@DaveDoesDemos 2 years ago
@rankena In that case even better. The LastUpdated column allows you to process only the changes within the period of the tumbling window, so you get a changed-data feed and update your model with the new data. The analytics solution will then contain the up-to-date version of the data. This can also work with a Data Vault approach, where you capture all of the new data and use SCD to enable building snapshots of any given point in time.
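The SCD idea mentioned above can be illustrated with a toy Type 2 sketch (field names are illustrative): each change closes the current row and appends a new version, so a snapshot as of any date can be rebuilt from history.

```python
def scd2_apply(history, incoming, key="OrderId", ts="LastUpdated"):
    """SCD Type 2: close the current row for a key and append the new
    version, keeping every historical state for point-in-time snapshots."""
    for row in history:
        if row[key] == incoming[key] and row["current"]:
            row["current"] = False
            row["valid_to"] = incoming[ts]
    history.append({**incoming, "valid_from": incoming[ts],
                    "valid_to": None, "current": True})
    return history

# An order arrives, then is restated a week later:
hist = scd2_apply([], {"OrderId": 1, "Amount": 9.99,
                       "LastUpdated": "2022-05-10"})
hist = scd2_apply(hist, {"OrderId": 1, "Amount": 4.50,
                         "LastUpdated": "2022-05-17"})
```

After the second change, both versions survive: the old row is closed with a valid_to date and the new row is current, which is what lets a Data Vault style model serve "as of" queries.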
@DaveDoesDemos 2 years ago
@rankena Also worth mentioning that you don't have to use it this way; you can use the other types of trigger. I chose to show this one because most traditional data people won't know the tumbling window approach, which this lab is designed around.
Data Lake in a Day 1 - Setting up the infrastructure
8:07
Dave Does Demos
1.1K views
Data Engineers need this lineage trick!
16:09
Dave Does Demos
525 views
Master Data Management (MDM) - Chalk and Talk
19:01
Dave Does Demos
20K views
Google Data Center 360° Tour
8:29
Google Cloud Tech
5M views
Database vs Data Warehouse vs Data Lake | What is the Difference?
5:22
Alex The Analyst
707K views
GDPR Scanning for Personally Identifiable Information
13:47
Dave Does Demos
752 views
Introduction to Data Mesh with Zhamak Dehghani
1:05:31
Stanford Deep Data Research Center
28K views
DP-900 Data Fundamentals Study Cram v2
2:28:01
John Savill's Technical Training
138K views
Testing Types and Triggers - Chalk and Talk
12:56
Dave Does Demos
230 views
The Truth About Data
11:11
Dave Does Demos
778 views
GDPR and cookie dialogs
11:14
Dave Does Demos
265 views