SQL Best Practices - Designing An ETL - Part 1

  73,072 views

Seattle Data Guy

1 day ago

Comments: 40
@abnuru1784 · 5 years ago
This is down to earth. I appreciate your level of understanding and your passing it on to me as a beginner.
@SeattleDataGuy · 5 years ago
Thank you for your kind words Abdul! Glad I could pass on some knowledge.
@yaqubhassan51 · 5 years ago
By far one of the best I have seen on data mapping and the stages of ETL... Thanks, Ben! Keep them coming!!
@SeattleDataGuy · 5 years ago
Thank you! I am trying to find time between work and consulting!
@passais · 5 years ago
Is a Part 2 coming? I could not find it. Very good vid!
@zhangleo9192 · 3 years ago
I can’t find it either
@elysel9424 · 3 years ago
I think this is it: kzbin.info/www/bejne/aKKwZIiijZ2titE&ab_channel=SeattleDataGuySeattleDataGuy
@jpank11 · 3 years ago
You’re a natural teacher. Great vid!
@SeattleDataGuy · 3 years ago
You're too kind! Thank you!
@ablack0 · 4 years ago
Awesome explanation. Thank you. I have lots of questions about this topic. How do you handle additions of new data? Do the raw and stage databases get cleared out each time new data is added to the data warehouse? Also how do you handle changes in ETL logic over time? How would you handle a situation where a portion of historical data needs to be reloaded into the data warehouse possibly using new ETL logic? Should it always be possible to recreate the entire data warehouse from the flat files? If so how do you ensure that the current state of the data warehouse is the same as it would be if it were blasted away and recreated from scratch? Are there any strategies for version control of a data warehouse?
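One common pattern that addresses several of these questions is to make each load idempotent and partition-scoped, so that any slice of history can be replayed with new ETL logic and a full rebuild from the flat files converges to the same end state. A minimal sketch with SQLite; the `sales` table and its columns are illustrative, not from the video:

```python
import sqlite3

def load_partition(conn, load_date, rows):
    """Replace everything for one load_date with the given rows (delete-then-insert)."""
    with conn:  # single transaction: delete and insert commit together
        conn.execute("DELETE FROM sales WHERE load_date = ?", (load_date,))
        conn.executemany(
            "INSERT INTO sales (load_date, product, amount) VALUES (?, ?, ?)",
            [(load_date, p, a) for p, a in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (load_date TEXT, product TEXT, amount REAL)")

day1 = [("apple", 10.0), ("pear", 4.0)]
load_partition(conn, "2020-01-01", day1)
# Re-running the same day (e.g. after fixing ETL logic) does not duplicate rows.
load_partition(conn, "2020-01-01", day1)

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # 2 — idempotent: two loads, still two rows
```

Because every run replaces its partition, reloading a portion of history with new logic is just re-running those partitions, and the warehouse stays reproducible from the source files.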
@higiniofuentes2551 · 3 years ago
Very interesting video, with a lot of ideas. Given the title, the content is a bit different from what I expected; I thought it would focus more on presenting the "players" used to construct an ETL. The idea of raw data seems to come more from the operational database than from the flat files (CSV, etc.), and after the cleaning/staging/mapping come the stage DB and then the data warehouse. I can't wait to see part 2!
@AIMagician996 · 2 years ago
where is part 2?
@09soleil · 2 years ago
No part 2? What about the logging and error tracking video? Really nice video, thank you very much.
@KoldFiyure · 5 years ago
Thank you so much. I needed this for an interview I have coming up. I need some formal concepts for what I am doing at my current job, where things aren't exactly referred to in these ways.
@SeattleDataGuy · 5 years ago
Let me know if you have any other specific questions or concepts you would like covered!
@Thiago280690 · 2 years ago
Great explanation!
@SeattleDataGuy · 2 years ago
Thank you! Man, this video is old... I need to make a new one.
@Georgehwp · 2 years ago
Would be great if you could make this more specific, with something like dbt or Prefect/Dagster/Airflow involved.
@weouthere6902 · 2 years ago
Wish you made more content like this
@peekguyy3194 · 1 year ago
thanks for this
@SeattleDataGuy · 1 year ago
Glad you liked it!
@00EagleEye00 · 3 years ago
Hi there. I have a question on raw data (flat files). These don't have identities or keys, so you formulate a candidate key by combining some columns (product, location, target_year). Here's the question: if some columns need to be corrected and they belong to the candidate-key combination, how can the data be corrected or updated? What approach should be taken?
@SeattleDataGuy · 3 years ago
I'm sorry for leaving this question unanswered for so long. I really wasn't sure of the best way to respond. If you're still curious, do you mind rephrasing the question?
@00EagleEye00 · 3 years ago
@@SeattleDataGuy In the absence of unique ID keys for flat files, there are suggestions to use composite keys, i.e., combine columns to use as a candidate key (e.g., product, area, and product_year columns, resulting in a key like apple-washington-1999). The question is: what if one of the candidate columns has a null or empty value on the first ingestion (e.g., no area value, resulting in apple-1999 as the key), and a later update then arrives with the value filled in (say the area is indiana, which would give apple-indiana-1999 as the key)? What is the best approach to handle this scenario? Would it result in data loss, since the first key combination is incomplete, and once the update arrives and completes it, should the record with the incomplete key be removed/cleaned?
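One possible policy for the scenario described above (an illustration, not from the video): treat an incoming record as the same entity when every non-null key component matches, and fill in the missing parts in place instead of inserting a second row, so no data is lost and no duplicate key appears. A sketch with made-up field names:

```python
def upsert(records, incoming):
    """records: list of dicts keyed by (product, area, year); area may be None.

    An existing row with a null key component is merged with an incoming
    row that fills it in. An incoming row with a *less* complete key than
    any existing row is treated as a new entity under this policy.
    """
    for rec in records:
        same_entity = all(
            rec[k] == incoming[k] or rec[k] is None
            for k in ("product", "area", "year")
        )
        if same_entity:
            rec.update(incoming)  # fill nulls / correct values in place
            return records
    records.append(dict(incoming))
    return records

rows = []
# First ingestion: area missing, so the key is effectively apple-1999.
upsert(rows, {"product": "apple", "area": None, "year": 1999, "qty": 5})
# Later update completes the key: merged, not duplicated.
upsert(rows, {"product": "apple", "area": "indiana", "year": 1999, "qty": 7})
print(rows)  # one row: apple-indiana-1999, no leftover incomplete-key record
```

The same merge rule can be expressed in SQL as an update matching on the non-null key parts followed by an insert of whatever didn't match.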
@hakank.560 · 3 years ago
As a financial auditor, I want to extract data from our clients' databases and then manipulate it to produce audit information. Is learning the SQL language the best thing to do? I'd like to hear from you.
@SeattleDataGuy · 3 years ago
That's interesting. Is there any other way to extract the data? Learning SQL is a big lift and might not be a good time trade-off. If you need to do analytics on the data, then I would say yes. But if you're just auditing, it might not provide the same benefit.
@lambdakicks · 4 years ago
What about removing nulls and malformed entries -- would you recommend doing that prior to Staging or afterwards?
@lambdakicks · 4 years ago
Excellent video by the way, thank you!
@SeattleDataGuy · 4 years ago
Personally, I prefer loading raw data as is, regardless of whether there are errors. Why? Because how do you know where the errors came from? If you add a bunch of logic into your raw load, that logic could mess up your data and you might not know it. However, if you have a problem in your raw data and all you are really doing is loading, then you know the problem is in the data itself. This isn't a rule, more of a guideline. It depends how messy your data is. If 90% of your data is messy and needs cleaning, then consider doing it prior; but if only 5% is, then it might be better to do it later.
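The raw-then-stage split described above can be sketched concretely. In this illustration (SQLite, with made-up table names), the raw load applies no logic at all, so any bad value is provably in the source data, and the raw-to-stage step does the type conversion, quarantining rows that fail rather than silently dropping them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (product TEXT, amount TEXT)")
conn.execute("CREATE TABLE stage_sales (product TEXT, amount REAL)")

source_rows = [("apple", "10.5"), ("pear", "oops"), ("plum", "3")]
# Raw load: everything lands verbatim as text — errors stay attributable
# to the source file, not to load-time logic.
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", source_rows)

# Raw -> stage: cleaning happens here, where failures are visible.
rejects = []
for product, amount in conn.execute("SELECT product, amount FROM raw_sales"):
    try:
        conn.execute("INSERT INTO stage_sales VALUES (?, ?)", (product, float(amount)))
    except ValueError:
        rejects.append((product, amount))  # quarantine for inspection

print(conn.execute("SELECT COUNT(*) FROM stage_sales").fetchone()[0])  # 2
print(rejects)  # [('pear', 'oops')]
```

The reject list is what makes the "where did the error come from" question answerable: the bad row exists unmodified in `raw_sales`, and the staging step recorded exactly why it was excluded.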
@kshitijpathak8646 · 2 years ago
Great video! I have one clarifying question. Is there a need to create CSV, XML, etc. files from the operational DB and then load the data into the raw DB? Wouldn't it be easier and more efficient to simply load the data from the operational DB into the raw DB without creating any files in between?
@SeattleDataGuy · 2 years ago
This can depend. Some people like to do this for observability's sake.
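The observability trade-off can be sketched as follows. Materialising the extract as a file (here just an in-memory buffer; all names are illustrative) leaves each run an immutable artifact that can be diffed, audited, or re-loaded later, independent of the operational DB's current state, at the cost of an extra hop:

```python
import csv
import io

def extract_to_csv(rows, header):
    """Write the extracted rows as CSV text (stands in for a file on disk/S3)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Snapshot taken at extract time; the operational DB can change afterwards.
snapshot = extract_to_csv([("apple", 10), ("pear", 4)], ("product", "amount"))

# The raw load later reads exactly what was extracted, nothing else.
rows = list(csv.reader(io.StringIO(snapshot)))
print(rows[0])  # ['product', 'amount']
```

Loading straight from the operational DB skips the hop but leaves no record of what the source looked like at extract time, which is the trade-off the reply above alludes to.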
@michailo87 · 2 years ago
Have you ever built a data quality monitoring system that sends notifications for loading issues? It helps to track issues at every level of loading.
@JanUnitra · 2 years ago
Where should the slowly changing dimension process be done, in staging or in the DW?
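Slowly changing dimensions are commonly handled in the staging/transform layer, before the warehouse load, though practices vary. A minimal Type 2 sketch with illustrative field names: when an attribute changes, expire the current version and append a new one, so history is preserved.

```python
from datetime import date

def scd2_apply(dim_rows, key, new_attrs, today):
    """dim_rows: list of dicts with 'key', 'attrs', 'valid_from', 'valid_to'.

    Type 2 SCD: close out the current version (valid_to = today) and
    append the new version as the current row (valid_to = None).
    """
    for row in dim_rows:
        if row["key"] == key and row["valid_to"] is None:
            if row["attrs"] == new_attrs:
                return dim_rows          # no change, nothing to do
            row["valid_to"] = today      # expire the old version
            break
    dim_rows.append({"key": key, "attrs": new_attrs,
                     "valid_from": today, "valid_to": None})
    return dim_rows

dim = []
scd2_apply(dim, "cust-1", {"city": "Seattle"}, date(2020, 1, 1))
scd2_apply(dim, "cust-1", {"city": "Portland"}, date(2021, 6, 1))
print(len(dim))  # 2 versions: the old one closed, the new one current
```

Running this in staging means the warehouse only ever receives finished dimension rows; running it in the DW instead couples the history logic to the warehouse load. Either works, but the versioning step has to happen exactly once.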
@mohamedarif2303 · 3 years ago
Very nice, but could you provide more videos on ETL please? Thanks.
@ninjaturtle205 · 1 year ago
I think I now understand everything.
@roshanshah5028 · 4 years ago
Hi, I have a few questions related to this topic.
@user-yq9kr9sy9i · 1 year ago
Please provide an example and explain this again
@50tigres79 · 3 years ago
Where is part 2?
@elysel9424 · 3 years ago
kzbin.info/www/bejne/aKKwZIiijZ2titE&ab_channel=SeattleDataGuySeattleDataGuy
@ninjaturtle205 · 1 year ago
mindd blowwnnn