Dimensional data modeling and idempotent pipelines in 78 minutes with DataExpert.io

21,380 views

A day ago

We'll be covering:
- Idempotent pipelines
- Why non-idempotent pipelines are problematic
- Things that make your pipelines not idempotent
- Slowly changing dimensions (SCD1, SCD2, SCD3, etc)
Make sure you have a DataExpert.io account to get the most out of this session.
Join www.DataExpert... to get the queries!

Comments: 31

@TheBigWazowski 1 month ago

I think there are 2 different terms that are getting mixed up and both called 'idempotent'.

Purity is the property that whenever you execute an operation, the result is the same as long as the input data is the same. The result doesn't depend on some external state like the time of day. Any mathematical operation, like f(x) = 2x, is pure. An operation like f(x) = get_current_datetime() is not pure.

An idempotent operation is one where if you execute it back to back, you get the same result as if you only executed it once. In math notation, that means for any input x:

f(x) = f(f(x))

f(x) = 2x is NOT idempotent because f(f(x)) = 4x ≠ f(x). A simple idempotent operation is rounding up (ceiling):

ceil(1.1) = 2
ceil(ceil(1.1)) = ceil(2) = 2

Idempotent operations are important in distributed systems because they let you retry things without duplicating the end result. Say you have a service that inserts a row into a database, but the database takes a long time to acknowledge the insertion. The service might retry the insertion, thinking that the network dropped the original message, even though the database has already received the insertion and is just slow to acknowledge. If inserting is not idempotent, retrying the insertion will result in duplicate entries in the database. However, you can turn a non-idempotent operation into an idempotent one by attaching a unique id. Then, if the operation is received twice by the database, it can go, "I've already received this unique id, so I can ignore the second insertion operation."

I think the general point of the video was to say that if you're just running some mathematical computation and inspecting the result, that's "idempotent" (it really should say pure) in the sense that you could re-run the computation on the same input and get the same result. That's because you're not performing a stateful change, like inserting into a database.

This point is true, but the way it's explained and the examples given are confusing. I think it misleads people who are learning about this for the first time. Mathematical computations are not always idempotent (in the actual sense of idempotent), and database insertions can be idempotent.
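The distinction drawn in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `double`, `current_time`, `is_idempotent_at`, and `ToyTable` are hypothetical names made up for this sketch, and the "database" is an in-memory list, not a real one.

```python
import math
from datetime import datetime


def double(x):
    # Pure: the result depends only on x.
    # Not idempotent: double(double(x)) is 4x, not 2x.
    return 2 * x


def current_time(_x):
    # Not pure: the result depends on the clock, not on the input.
    return datetime.now()


def is_idempotent_at(f, x):
    # Spot check of f(f(x)) == f(x) for a single input; not a proof.
    return f(f(x)) == f(x)


class ToyTable:
    """Hypothetical in-memory stand-in for a database table."""

    def __init__(self):
        self.rows = []
        self._seen_ids = set()

    def insert(self, row):
        # Non-idempotent: every retry appends another copy of the row.
        self.rows.append(row)

    def insert_once(self, request_id, row):
        # Idempotent: a retry carrying the same request_id is ignored.
        if request_id in self._seen_ids:
            return False
        self._seen_ids.add(request_id)
        self.rows.append(row)
        return True


table = ToyTable()
table.insert_once("req-1", {"name": "jordan"})
table.insert_once("req-1", {"name": "jordan"})  # the retry changes nothing
```

Here `math.ceil` passes the spot check while `double` fails it, even though both are pure, and `insert_once` shows how a unique id makes a stateful write safe to retry.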

@ericaleverson9430 7 months ago

I am a data analyst who is working on transitioning into a data engineer and I love your videos

@333kenshin 4 days ago

my go-to explanation:
"make sure the light switch is on" = idempotent
"flip the light switch" = non-idempotent

@yubeeee 5 months ago

love the video, thanks Zach. i have found that SCD type 3 can be useful for modeling entities with a fixed lifecycle, such as an order going through various fixed fulfillment stages. but for any data with open-ended changes i wouldn't use it due to the data loss/backfill issues you mentioned

@cristoferfrancovilchis3915 7 months ago

SCD2 tables are also everywhere inside Oracle. Thanks for the great content Zach!!

@tharanga835 7 months ago

Very nice explanation, both theoretical and practical. Thank you sir for this great lesson, please continue with more sessions.

@RobotBoyZzz 7 months ago

Thank you! Interestingly enough, we actually faced a business case in our company where type 1 SCD is demanded by business users (they want to always see the latest version of a grouping in the resulting report). Moreover, some systems have join restrictions (e.g. ClickHouse doesn't allow joins on a BETWEEN clause). So, surprisingly, the need for SCD1 might appear in some cases.

@HaileSelasije-w8d 7 months ago

Great stuff and explanation Zach. Love your content, keep making it.

@charlesmathieu1495 6 months ago

bro is squeezing 2 hours worth of knowledge in 1 hour I can't even watch at 2x speed

@rembautimes8808 16 days ago

Really great content. Will most likely join Data Expert soon. 😂

@FraeuleinYT 6 months ago

44:20 workshop !

@sanooosai 1 month ago

great thank you sir

@sajidsarkar9574 6 months ago

What is a partition sensor?

@joa0liveira92 13 days ago

A partition sensor is used in orchestration tools (such as Apache Airflow) to check whether a particular partition (e.g., for a specific date or time period) has been populated with data. Only when the data in the partition is available (for example, in an S3 bucket) does the sensor allow the downstream tasks in the DAG to start.
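The idea behind a partition sensor can be sketched as a plain polling loop. This is a minimal sketch and not Airflow's sensor API: `wait_for_partition`, `partition_exists`, `poke_interval`, and `timeout` are hypothetical names chosen for this illustration, and the existence check is passed in by the caller so the example stays independent of any storage system.

```python
import time


def wait_for_partition(partition_exists, partition, poke_interval=60.0,
                       timeout=3600.0, sleep=time.sleep, clock=time.monotonic):
    """Block until partition_exists(partition) is true or the timeout passes.

    Returns True if the partition showed up, False if the wait timed out.
    """
    deadline = clock() + timeout
    while True:
        if partition_exists(partition):
            return True   # data landed; downstream tasks may start
        if clock() >= deadline:
            return False  # give up; downstream tasks should not run
        sleep(poke_interval)
```

A caller would pass a function that checks its own storage (for example, whether a dated folder contains files) and only launch downstream work when the call returns True.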

@ManishJindalmanisism 7 months ago

Hi Zach. Thanks for this. Can you please tell which IDE is this?

@EcZachly_ 7 months ago

This is DataExpert.io

@salihugurkmll7675 7 months ago

I was expecting the following scenario:

Prod table:
name,country
jordan,usa

SCD-2 table:
name,country,from_date,to_date,is_active
jordan,usa,2024-01-01,9999-12-31,1

Then on 2025-01-01 the prod table is updated in the source:
name,country
jordan,de

Write a script so that the SCD-2 table looks like this:
name,country,from_date,to_date,is_active
jordan,usa,2024-01-01,2024-12-31,0
jordan,de,2025-01-01,9999-12-31,1

Can you make a video on this?

@EcZachly_ 7 months ago

This video covers that exactly

@EcZachly_ 7 months ago

You want the incremental version, and I'm okay with making that. You can use the queries from this lecture and figure it out.

@salihugurkmll7675 7 months ago

@EcZachly_ thanks Zach, good work!
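The incremental SCD-2 update asked about in this thread can be sketched in plain Python. This is not the query from the video: `apply_scd2`, `OPEN_END`, and the dict-based row layout are assumptions made for this illustration, and names that disappear from the source are simply left active.

```python
from datetime import date, timedelta

OPEN_END = date(9999, 12, 31)  # sentinel end date for the active row


def apply_scd2(history, source, as_of):
    """Return a new SCD-2 history after applying a source snapshot.

    history: list of dicts with name, country, from_date, to_date, is_active
    source:  dict mapping name -> current country
    as_of:   date on which the snapshot takes effect
    """
    result = []
    still_active = set()
    for row in history:
        row = dict(row)  # copy so the input history is not mutated
        if row["is_active"]:
            new_country = source.get(row["name"])
            if new_country is not None and new_country != row["country"]:
                # Value changed: close the old row the day before as_of.
                row["to_date"] = as_of - timedelta(days=1)
                row["is_active"] = 0
            else:
                still_active.add(row["name"])
        result.append(row)
    for name, country in source.items():
        if name not in still_active:
            # New name or changed value: open a fresh active row.
            result.append({"name": name, "country": country,
                           "from_date": as_of, "to_date": OPEN_END,
                           "is_active": 1})
    return result


history = [{"name": "jordan", "country": "usa",
            "from_date": date(2024, 1, 1), "to_date": OPEN_END,
            "is_active": 1}]
updated = apply_scd2(history, {"jordan": "de"}, date(2025, 1, 1))
```

Applying the same snapshot to `updated` a second time leaves it unchanged, which ties back to the idempotency theme of the session.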

@ramonsuarez6105 4 months ago

Excellent video, thanks Zach. How about tables where you are asked to have SCD 1 in some columns and SCD 2 in others?

@RostyslavIlnytskyi 2 months ago

I've searched for copper and I've found gold 😄 Zach, your videos are great, thanks a lot!

@soylentpink7845 5 months ago

What are good / standard books on the topic other than Kimball?

@joa0liveira92 13 days ago

I would like to know too! Thanks Zach for the massive knowledge you provide, really grateful 💯

@Han-ve8uh 3 months ago

1:05:27 Why is AJ Griffin's row_number 1 getting is_active = false? ORDER BY start_season should have given row 1 to the smallest season, and previously the seq was edited to start at the first year where a player has data. 1:04:42 also mentioned everyone's first value should be active.