We'll be covering data lakes, the Parquet file format, data compression, and shuffle! Make sure to have a www.DataExpert.io account so you can get the most out of this lab!
Comments: 40
@alonzo_go · 2 months ago
This channel is gold for any young data engineer. I wish I could pay you but you're probably already swimming in enough data :D
@jay_wright_thats_right · 7 days ago
How do you know that? Did you get a job from what you learned on this channel? Are you actually a data engineer?
@alonzo_go · 7 days ago
@@jay_wright_thats_right yes, I'm actually a data engineer. I've been a data engineer for many years now, so no, I didn't get a job because of the channel. But I can confirm that he teaches important concepts that are very useful and sometimes not readily available to a beginner engineer.
@justinwilkinson6300 · 4 months ago
Great lesson Zach! I have always wondered what the hell a Data Lake is. Great explanations and super easy to understand!
@nobodyinparticula100 · 4 months ago
Zach! We just started our project where we will be transferring our data to Data Lake in parquet! This is a very timely video. Awesome job, as always!
@andydataguy · 4 months ago
Awesome video man! Just discovered your channel and excited to see more like this
@vivekjha9952 · 3 months ago
Zach, I watched this on my way to the office, and I loved it. Learnt a hell of a lot of things. Thanks for it!
@theloniusmonkey5138 · 4 months ago
Great and insightful lessons Zach, just high-quality content! Your community of loyal DEs is growing :) Keep it up!
@murilloandradef · 4 months ago
amazing class Zach! keep going, thxxx
@rohitdeshmukh197 · 4 months ago
Great video Zach, awesome content, I learnt a lot. Can you please make a video or share some content about why we should avoid shuffling, common shuffling issues, and ways to fix them?
@vivekjha9952 · 4 months ago
It's a great video Zach, thoroughly enjoyed it.
@qculryq43 · 3 months ago
Wow - I learned so much from this video - Amazing! Thank you for sharing.
@srinubathina7191 · 3 months ago
Wow, amazing content Zach. Thank you so much!
@muhammadzakiahmad8069 · 4 months ago
Need more of these videos, beginner friendly 💡
@papalaplace · 4 months ago
Great as always 🎉
@anthonyanalytics · 3 days ago
Wow this is amazing!
@ManishJindalmanisism · 4 months ago
Thanks Zach, the practical you showed helped me learn a lot. Can you please tell me: if I do daily sorted inserts into my Iceberg table from my OLTP system using an ETL pipeline, will Iceberg consider that insert 'exclusive' and compress and store it on its own, or will it look at common columns in the existing data files as well and then compress?
@atifiu · 3 months ago
@zach Thanks for this informative video. I have one question. You mentioned sorting the data on low-cardinality columns first and then moving towards high-cardinality ones for better RLE, which makes sense for getting more compressed data. But on the read side, taking Iceberg as an example, we generally try to filter on high-cardinality columns, and so we'd use those columns when sorting the data so that we read less data and predicate pushdown helps us read a very small subset. These two settings contradict each other: on one side we get smaller data, but on the other side we'd rather sort on high-cardinality columns.
@EcZachly_ · 3 months ago
Yep it’s an art! It all depends on what columns are the most likely to be filtered on!
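The tradeoff in that exchange can be sketched with a toy run counter in pure Python. This is illustrative only: `country` (low cardinality) and `user_id` (high cardinality) are made-up columns, and `run_count` stands in for the number of runs a run-length encoder would have to store.

```python
import itertools

def run_count(values):
    """Number of runs an RLE encoder would store for this column."""
    return sum(1 for _ in itertools.groupby(values))

# Toy table: 3 countries x 1000 user ids = 3000 rows of (country, user_id).
rows = [(c, u) for u in range(1000) for c in ("US", "IN", "BR")]

# Sort low-cardinality first: `country` forms 3 long runs, great for RLE.
by_low = sorted(rows, key=lambda r: (r[0], r[1]))
# Sort high-cardinality first: `country` changes on every row, no runs survive.
by_high = sorted(rows, key=lambda r: (r[1], r[0]))

low_runs = run_count([r[0] for r in by_low])    # 3 runs, one per country
high_runs = run_count([r[0] for r in by_high])  # 3000 runs, one per row
```

Same data, same row count, but the low-cardinality-first layout compresses the `country` column three runs wide instead of one run per row, which is the compression half of the tradeoff; the read-side half is which column your filters actually hit.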
@JP-zz6ql · 4 months ago
Wow, the way people push VC is creative now. Good video.
@LMGaming0 · 2 months ago
Amazing video! + 1 follower :D
@thoughtfulsd · a month ago
This is amazing. You are a fabulous teacher. Had a question on replication: is the replication factor not a requirement any more in modern cloud data lakes?
@EcZachly_ · a month ago
Nope. Hadoop is dead fam
@zwartepeat3552 · 2 months ago
Casually ending the gender debate 😂 good video sir! Very informative
@pauladataanalyst · 3 months ago
Hello Zach, thanks for the content, after May, when is the next bootcamp?
@EcZachly_ · 3 months ago
There's a 100% chance May is the last one where I'm teaching a majority (~75%) of the content. September/October would be the next one. I'll be teaching like… 30-40%.
@YEM_ · 4 months ago
The tables you are using for your sources... Are those Iceberg tables, which are really just files and folders in S3 under the hood, placed there before the training? I'm just confused about where the raw data is coming from and what it looks like.
@EcZachly_ · 4 months ago
Yep
@LMGaming0 · a month ago
I have a question: during the whole video you've been dealing with historical data and moving it. What about new incoming data, how do you deal with it? Do you insert it into some intermediate table and then update your Iceberg table using cron jobs, or do you insert it directly into Iceberg, and how?
@EcZachly_ · a month ago
Collect a daily batch in Kafka, then dump it to Iceberg.
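A rough sketch of that daily-batch pattern, with plain Python stand-ins: there is no real Kafka consumer or Iceberg client here, and the `(epoch_seconds, payload)` event tuples and day-based grouping are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

def batch_by_day(events):
    """Group streamed (epoch_seconds, payload) events into daily batches.

    Each day's batch would then be appended to the Iceberg table as one
    write, producing one snapshot instead of many tiny per-event files.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        batches[day].append(payload)
    return dict(batches)

# Events as they might arrive from a Kafka topic (timestamps are made up).
events = [(1700000000, "a"), (1700003600, "b"), (1700090000, "c")]
batches = batch_by_day(events)
# batches == {"2023-11-14": ["a", "b"], "2023-11-15": ["c"]}
```

The point of the daily cut is write amplification: appending once per day keeps the table's file count and snapshot history small, instead of committing per event.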
@datbooii · a month ago
Another heat vid
@YEM_ · 4 months ago
What SQL syntax is that? (So I can Google it to research more about what options are available to create a table.)
@EcZachly_ · 4 months ago
Trino, which has nearly identical syntax to Postgres.
@amankapoor3563 · 4 months ago
Apart from reducing the size, does sorted_by help reads in any other way? Are ORDER BY queries more efficient with sorted_by?
@EcZachly_ · 4 months ago
You get file and row-group skipping and stuff like that too when you filter on the sorted_by columns, so it's more efficient there too.
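A minimal sketch of that skipping mechanism, assuming the common min/max "zone map" statistics design that Parquet row groups use. This is pure Python, nothing Trino- or Iceberg-specific; `CHUNK`, the column values, and the filter target are all made up.

```python
import random

CHUNK = 100  # pretend each 100-row chunk is a Parquet row group

def build_zone_maps(values):
    """Record (min, max) per chunk, like per-row-group column statistics."""
    return [(min(values[i:i + CHUNK]), max(values[i:i + CHUNK]))
            for i in range(0, len(values), CHUNK)]

def chunks_to_read(zone_maps, target):
    """A reader only opens chunks whose [min, max] range can contain target."""
    return [i for i, (lo, hi) in enumerate(zone_maps) if lo <= target <= hi]

sorted_col = list(range(1000))  # column laid out by sorted_by
random.seed(0)
shuffled_col = random.sample(sorted_col, k=1000)  # same values, no sort

hits_sorted = chunks_to_read(build_zone_maps(sorted_col), 42)      # [0]: 1 of 10 chunks
hits_shuffled = chunks_to_read(build_zone_maps(shuffled_col), 42)  # typically all 10
```

When the column is sorted, the value 42 can only live in one chunk, so nine of the ten chunks are skipped without being read; unsorted, nearly every chunk's min/max range straddles 42 and nothing can be skipped.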
@sreesanjeev84 · 4 months ago
Is it necessary to sort the dataset? Say, what if the compute and time for sorting vastly outweigh the storage saved? Even if the storage is very large, it is cheaper, right? What is the good tipping point here?
@EcZachly_ · 4 months ago
Depends on downstream consumption and volume. If the data set isn't used a ton, sorting probably isn't worth it.