We'll be covering data lakes, the Parquet file format, data compression, and shuffle! Make sure to have a www.DataExpert.io account so you can get the most out of this lab!
Comments: 40
@alonzo_go · 2 months ago
This channel is gold for any young data engineer. I wish I could pay you but you're probably already swimming in enough data :D
@jay_wright_thats_right · 7 days ago
How do you know that? Did you get a job from what you learned on this channel? Are you actually a data engineer?
@alonzo_go · 7 days ago
@@jay_wright_thats_right yes, I'm actually a data engineer. I've been a data engineer for many years now, so no, I didn't get a job because of the channel. But I can confirm that he teaches important concepts that are very useful and sometimes not readily available to a beginner engineer.
@justinwilkinson6300 · 4 months ago
Great lesson Zach! I have always wondered what the hell a Data Lake is. Great explanations and super easy to understand!
@nobodyinparticula100 · 4 months ago
Zach! We just started our project where we will be transferring our data to Data Lake in parquet! This is a very timely video. Awesome job, as always!
@andydataguy · 4 months ago
Awesome video man! Just discovered your channel and excited to see more like this
@vivekjha9952 · 3 months ago
Zach, I watched this on my way to the office, and I loved it. Learnt a hell of a lot of things. Thanks for it!
@theloniusmonkey5138 · 4 months ago
Great and insightful lessons Zach, just high-quality content! Your community of loyal DEs is growing :) Keep it up!
@murilloandradef · 4 months ago
amazing class Zach! keep going, thxxx
@rohitdeshmukh197 · 4 months ago
Great video Zach, awesome content, I learnt a lot. Can you please make a video or share some content about why we should avoid shuffling, common shuffling issues, and ways to fix them?
@vivekjha9952 · 4 months ago
It's a great video Zach, thoroughly enjoyed it.
@qculryq43 · 3 months ago
Wow - I learned so much from this video - Amazing! Thank you for sharing.
@srinubathina7191 · 3 months ago
Wow, amazing content Zach. Thank you so much!
@muhammadzakiahmad8069 · 4 months ago
Need more of these videos, beginner friendly 💡
@papalaplace · 4 months ago
Great as always 🎉
@anthonyanalytics · 3 days ago
Wow this is amazing!
@ManishJindalmanisism · 4 months ago
Thanks Zach, the practical you showed helped me learn a lot. Can you please tell me: if I do daily sorted inserts into my Iceberg table from my OLTP system using an ETL pipeline, will Iceberg consider that insert 'exclusive' and compress and store it on its own, or will it look at common columns in the existing data files as well and then compress?
@atifiu · 3 months ago
@zach Thanks for this informative video. I have one question. You mentioned sorting the data on low-cardinality columns first and then moving towards high-cardinality ones for better RLE, which makes sense for getting more compressed data. But on the read side, taking Iceberg as an example, we generally try to filter on high-cardinality columns, and so we'd use those columns when sorting the data so that we read less data and predicate pushdown helps us read a very small subset. These two settings contradict each other: on one side we get smaller data, but on the other side we'd rather sort on high-cardinality columns.
@EcZachly_ · 3 months ago
Yep it’s an art! It all depends on what columns are the most likely to be filtered on!
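The tradeoff in that exchange can be sketched with a toy run counter in pure Python. This is illustrative only: `country` (low cardinality) and `user_id` (high cardinality) are made-up columns, and `run_count` stands in for the number of runs a run-length encoder would have to store.

```python
import itertools

def run_count(values):
    """Number of runs an RLE encoder would store for this column."""
    return sum(1 for _ in itertools.groupby(values))

# Toy table: 3 countries x 1000 user ids = 3000 rows of (country, user_id).
rows = [(c, u) for u in range(1000) for c in ("US", "IN", "BR")]

# Sort low-cardinality first: `country` forms 3 long runs, great for RLE.
by_low = sorted(rows, key=lambda r: (r[0], r[1]))
# Sort high-cardinality first: `country` changes on every row, no runs survive.
by_high = sorted(rows, key=lambda r: (r[1], r[0]))

low_runs = run_count([r[0] for r in by_low])    # 3 runs, one per country
high_runs = run_count([r[0] for r in by_high])  # 3000 runs, one per row
```

Same data, same row count, but the low-cardinality-first layout compresses the `country` column three runs wide instead of one run per row, which is the compression half of the tradeoff; the read-side half is which column your filters actually hit.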
@JP-zz6ql · 4 months ago
Wow, the way people push VC is creative now. Good video.
@LMGaming0 · 2 months ago
Amazing video! + 1 follower :D
@thoughtfulsd · a month ago
This is amazing. You are a fabulous teacher. Had a question on replication: is the replication factor not a requirement any more in modern cloud data lakes?
@EcZachly_ · a month ago
Nope. Hadoop is dead fam
@zwartepeat3552 · 2 months ago
Casually ending the gender debate 😂 good video sir! Very informative
@pauladataanalyst · 3 months ago
Hello Zach, thanks for the content, after May, when is the next bootcamp?
@EcZachly_ · 3 months ago
There's a 100% chance May is the last one where I'm teaching a majority (~75%) of the content. September/October would be the next one. I'll be teaching like… 30-40%.
@YEM_ · 4 months ago
The tables you are using for your sources... Are those Iceberg tables, which are really just files and folders in S3 under the hood, placed there before the training? I'm just confused about where the raw data is coming from and what it looks like.
@EcZachly_ · 4 months ago
Yep
@LMGaming0 · a month ago
I have a question: during the whole video you've been dealing with historical data and moving it. What about new incoming data, how do you deal with it? Do you insert it into some intermediate table and then update your Iceberg table using cron jobs, or do you insert it directly into Iceberg, and how?
@EcZachly_ · a month ago
Collect a daily batch in Kafka, then dump it to Iceberg.
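A rough sketch of that daily-batch pattern, with plain Python stand-ins: there is no real Kafka consumer or Iceberg client here, and the `(epoch_seconds, payload)` event tuples and day-based grouping are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

def batch_by_day(events):
    """Group streamed (epoch_seconds, payload) events into daily batches.

    Each day's batch would then be appended to the Iceberg table as one
    write, producing one snapshot instead of many tiny per-event files.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        batches[day].append(payload)
    return dict(batches)

# Events as they might arrive from a Kafka topic (timestamps are made up).
events = [(1700000000, "a"), (1700003600, "b"), (1700090000, "c")]
batches = batch_by_day(events)
# batches == {"2023-11-14": ["a", "b"], "2023-11-15": ["c"]}
```

The point of the daily cut is write amplification: appending once per day keeps the table's file count and snapshot history small, instead of committing per event.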
@datbooii · a month ago
Another heat vid
@YEM_ · 4 months ago
What SQL syntax is that? (So I can Google it to research more about what options are available to create a table.)
@EcZachly_ · 4 months ago
Trino, which has nearly identical syntax to Postgres.
@amankapoor3563 · 4 months ago
Apart from reducing the size, does sorted_by help reads in any other way? Are ORDER BY queries more efficient with sorted_by?
@EcZachly_ · 4 months ago
You get file and row-group skipping and stuff like that too when you filter on the sorted_by columns, so it's more efficient there too.
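A minimal sketch of that skipping mechanism, assuming the common min/max "zone map" statistics design that Parquet row groups use. This is pure Python, nothing Trino- or Iceberg-specific; `CHUNK`, the column values, and the filter target are all made up.

```python
import random

CHUNK = 100  # pretend each 100-row chunk is a Parquet row group

def build_zone_maps(values):
    """Record (min, max) per chunk, like per-row-group column statistics."""
    return [(min(values[i:i + CHUNK]), max(values[i:i + CHUNK]))
            for i in range(0, len(values), CHUNK)]

def chunks_to_read(zone_maps, target):
    """A reader only opens chunks whose [min, max] range can contain target."""
    return [i for i, (lo, hi) in enumerate(zone_maps) if lo <= target <= hi]

sorted_col = list(range(1000))  # column laid out by sorted_by
random.seed(0)
shuffled_col = random.sample(sorted_col, k=1000)  # same values, no sort

hits_sorted = chunks_to_read(build_zone_maps(sorted_col), 42)      # [0]: 1 of 10 chunks
hits_shuffled = chunks_to_read(build_zone_maps(shuffled_col), 42)  # typically all 10
```

When the column is sorted, the value 42 can only live in one chunk, so nine of the ten chunks are skipped without being read; unsorted, nearly every chunk's min/max range straddles 42 and nothing can be skipped.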
@sreesanjeev84 · 4 months ago
Is it necessary to sort the dataset? Say, what if the compute and time for sorting vastly outweigh the storage saved? Even if the storage is very large, it is cheaper, right? What is the good tipping point here?
@EcZachly_ · 4 months ago
Depends on downstream consumption and volume. If the data set isn't used a ton, sorting probably isn't worth it.