Check out my academy at www.DataExpert.io where you can learn all this in much more detail! You can use code ZACH15 to get 15% off! #dataengineering #netflix
Comments: 390
@sevrantw8931 6 months ago
I’m so glad I found this video, I was just sitting here with 60 million gigabytes and was figuring out what joins to use so this was perfect timing.
@aripapas1098 6 months ago
if all u registered was 60 mil gb & joins ur not flowing
@smackastan5697 6 months ago
You're kidding, but somehow I just started a data analysis project of two terabytes and this video shows up.
@hi-mn5rg 6 months ago
@@aripapas1098 if you think comments must indicate a user registered every aspect of a video, ur not following
@derickd6150 6 months ago
@@aripapas1098 this is a sad comment
@00Tenrai00 6 months ago
Sarcasm ???? 😂
@bilbobeutlin3405 6 months ago
Can't wait to build hyperscale pipelines for my startup with 0 users
@92kosta 6 months ago
But it sounds powerful when you say it, like you mean business.
@npc-drew 6 months ago
Based
@vikingthedude 6 months ago
1 user (me)
@JGComments 6 months ago
If you build it, they will come.
@abhilashpatel6852 6 months ago
I have 1k TB data just sitting around in my backyard. Glad your video came up to get me started on at least something.
@supercompooper 6 months ago
In the future a wrist watch will have a little blinking light that will have 60 million gigabytes of data in it
@dhillaz 6 months ago
You mean an Electron app?
@aripapas1098 6 months ago
yeah okay crack smoker
@mrevilducky 6 months ago
And it will still lag and hit 99% singularities
@Ivan-Bagrintsev 6 months ago
@@dhillaz that will just show current time
@supercompooper 5 months ago
@@Ivan-Bagrintsev Yes it will show the time, but with full DRM. Unless you have a license to view certain minutes it will be denied.
@supafiyalaito 6 months ago
Thanks Zach, hopefully one day I will understand what all of that means
@mu3076 2 months ago
😂😂😂, I’m starting now
@lucas.p.f 6 months ago
Boyfriend simulator: you sit with your bf and he starts talking about this nerdy stuff you have no idea about but need to keep listening because you love him
@EcZachly_ 5 months ago
This is exactly correct
@CU.SpaceCowboy 5 months ago
aww 🥰
@heykike 4 months ago
After marriage they no longer pretend to listen
@rajns8643 4 months ago
If only a girl would fall for me when I speak nerdy stuff 🫠
@lucas.p.f 4 months ago
@@rajns8643 are you kidding me? This is what most people like the most! Intelligent people are extremely attractive
@RichardOles 6 months ago
Holy crap. I’m currently learning about data science, the various roles, etc. -with the hope of one day switching careers. But the current state of learning is all about the languages and software used etc, not about the infrastructure and what to do with massive datasets. So this just 🤯
@samuelisaacs7557 4 months ago
It's really about math but no one talks about it. Get at least 1 year of university math comprehension and then get into the Python and tech tools. The most competent and successful data engineers are always people with a good STEM background. For example, Zach has a Bachelor's Degree in Applied Mathematics and a Bachelor's Degree in Computer Science, so he is a heavy numbers guy. That's what most Data Science / Engineering YouTubers don't tell their viewers, because that would cause them to lose viewers.
@byRoyalty 4 months ago
learning the tools can be very different from solving real world problems.
@rajns8643 4 months ago
@@samuelisaacs7557 True asf
@stevess7777 4 months ago
@@samuelisaacs7557 Yep, even a business administration bachelors will have a lot of maths and it's nowhere near data science which is 3x that.
@JGComments 6 months ago
2 pita bites a day, the same as me when I’m on a diet.😊
@ArjunRajaS 5 months ago
If you come across a scenario where you need to join 2 large datasets, you could do an iterative broadcast join. Basically, you break one of the dfs into multiple smaller dfs and join them against the other dataframe in a loop until every piece has been joined.
@jordanmessec5332 5 months ago
You’ll require a lot of memory and have long start times, no?
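For readers following this thread, here is a minimal sketch of the iterative broadcast join idea in plain Python rather than Spark. Everything in it (the `Row` shape, `chunked`, `iterative_broadcast_join`, the sample rows) is a hypothetical illustration, not code from the video. Only one chunk of the smaller side is held in memory at a time, which speaks to the memory question above, at the price of re-scanning the big side once per chunk.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, str]  # (join_key, value); hypothetical schema


def chunked(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def iterative_broadcast_join(
    big: List[Row], medium: List[Row], chunk_size: int
) -> List[Tuple[str, str, str]]:
    """Inner-join `big` with `medium`, one broadcast-sized chunk at a time."""
    joined: List[Tuple[str, str, str]] = []
    for chunk in chunked(medium, chunk_size):
        # Build an in-memory lookup for this chunk only (the "broadcast" side).
        lookup: Dict[str, List[str]] = {}
        for key, value in chunk:
            lookup.setdefault(key, []).append(value)
        # Probe the big side against the current chunk; `big` is never shuffled,
        # but it is scanned once per chunk.
        for key, big_value in big:
            for medium_value in lookup.get(key, []):
                joined.append((key, big_value, medium_value))
    return joined


big = [("u1", "play"), ("u2", "pause"), ("u3", "play"), ("u1", "stop")]
medium = [("u1", "US"), ("u2", "SE"), ("u4", "KZ")]
result = iterative_broadcast_join(big, medium, chunk_size=2)
```

The chunk size is the knob: smaller chunks need less memory per pass but mean more passes over the big side, so the result is the same while the cost profile changes.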
@oakleyorbit 3 months ago
Half of what you said I had no idea what you were talking about but I was very engaged and now I’m gonna look all this stuff up for centering my div!
@naraendrareddy2731 2 days ago
As a guy struggling to get a job because entry level roles require experience, I have learned something new and valuable today. Broadcast and SMB joins.
@rohanbhakat2922 6 months ago
Thanks for the info Zach. Could you please make an elaborate video on the SMB join?
@dazzassti 6 months ago
In the 37 years I’ve been working in data, I’ve never heard anyone call it Peter 😂. PETA
@anotherguy9402 5 months ago
What's wrong with a Peter bite?
@divinecomedian2 5 months ago
Heya Peeda
@Starmast3rmusic 5 months ago
Could be an accent or a slip 😂
@nikolagrkovic8769 5 months ago
The amount of knowledge you shared here is astonishing
@john_paul 5 months ago
I love how you acronym Sorted Bucket Merge as SMB. Think you may have had Super Mario Bros on the mind 😂
@remo 4 months ago
Damn I just wanted to shuffle like there’s no tomorrow and then I found this video.
@SamCyanide 6 months ago
My medical science clients called, they need an 800tb imaging data set parsed by end of day (thank you kubernetes)
@vikrampandit2174 6 months ago
Never thought broadcast join is a Netflix saviour
@Adhanks91 6 months ago
Informative and straight to the point, great stuff as usual
@picdu2891 4 months ago
I love technology and I know more than your average user, yet I have no IT qualifications and I am light years away from this knowledge, but for some reason, I love watching these videos as if I was ever going to use the information 😂
@OurNewestMember 4 months ago
Interesting! I would have thought something like sharding (or partitioning and clustering) so data processing and access can scale horizontally.
@EcZachly_ 4 months ago
Bucketing and clustering are similar
@xasm83 5 months ago
my data pipeline usually processes one pitabyte every other day and one shawarmabyte every week
@JT-zb6vi 6 months ago
instant subscribe - really appreciate the concise explanation and clear examples
@Settings404 5 months ago
I love that I’m only a software engineer but I can understand all of this
@ATX_Engineer 3 months ago
Ah yes, data structures and sorting… but with the “can you even scale bro” tick enabled.
@souravghosh358 6 months ago
Very important concept in such short time.. thank u so very much ❤
@motonoob-i2d 6 months ago
That's cool bro. Will it fix the Netflix app where it shows the title of one show but the preview and description of another?
@EcZachly_ 6 months ago
It was to look at network traffic to keep your credit card data secure
@chrism3790 4 months ago
What engine were you using to do these massive joins? Spark?
@EcZachly_ 4 months ago
Yep!
@cry2love 5 months ago
I still bite my gigas when my man hustling meta in peta
@aarjunpp 5 months ago
1. Are you a data engineer? 2. What tech is this? AWS, Snowflake?
@tanujkhochare3498 6 months ago
Hey Zach, your content is consistently amazing! As a newcomer to the field, I'm considering diving into data engineering. What roadmap would you recommend, and are there any certifications that could enhance my journey? I already have a solid grasp of Python and SQL in data analysis.
@uwize5897 5 months ago
optimizing selling personal data to minimize cost is something i never thought about
@ungeschaut 6 months ago
I use just a database with just value as field (long string) and nothing else
@twitchizle 2 months ago
I really wonder how netflix achieves 100tb/hr just with only streaming videos.
@theAnupamAnandhelueene 6 months ago
: multiple streams across entire ddrs directly accessible
@Dmytro-kt3fr 4 months ago
would you say that using bucketing and basically constraining against “acceptable” throughput, as well as risking creating a gazillion files in the process, is a more acceptable approach than more ad hoc ones like z-ordering and bloom filters?
@emerald42481 4 months ago
Very useful and interesting, even to a layman
@Dmytro-kt3fr 4 months ago
smb part was actually useful
@MrFraiche 3 months ago
How do you get a job in this field? Were you in software engineering?
@bruceleehiiiyaaa 5 months ago
middle out compression
@TheGoodContent37 4 months ago
Love the way you tried to make it sound more complicated than it actually is and failed.
@YishuaiLiu 6 months ago
Short and informative
@EcZachly_ 6 months ago
Thank you! What other videos would you like to see from me?
@GnomeEU 3 months ago
Now I just need a billion dollar company to have these kinda problems. My question would be, why you have table that big? Can't you distribute or cluster your data? I'm thinking like 10000 users per server. Only stuff around those 10k users gets stored. No magic needed to query stuff.
@EcZachly_ 3 months ago
Gotta analyze it all together though
@ChessFlix 6 months ago
Petabyte was misspelled. Great video though.
@edisco3643 4 months ago
Can you get a tripod for your cam?
@andreas1989 2 months ago
Hey Data with Zach .. I have some questions.. So netflix uses AWS servers all over the world.... I am wondering. how many gb is each 4K movies, 1080p movie.. ? :) and what audio mix do they have.. Dolby Atmos, DTX etc. etc. :) Have a good day.. love from sweden :)
@EcZachly_ 2 months ago
For serving videos they use OpenConnect and CloudFront, not AWS servers. This allows them to serve the video from the closest regional spot to you. Almost all videos can be served in 4K, but are downsampled depending on the current network conditions.
@dexnow 3 months ago
I suddenly feel like pita bread...
@brandonheaton6197 4 months ago
He is channeling a young William Benney over here isn't he
@caioreis350 6 months ago
Wait, why ordering a table and then joining it is more efficient? Why have I never heard of this technique before? Well, guess it's time to get into some digging
@coding3438 6 months ago
Ordering a table on the join keys. That's because for each key in one dataset, the entire other dataset doesn’t have to be scanned.
@bbbbbbao 6 months ago
As he said, whenever there is shuffling involved performance gets really poor. You can try doing some computation in Spark/Dask against the NYC taxi dataset.
@sepg5084 6 months ago
Because binary search becomes more efficient the bigger your dataset is, and binary search only works when the tables are sorted on your search keys. It also depends on the sorting algorithm.
@sepg5084 6 months ago
@ayyleeuz4892 because you don't know why it's faster 😉
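To make the "sort first, then join" question in this thread concrete, here is a small pure-Python sketch of a sorted, bucketed merge join. It is an illustration under assumptions, not how Spark implements it: the helper names and sample rows are hypothetical, and `zlib.crc32` merely stands in for whatever bucketing hash a real engine uses. Note that the join step here is a single forward merge over two sorted lists rather than a per-key binary search. Because both sides use the same bucketing function and bucket count, bucket i only ever has to be compared with bucket i.

```python
import zlib
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (join_key, value); hypothetical schema


def bucket_and_sort(rows: List[Row], num_buckets: int) -> Dict[int, List[Row]]:
    """Write-time step: hash each key to a bucket, then sort each bucket by key."""
    buckets: Dict[int, List[Row]] = {b: [] for b in range(num_buckets)}
    for key, value in rows:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        buckets[zlib.crc32(key.encode()) % num_buckets].append((key, value))
    for bucket in buckets.values():
        bucket.sort(key=lambda row: row[0])
    return buckets


def merge_join(left: List[Row], right: List[Row]) -> List[Tuple[str, str, str]]:
    """Inner-join two lists that are already sorted by key, in one forward pass."""
    out: List[Tuple[str, str, str]] = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on each side and emit their cross product.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    out.append((lkey, lval, rval))
            i, j = i_end, j_end
    return out


def smb_join(left: List[Row], right: List[Row], num_buckets: int) -> List[Tuple[str, str, str]]:
    """Join matching buckets pairwise; no row ever moves to another bucket."""
    left_buckets = bucket_and_sort(left, num_buckets)
    right_buckets = bucket_and_sort(right, num_buckets)
    out: List[Tuple[str, str, str]] = []
    for b in range(num_buckets):
        out.extend(merge_join(left_buckets[b], right_buckets[b]))
    return out


left = [("b", "L1"), ("a", "L2"), ("c", "L3"), ("a", "L4")]
right = [("a", "R1"), ("c", "R2"), ("d", "R3"), ("a", "R4")]
joined = smb_join(left, right, num_buckets=4)
```

In this sketch the bucketing and sorting happen inside `smb_join` for brevity. The point made in the video is that the cost pays off when that work is done once, at write time, and reused by every later join.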
@kali786516 4 months ago
Did you use Spark or Hive at Netflix to process 2000 TB per day?
@EcZachly_ 4 months ago
Spark
@kali786516 4 months ago
@@EcZachly_ spark batch or streaming job or structured streaming job ? do you have the parameters which are passed handy ? like number of executors etc ....
@EcZachly_ 4 months ago
@@kali786516 hourly batch
@kali786516 4 months ago
@@EcZachly_ do you mind sharing the spark parameters what you have passed if you have them handy ? probably a video might help
@EcZachly_ 4 months ago
@@kali786516 that video sounds painfully boring
@dark_lord98 6 months ago
Are those joins available in MySQL or specific to the DBMS at Meta where you worked?
@juanbrekesgregoris4405 6 months ago
I think they're not available on MySQL because it's an OLTP database. Those joins are used for analytics
@jordanmessec5332 5 months ago
These are not database joins, they are processing joins. Frameworks such as Flink and Spark would leverage broadcasts. It basically boils down to a single coordinator instance that publishes a small, often changing dataset to all parallel processors. Usually used to enrich, prune, or map the main dataset.
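The broadcast pattern described in this reply can be sketched without any framework. The code below is a hedged illustration with hypothetical names and data (`Event`, `join_partition`, `broadcast_join`, the country table): each partition of the large dataset stays where it is, receives its own copy of the small lookup table, and is enriched locally.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (country_code, bytes_sent); hypothetical schema


def join_partition(partition: List[Event], dim: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Enrich one partition locally using its own copy of the small table."""
    return [(code, dim[code], size) for code, size in partition if code in dim]


def broadcast_join(
    partitions: List[List[Event]], small_table: Dict[str, str]
) -> List[List[Tuple[str, str, int]]]:
    """Join every partition against a copy of `small_table` without moving rows."""
    results = []
    for partition in partitions:
        # dict(...) stands in for shipping the small table to a worker.
        local_copy = dict(small_table)
        results.append(join_partition(partition, local_copy))
    return results


partitions = [[("US", 10), ("SE", 5)], [("KZ", 7), ("XX", 1)]]
countries = {"US": "United States", "SE": "Sweden", "KZ": "Kazakhstan"}
enriched = broadcast_join(partitions, countries)
```

The large side is never repartitioned; only the small table is copied. In PySpark the same intent is usually expressed with the `broadcast()` hint from `pyspark.sql.functions`, which only works well when the small side fits comfortably in each executor's memory.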
@phitsf5475 5 months ago
The internet is not something you just dump something on, it's not a big truck. It's a series of tubes.
@maggiejetson7904 3 months ago
Honestly, 2000 TB per day isn't the problem. The problem is the cost and how much of the data is burst. If it is not burst it is pretty much always cheaper to do it in-house with your own hardware than to pay and rent the cloud to do it.
@sergeikulikov4412 6 months ago
You shouldn't write "s" in Terabyte per hour, just TB/hr. "TBs/hr" looks like "Terabyte*second / hour" 😅
@LaurentziueXtream 4 months ago
Can you help me get a job as Data Analyst? I have certifications but employers never hire me
@HaitiSpaceAgency 6 months ago
Could some of these problems be addressed with a decentralized , p2p system?
@jordanmessec5332 5 months ago
That’s effectively the shuffle he’s saying to avoid. In that approach you’re shuffling the large dataset so that it collocates with its relevant piece of the small dataset.
@theactualslimshady 6 months ago
Please keep up the great content!
@rutabega306 6 months ago
Netflix uses AWS?
@92kosta 6 months ago
Well, who doesn’t?
@seansingh4421 3 months ago
Wth is in that data ? Like seriously i feel like most of that would be redundant shit since even a chemical plant can be run entirely on Excel without ever needing db involvement
@MrTechroundup 6 months ago
Does Netflix still use its own pipelines or an ETL solution like Fivetran?
@EcZachly_ 6 months ago
Netflix pays engineers $500k/year to write all the pipelines with their bare hands
@iloos7457 6 months ago
Hey are you familiar with cosmosDB from azure? Its a db like mongo but claims to be able to scale infinitely... What are your thoughts on that?
@matthew.m.stevick 4 months ago
What’s a megabyte
@Settings404 5 months ago
This was so fucking interesting
@huntermacias2023 6 months ago
In 30 years our baseline laptops will have 8 Pita bytes of RAM
@nikonnikiforoff 6 months ago
If I shuffled all the words in this video, it would still sound the same to me.
@adamou02 5 months ago
That's a rizz !
@zb2747 4 months ago
Bro 100 TB an hour???? Yo whattt
@TLOGhx 6 months ago
Insanely valuable content
@rustychassis 6 months ago
Is a pitabyte a petabyte? Or just lots of bites of pita bread?
@NostraDavid2 6 months ago
Meanwhile I do big data with data sets that are 8TB or smaller. And we only use 1% or less of that data set xD
@DxWangZ 6 months ago
I don't quite understand why Netflix needs data pipelines.
@CharanGs-t4i 6 months ago
i dont even have single idea what he is talking about , i am just here to improve my youtube algorithm
@explosivecl 6 months ago
Thanks for the video
@saisagar79 6 months ago
Wonder what exact data is that🧐
@EcZachly_ 6 months ago
Every network request Netflix receives
@v5q211 6 months ago
What was your tech stack?
@EcZachly_ 6 months ago
Spark and Iceberg fam
@Lkdahiya-fn4oy 6 months ago
“Personally like” - you’re not asked to marry a join type.
@jcald111 4 months ago
What database was used?
@EcZachly_ 3 months ago
Data lake not data base and s3 + iceberg
@cesarfigueroa6119 5 months ago
tbh if youre dealing with this much data, it's likely a good problem to have 💰 💰
@jordanalloy178 5 months ago
If I have this guy as a mentor, I will be so grateful to God.
@EcZachly_ 5 months ago
I wish I could mentor everybody!
@jordanalloy178 5 months ago
@@EcZachly_ Job well done, sir. I'm a self-taught full stack developer. The journey is crazy, sir
@YannickSacherer 6 months ago
Hey, absolutely curious about the content you are doing. In my company we are working with dbt and Snowflake. I can't find a way to work with broadcast joins there. Do you see a possibility to replicate this process?
@EcZachly_ 6 months ago
Snowflake isn’t suitable for volumes >100tbs in my opinion. Clustering is an option in snowflake that helps though
@Jonas89offsuit 6 months ago
What hoody is that?
@chrzonszcz323 6 months ago
Dude you look like Jon Hopkins
@LuckyGnom 6 months ago
Every day I'm shuffelin' 🪩
@nat.serrano 4 months ago
This guy earned his half a million salary. I tried to do this myself and failed
@good-questions 6 months ago
The knowledgeability is sexy
@parthmalik1 4 months ago
In a way that's not gonna make Jeff Bezos millions of dollars 😂😂
@sunnylulla4323 1 month ago
He did S ## with that Pipeline....like using flashlight
@EcZachly_ 1 month ago
What makes you think that?
@udaysingh-wr2kw 6 months ago
I dont know anything about data science? Why am i watching this?
@sanjaybhatikar 5 months ago
2 pita bites 😂
@findmeinthecarpet 3 months ago
Wait? So youre making a table? 🤔🥴
@Fennecbutt 5 months ago
Pita bites, huh. Hope you got some hummus to go with that
@Goochigoop 6 months ago
This man’s got rizz. 😂
@rutabega306 6 months ago
No he's got genes
@udirt 6 months ago
I wanna hear about "everything I need to know about extreme high volume" AFTER you compare notes with people from various other organisations processing that much or lots more, that are outside your expertise. I.e. LLNL, CERN etc, chemical industry or any large government etc., and scaled down ones, too where you recognise your lessons. Otherwise it's just an incomplete angle. Which would still be OK if you reflect on it. But telling everything people need to know? Nah. You cant, i can't, they cant.
@EcZachly_ 6 months ago
It’s a two minute video bro, relax
@FMJ777 5 months ago
Peterbytes Bucket joining manage shuffle and fcuk Jeff bezos is what I got out of this
@Chris-Christopher- 4 months ago
I know an electrician from Alberta and he says your numbers don't make any sense. He says you're either making these numbers up, or you're doing something really stupid. I think maybe you guys should reach out to him, because he can probably get you situated and doing things correctly and fix any misunderstandings you may have.
@clanzu2 2 months ago
is it me or all the comments look like mockery 😂😂if so i am loving it
@subhasishsarkar5106 6 months ago
What I absolutely love about your videos is that as a beginner in the data engineering field, you often talk about things that I had no conception of. In this video for example, I have never heard of SMBs or broadcast joins. This gives me an opportunity to learn these things, even hearing them be mentioned from someone as widely experienced as you. You need not necessarily have to even go into detail, but these short form videos act as beacons of knowledge that I can throw myself into learning about. Thanks a lot, and keep these coming Zach!
@EcZachly_ 6 months ago
Really appreciate this comment! It reminds me that the value I'm putting out there is important!
@vasudevreddy3527 6 months ago
@@EcZachly_ ✌
@eric.batdorff 6 months ago
Great summation! I was thinking the exact same thing while watching. It's nice hearing even the specialized lingo from technical experts in their fields, it piques my curiosity.
@MrAmitkr007 6 months ago
@@EcZachly_ thanks
@prawtism 6 months ago
@@EcZachly_ did you already know the importance of these two before Netflix or did you learn that while working at Netflix?