I built data pipelines at Netflix that ran 2000 TBs per day, here’s what I learned about huge data!

415,466 views

Days ago

Check out my academy at www.DataExpert.io where you can learn all this in much more detail!
You can use code ZACH15 to get 15% off!
#dataengineering
#netflix

Comments: 390

@sevrantw8931 6 months ago

I’m so glad I found this video, I was just sitting here with 60 million gigabytes and was figuring out what joins to use so this was perfect timing.

@aripapas1098 6 months ago

if all u registered was 60 mil gb & joins ur not flowing

@smackastan5697 6 months ago

You're kidding, but somehow I just started a data analysis project of two terabytes and this video shows up.

@hi-mn5rg 6 months ago

@@aripapas1098 if you think comments must indicate a user registered every aspect of a video, ur not following

@derickd6150 6 months ago

@@aripapas1098 this is a sad comment

@00Tenrai00 6 months ago

Sarcasm ???? 😂

@bilbobeutlin3405 6 months ago

Can't wait to build hyperscale pipelines for my startup with 0 users

@92kosta 6 months ago

But it sounds powerful when you say it, like you mean business.

@npc-drew 6 months ago

Based

@vikingthedude 6 months ago

1 user (me)

@JGComments 6 months ago

If you build it, they will come.

@abhilashpatel6852 6 months ago

I have 1k TB of data just sitting around in my backyard. Glad your video came up to get me started on at least something.

@supercompooper 6 months ago

In the future a wrist watch will have a little blinking light that will have 60 million gigabytes of data in it

@dhillaz 6 months ago

You mean an Electron app?

@aripapas1098 6 months ago

yeah okay crack smoker

@mrevilducky 6 months ago

And it will still lag and hit 99% singularities

@Ivan-Bagrintsev 6 months ago

@@dhillaz that will just show current time

@supercompooper 5 months ago

@@Ivan-Bagrintsev Yes it will show the time, but with full DRM. Unless you have a license to view certain minutes it will be denied.

@supafiyalaito 6 months ago

Thanks Zach, hopefully one day I will understand what all of that means

@mu3076 2 months ago

😂😂😂, I’m starting now

@lucas.p.f 6 months ago

Boyfriend simulator: you sit with your bf and he starts talking about this nerdy stuff you have no idea about but need to keep listening because you love him

@EcZachly_ 5 months ago

This is exactly correct

@CU.SpaceCowboy 5 months ago

aww 🥰

@heykike 4 months ago

After marriage they no longer pretend to listen

@rajns8643 4 months ago

If only a girl would fall for me when I speak nerdy stuff 🫠

@lucas.p.f 4 months ago

@@rajns8643 are you kidding me? This is what most people like the most! Intelligent people are extremely attractive

@RichardOles 6 months ago

Holy crap. I’m currently learning about data science, the various roles, etc., with the hope of one day switching careers. But the current state of learning is all about the languages and software used, etc., not about the infrastructure and what to do with massive datasets. So this just 🤯

@samuelisaacs7557 4 months ago

It's really about math, but no one talks about it. Get at least 1 year of university math comprehension and then get into the Python and tech tools. The most competent and successful data engineers are always people with a good STEM background. For example, Zach has a Bachelor's Degree in Applied Mathematics and a Bachelor's Degree in Computer Science, so he is a heavy numbers guy. That's what most Data Science / Engineering YouTubers don't tell their viewers, because that would cause them to lose viewers.

@byRoyalty 4 months ago

learning the tools can be very different from solving real world problems.

@rajns8643 4 months ago

@@samuelisaacs7557 True asf

@stevess7777 4 months ago

@@samuelisaacs7557 Yep, even a business administration bachelor's will have a lot of maths, and it's nowhere near data science, which is 3x that.

@JGComments 6 months ago

2 pita bites a day, the same as me when I’m on a diet.😊

@ArjunRajaS 5 months ago

If you come across a scenario where you need to join 2 large datasets, you could do an iterative broadcast join. Basically, you break one of the dfs into multiple smaller dfs and join each of them against the other dataframe in a loop until all of the pieces have been joined.

@jordanmessec5332 5 months ago

You’ll require a lot of memory and have long start times, no?
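
A minimal sketch of the iterative broadcast join described in this thread, written in plain Python rather than Spark so it runs anywhere. The function names (`chunked`, `iterative_broadcast_join`), the sample rows, and the `chunk_size` parameter are assumptions made for illustration; none of them come from the video or from any Spark API.

```python
from typing import Iterator


def chunked(rows: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `rows`, each holding at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def iterative_broadcast_join(big: list, medium: list, key: str, chunk_size: int) -> list:
    """Inner-join `big` with `medium` by broadcasting `medium` one chunk at a time."""
    joined = []
    for chunk in chunked(medium, chunk_size):
        # Build an in-memory lookup for this chunk only (the "broadcast" side).
        lookup = {}
        for row in chunk:
            lookup.setdefault(row[key], []).append(row)
        # Scan the large side once per chunk; it is never shuffled or re-sorted.
        for row in big:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined


# Hypothetical sample data.
requests = [
    {"user_id": 1, "bytes": 10},
    {"user_id": 2, "bytes": 20},
    {"user_id": 3, "bytes": 30},
]
users = [
    {"user_id": 1, "country": "US"},
    {"user_id": 3, "country": "SE"},
    {"user_id": 4, "country": "BR"},
]

result = iterative_broadcast_join(requests, users, key="user_id", chunk_size=2)
```

On the memory question in the reply: in this sketch only one chunk's lookup is held in memory at a time, and the price is that the large side is scanned once per chunk. How that trade-off plays out on a real cluster depends on the engine and the data.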

@oakleyorbit 3 months ago

Half of what you said I had no idea what you were talking about but I was very engaged and now I’m gonna look all this stuff up for centering my div!

@naraendrareddy273 12 days ago

As a guy struggling to get a job because entry-level roles require experience, I have learned something new and valuable today: broadcast and SMB joins.

@rohanbhakat2922 6 months ago

Thanks for the info, Zach. Could you please make a more detailed video on the SMB join?

@dazzassti 6 months ago

In the 37 years I’ve been working in data, I’ve never heard anyone call it Peter 😂. PETA

@anotherguy9402 5 months ago

What's wrong with a Peter bite?

@divinecomedian2 5 months ago

Heya Peeda

@Starmast3rmusic 5 months ago

Could be an accent or a slip 😂

@nikolagrkovic8769 5 months ago

The amount of knowledge you shared here is astonishing

@john_paul 5 months ago

I love how you acronym Sorted Bucket Merge as SMB. Think you may have had Super Mario Bros on the mind 😂

@remo 4 months ago

Damn I just wanted to shuffle like there’s no tomorrow and then I found this video.

@SamCyanide 6 months ago

My medical science clients called, they need an 800tb imaging data set parsed by end of day (thank you kubernetes)

@vikrampandit2174 6 months ago

Never thought broadcast join is a Netflix saviour

@Adhanks91 6 months ago

Informative and straight to the point, great stuff as usual

@picdu2891 4 months ago

I love technology and I know more than your average user, yet I have no IT qualifications and I am light years away from this knowledge, but for some reason, I love watching these videos as if I was ever going to use the information 😂

@OurNewestMember 4 months ago

Interesting! I would have thought something like sharding (or partitioning and clustering) so data processing and access can scale horizontally.

@EcZachly_ 4 months ago

Bucketing and clustering are similar
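
For readers new to the term, bucketing can be sketched as hashing the join key into a fixed number of buckets and keeping each bucket sorted. The snippet below is a plain-Python illustration under that assumption; `bucket_of`, `bucketize`, the sample rows, and the choice of `zlib.crc32` as the hash are all hypothetical and are not how Spark or Iceberg implement it.

```python
import zlib


def bucket_of(key: str, num_buckets: int) -> int:
    """Map a key to a bucket number with a hash that is stable across runs."""
    # Python's built-in hash() is randomized per process for strings,
    # so a deterministic checksum is used here instead.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows: list, key: str, num_buckets: int) -> list:
    """Split rows into `num_buckets` lists and sort each list by `key`."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(str(row[key]), num_buckets)].append(row)
    for bucket in buckets:
        bucket.sort(key=lambda r: r[key])
    return buckets


# Hypothetical sample tables bucketed on the same key with the same bucket count.
events = [{"ip": ip, "hits": n} for n, ip in enumerate(["10.0.0.3", "10.0.0.1", "10.0.0.2", "10.0.0.1"])]
owners = [{"ip": "10.0.0.1", "team": "edge"}, {"ip": "10.0.0.2", "team": "core"}]

event_buckets = bucketize(events, "ip", num_buckets=4)
owner_buckets = bucketize(owners, "ip", num_buckets=4)
```

Because both tables use the same key, hash, and bucket count, a row in bucket `i` of one table can only match rows in bucket `i` of the other, so each pair of buckets can be joined independently.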

@xasm83 5 months ago

my data pipeline usually processes one pitabyte every other day and one shawarmabyte every week

@JT-zb6vi 6 months ago

instant subscribe - really appreciate the concise explanation and clear examples

@Settings404 5 months ago

I love that I’m only a software engineer but I can understand all of this

@ATX_Engineer 3 months ago

Ah yes, data structures and sorting… but with the “can you even scale bro” tick enabled.

@souravghosh358 6 months ago

Very important concept in such short time.. thank u so very much ❤

@motonoob-i2d 6 months ago

That's cool bro. Will it fix the Netflix app where it shows the title of one show but the preview and description of another?

@EcZachly_ 6 months ago

It was to look at network traffic to keep your credit card data secure

@chrism3790 4 months ago

What engine were you using to do these massive joins? Spark?

@EcZachly_ 4 months ago

Yep!

@cry2love 5 months ago

I still bite my gigas when my man hustling meta in peta

@aarjunpp 5 months ago

1. Are you a data engineer? 2. What tech is this? AWS, Snowflake?

@tanujkhochare3498 6 months ago

Hey Zach, your content is consistently amazing! As a newcomer to the field, I'm considering diving into data engineering. What roadmap would you recommend, and are there any certifications that could enhance my journey? I already have a solid grasp of Python and SQL in data analysis.

@uwize5897 5 months ago

optimizing selling personal data to minimize cost is something i never thought about

@ungeschaut 6 months ago

I use just a database with just value as field (long string) and nothing else

@twitchizle 2 months ago

I really wonder how netflix achieves 100tb/hr just with only streaming videos.

@theAnupamAnandhelueene 6 months ago

: multiple streams across entire ddrs directly accessible

@Dmytro-kt3fr 4 months ago

Would you say that using bucketing, which basically constrains you to an “acceptable” throughput and risks creating a gazillion files in the process, is a more acceptable approach than more ad hoc ones like Z-ordering and Bloom filters?

@emerald42481 4 months ago

Very useful and interesting, even to a layman

@Dmytro-kt3fr 4 months ago

smb part was actually useful

@MrFraiche 3 months ago

How do you get a job in this field? Were you in software engineering?

@bruceleehiiiyaaa 5 months ago

middle out compression

@TheGoodContent37 4 months ago

Love the way you tried to make it sound more complicated than it actually is and failed.

@YishuaiLiu 6 months ago

Short and informative

@EcZachly_ 6 months ago

Thank you! What other videos would you like to see from me?

@GnomeEU 3 months ago

Now I just need a billion dollar company to have these kinda problems. My question would be, why do you have a table that big? Can't you distribute or cluster your data? I'm thinking like 10,000 users per server. Only stuff around those 10k users gets stored. No magic needed to query stuff.

@EcZachly_ 3 months ago

Gotta analyze it all together though

@ChessFlix 6 months ago

Petabyte was misspelled. Great video though.

@edisco3643 4 months ago

Can you get a tripod for your cam?

@andreas1989 2 months ago

Hey Data with Zach, I have some questions. So Netflix uses AWS servers all over the world. I am wondering, how many GB is each 4K movie or 1080p movie? :) And what audio mix do they have: Dolby Atmos, DTX, etc.? :) Have a good day. Love from Sweden :)

@EcZachly_ 2 months ago

For serving videos they use OpenConnect and CloudFront, not AWS servers. This allows them to serve the video from the closest regional spot to you. Almost all videos can be served in 4K, but they are downsampled depending on the current network conditions.

@dexnow 3 months ago

I suddenly feel like pita bread...

@brandonheaton6197 4 months ago

He is channeling a young William Benney over here isn't he

@caioreis350 6 months ago

Wait, why is ordering a table and then joining it more efficient? Why have I never heard of this technique before? Well, guess it's time to do some digging

@coding3438 6 months ago

Ordering a table on the join keys. That’s because for each key in one dataset, the entire other dataset doesn’t have to be scanned.

@bbbbbbao 6 months ago

As he said, whenever there is shuffling involved, performance gets really poor. You can try doing some computation in Spark/Dask against the NYC taxi dataset.

@sepg5084 6 months ago

Because binary search becomes more efficient the bigger your dataset is, and binary search only works when the tables are sorted on your search keys. It also depends on the sorting algorithm.

@sepg5084 6 months ago

@ayyleeuz4892 because you don't know why it's faster 😉
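
The point made in this thread, that sorting both sides on the join key avoids rescanning one dataset for every key of the other, can be sketched with a plain-Python merge join. This is a minimal illustration that assumes both inputs are already sorted by the key; `merge_join` and the sample rows are hypothetical names, not code from the video. It walks both sides with two cursors rather than using binary search, which is the usual shape of a sort-merge join.

```python
def merge_join(left: list, right: list, key: str) -> list:
    """Inner-join two lists of dicts that are ALREADY sorted by `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1  # left key is too small, advance the left cursor
        elif lk > rk:
            j += 1  # right key is too small, advance the right cursor
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row holding this key with that run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


# Hypothetical sample data, both sides sorted by "k".
left_rows = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}, {"k": 2, "a": "z"}, {"k": 4, "a": "w"}]
right_rows = [{"k": 2, "b": 1}, {"k": 2, "b": 2}, {"k": 3, "b": 3}, {"k": 4, "b": 4}]

pairs = merge_join(left_rows, right_rows, "k")
```

Each cursor only ever moves forward, so once the data is sorted the join is a single pass over both sides plus the matching pairs it emits.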

@kali786516 4 months ago

Did you use Spark or Hive at Netflix to process 2000 TB per day?

@EcZachly_ 4 months ago

Spark

@kali786516 4 months ago

@@EcZachly_ Spark batch, streaming, or structured streaming job? Do you have the parameters that were passed handy, like number of executors, etc.?

@EcZachly_ 4 months ago

@@kali786516 hourly batch

@kali786516 4 months ago

@@EcZachly_ Do you mind sharing the Spark parameters that you passed, if you have them handy? Probably a video might help

@EcZachly_ 4 months ago

@@kali786516 that video sounds painfully boring

@dark_lord98 6 months ago

Are those joins available in MySQL, or are they specific to the DBMS at Meta where you worked?

@juanbrekesgregoris4405 6 months ago

I think they're not available on MySQL because it's an OLTP database. Those joins are used for analytics

@jordanmessec5332 5 months ago

These are not database joins, they are processing joins. Frameworks such as Flink and Spark would leverage broadcasts. It basically boils down to a single coordinator instance that publishes a small, often changing dataset to all parallel processors. Usually used to enrich, prune, or map the main dataset.
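
The broadcast pattern described in this reply, where a small dataset is published to every parallel processor and used to enrich the main dataset, can be sketched in plain Python. The functions `join_partition` and `broadcast_join`, the partition layout, and the sample rows are assumptions for illustration only; a real engine such as Spark or Flink handles the distribution itself.

```python
def join_partition(partition: list, lookup: dict, key: str) -> list:
    """Enrich one partition of the large dataset using the broadcast lookup."""
    return [{**row, **lookup[row[key]]} for row in partition if row[key] in lookup]


def broadcast_join(partitions: list, small_rows: list, key: str) -> list:
    """Join every partition against the same small table, with no shuffle."""
    # Assumes `key` is unique in the small table, as in a dimension table.
    lookup = {row[key]: row for row in small_rows}
    # Each partition is processed on its own. On a cluster, every worker
    # would receive its own copy of `lookup` and run this step in parallel.
    return [join_partition(p, lookup, key) for p in partitions]


# Hypothetical sample data: a large table already split into two partitions.
request_partitions = [
    [{"req": 1, "country": "US"}, {"req": 2, "country": "SE"}],
    [{"req": 3, "country": "US"}, {"req": 4, "country": "ZZ"}],
]
regions = [{"country": "US", "region": "NA"}, {"country": "SE", "region": "EU"}]

enriched = broadcast_join(request_partitions, regions, "country")
```

The large dataset never moves between partitions here; only the small table is copied, which is why the approach stops working once the "small" side no longer fits in a worker's memory.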

@phitsf5475 5 months ago

The internet is not something you just dump something on, it's not a big truck. It's a series of tubes.

@maggiejetson7904 3 months ago

Honestly, 2000 TB per day isn't the problem. The problem is the cost and how much of the data is burst. If it is not burst it is pretty much always cheaper to do it in-house with your own hardware than to pay and rent the cloud to do it.

@sergeikulikov4412 6 months ago

You shouldn't write "s" in terabytes per hour, just TB/hr. "TBs/hr" looks like "terabyte*second / hour" 😅
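
For reference, the headline figure converts as follows. This is a small arithmetic sketch that assumes decimal (SI) prefixes, so 1 PB = 1000 TB and 1 TB = 1000 GB, and a perfectly even load over the day; the constant names are made up for the example.

```python
TB_PER_DAY = 2000          # figure from the video title
HOURS_PER_DAY = 24
SECONDS_PER_DAY = 24 * 60 * 60
TB_PER_PB = 1000           # decimal (SI) prefixes assumed
GB_PER_TB = 1000

tb_per_hour = TB_PER_DAY / HOURS_PER_DAY
pb_per_day = TB_PER_DAY / TB_PER_PB
gb_per_second = TB_PER_DAY * GB_PER_TB / SECONDS_PER_DAY

print(f"{tb_per_hour:.1f} TB/hr")    # 83.3 TB/hr
print(f"{pb_per_day:.1f} PB/day")    # 2.0 PB/day
print(f"{gb_per_second:.1f} GB/s")   # 23.1 GB/s
```

So 2000 TB/day averages a little over 83 TB/hr, the same order of magnitude as the roughly 100 TB/hr that other comments mention.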

@LaurentziueXtream 4 months ago

Can you help me get a job as a Data Analyst? I have certifications but employers never hire me

@HaitiSpaceAgency 6 months ago

Could some of these problems be addressed with a decentralized, p2p system?

@jordanmessec5332 5 months ago

That’s effectively the shuffle he’s saying to avoid. In that approach you’re shuffling the large dataset so that it collocates with its relevant piece of the small dataset.

@theactualslimshady 6 months ago

Please keep up the great content!

@rutabega306 6 months ago

Netflix uses AWS?

@92kosta 6 months ago

Well, who doesn’t?

@seansingh4421 3 months ago

Wth is in that data ? Like seriously i feel like most of that would be redundant shit since even a chemical plant can be run entirely on Excel without ever needing db involvement

@MrTechroundup 6 months ago

Does Netflix still use its own pipelines or an ETL solution like Fivetran?

@EcZachly_ 6 months ago

Netflix pays engineers $500k/year to write all the pipelines with their bare hands

@iloos7457 6 months ago

Hey, are you familiar with Cosmos DB from Azure? It's a DB like Mongo but claims to be able to scale infinitely... What are your thoughts on that?

@matthew.m.stevick 4 months ago

What’s a megabyte

@Settings404 5 months ago

This was so fucking interesting

@huntermacias2023 6 months ago

In 30 years our baseline laptops will have 8 Pita bytes of RAM

@nikonnikiforoff 6 months ago

If I shuffled all the words in this video, it would still sound the same to me.

@adamou02 5 months ago

That's a rizz !

@zb2747 4 months ago

Bro 100 TB an hour???? Yo whattt

@TLOGhx 6 months ago

Insanely valuable content

@rustychassis 6 months ago

Is a pitabyte a petabyte? Or just lots of bites of pita bread?

@NostraDavid2 6 months ago

Meanwhile I do big data with data sets that are 8TB or smaller. And we only use 1% or less of that data set xD

@DxWangZ 6 months ago

I don't quite understand why Netflix needs data pipelines.

@CharanGs-t4i 6 months ago

I don't have a single idea what he is talking about, I am just here to improve my YouTube algorithm

@explosivecl 6 months ago

Thanks for the video

@saisagar79 6 months ago

Wonder what exact data is that🧐

@EcZachly_ 6 months ago

Every network request Netflix receives

@v5q211 6 months ago

What was your tech stack?

@EcZachly_ 6 months ago

Spark and Iceberg fam

@Lkdahiya-fn4oy 6 months ago

“Personally like” - you’re not asked to marry a join type.

@jcald111 4 months ago

What database was used?

@EcZachly_ 3 months ago

Data lake, not database: S3 + Iceberg

@cesarfigueroa6119 5 months ago

tbh if youre dealing with this much data, it's likely a good problem to have 💰 💰

@jordanalloy178 5 months ago

If I have this guy as a mentor, I will be so grateful to God.

@EcZachly_ 5 months ago

I wish I could mentor everybody!

@jordanalloy178 5 months ago

@@EcZachly_ Job well done, sir. I'm a self-taught full stack developer. The journey is crazy, sir

@YannickSacherer 6 months ago

Hey, absolutely curious about the content you are doing. In my company we are working with dbt and Snowflake. I can't find a way to work with broadcast joins there. Do you see a possibility to replicate this process?

@EcZachly_ 6 months ago

Snowflake isn’t suitable for volumes >100 TB in my opinion. Clustering is an option in Snowflake that helps though

@Jonas89offsuit 6 months ago

What hoody is that?

@chrzonszcz323 6 months ago

Dude you look like Jon Hopkins

@LuckyGnom 6 months ago

Every day I'm shuffelin' 🪩

@nat.serrano 4 months ago

This guy earned his half a million salary. I tried to do this myself and failed

@good-questions 6 months ago

The knowledgeability is sexy

@parthmalik1 4 months ago

In a way that's not gonna make Jeff Bezos millions of dollars 😂😂

@sunnylulla4323 a month ago

He did S ## with that Pipeline....like using flashlight

@EcZachly_ a month ago

What makes you think that?

@udaysingh-wr2kw 6 months ago

I don't know anything about data science. Why am I watching this?

@sanjaybhatikar 5 months ago

2 pita bites 😂

@findmeinthecarpet 3 months ago

Wait? So you're making a table? 🤔🥴

@Fennecbutt 5 months ago

Pita bites, huh. Hope you got some hummus to go with that

@Goochigoop 6 months ago

This man’s got rizz. 😂

@rutabega306 6 months ago

No he's got genes

@udirt 6 months ago

I wanna hear about "everything I need to know about extreme high volume" AFTER you compare notes with people from various other organisations processing that much or lots more, that are outside your expertise, i.e. LLNL, CERN, etc., the chemical industry or any large government, and scaled-down ones too, where you recognise your lessons. Otherwise it's just an incomplete angle. Which would still be OK if you reflect on it. But telling everything people need to know? Nah. You can't, I can't, they can't.

@EcZachly_ 6 months ago

It’s a two minute video bro, relax

@FMJ777 5 months ago

Peterbytes Bucket joining manage shuffle and fcuk Jeff bezos is what I got out of this

@Chris-Christopher- 4 months ago

I know an electrician from Alberta and he says your numbers don't make any sense. He says you're either making these numbers up, or you're doing something really stupid. I think maybe you guys should reach out to him, because he can probably get you situated and doing things correctly and fix any misunderstandings you may have.

@clanzu2 2 months ago

is it me or all the comments look like mockery 😂😂if so i am loving it

@subhasishsarkar5106 6 months ago

What I absolutely love about your videos is that as a beginner in the data engineering field, you often talk about things that I had no conception of. In this video for example, I have never heard of SMBs or broadcast joins. This gives me an opportunity to learn these things, even hearing them be mentioned from someone as widely experienced as you. You need not necessarily have to even go into detail, but these short form videos act as beacons of knowledge that I can throw myself into learning about. Thanks a lot, and keep these coming Zach!

@EcZachly_ 6 months ago

Really appreciate this comment! It reminds me that the value I'm putting out there is important!

@vasudevreddy3527 6 months ago

@@EcZachly_ ✌

@eric.batdorff 6 months ago

Great summation! I was thinking the exact same thing while watching. It's nice hearing even the specialized lingo from technical experts in their fields, it piques my curiosity.

@MrAmitkr007 6 months ago

@@EcZachly_ thanks

@prawtism 6 months ago

@@EcZachly_ did you already know the importance of these two before Netflix or did you learn that while working at Netflix?