Developer: we use a 3 GB database to plot some dashboards with statistical information about customer behavior. Marketing team: we use big data, machine learning and artificial intelligence to analyze and predict customers' actions at any given time.
@NeinStein 3 years ago
But do you use blockchain?
@Dontcaredidntask-q9m 3 years ago
Hahaha, this is so true
@klontjespap 3 years ago
exactly this.
@laurendoe168 3 years ago
Not quite... "Marketing team: we use big data, machine learning and artificial intelligence to MANIPULATE customers' actions at any given time."
@johnjamesbaldridge867 3 years ago
@@NeinStein With homomorphic encryption, no less!
@RealityIsNot 3 years ago
The problem with the term "big data" is that it went from technical jargon to marketing jargon... and marketing departments don't care what a word means; they create their own meaning 😀. Other examples include AI and ML.
@kilimanjarocruz660 3 years ago
100% this!
@Piktogrammdd1234 3 years ago
Cyber!
@landsgevaer 3 years ago
ML seems pretty well defined to me...? For AI and BD, I agree.
@danceswithdirt7197 3 years ago
Big data is two words. ;)
@nverwer 3 years ago
More examples: exponential, agile, ...
@letsburn00 3 years ago
I never realised just how much information there was to store until I tried downloading half a decade of satellite images from a single satellite at fairly low resolution. It was a quarter of a terabyte per channel, and the satellite was producing over a dozen channels. Then I had to process it all...
@isaactriguero3155 3 years ago
What did you use to process it?
@letsburn00 3 years ago
@@isaactriguero3155 Python. I started with OpenCV (cv2) to convert the images to NumPy arrays, then worked with NumPy. But it was taking forever until I learnt about Numba. Numba plus purely-typed NumPy arrays is astonishingly effective compared to pure Python; I'll never look back now that I'm used to Numba. I also need my work to integrate with TensorFlow, so Python fits well there.
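To make the Numba point concrete, here is a minimal sketch of the pattern being described; the function name, array shapes and random data are purely illustrative stand-ins for real satellite channels, not anything from the comment:

```python
import numpy as np
from numba import njit

@njit  # compiled to machine code on first call
def band_mean(frames):
    # frames: 3-D array (time, height, width) of pixel values
    out = np.empty(frames.shape[0])
    for t in range(frames.shape[0]):
        out[t] = frames[t].mean()  # explicit loops are cheap under Numba
    return out

frames = np.random.rand(100, 512, 512).astype(np.float32)
means = band_mean(frames)  # first call pays the compile cost; later calls are fast
```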
@gherbihicham8506 3 years ago
@@letsburn00 Yeah, that's still not big data, since you're presumably running Python on a single node. If the data is coming in real time and needs to be processed instantly, then you need streaming tools like Apache Kafka. For it to be stored and analysed/mined, it needs special storage and processing engines like Hadoop, NoSQL stores and Spark; you rarely use traditional RDBMSs as stores unless they're special enterprise-level appliances like Teradata, Greenplum or the Oracle appliances. Data processed with traditional methods, i.e. single-node machines and ordinary programming-language libraries, isn't a big data problem. Many people get confused because they think big volumes of data are big data.
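As a rough illustration of the distributed-processing side of that comment, a minimal PySpark sketch; the cluster address, path and column names are placeholders, not anything from the thread:

```python
from pyspark.sql import SparkSession

# placeholder cluster address; for a single machine you could use .master("local[*]")
spark = (SparkSession.builder
         .master("spark://cluster-master:7077")
         .appName("channel-stats")
         .getOrCreate())

df = spark.read.parquet("hdfs:///satellite/frames/")  # placeholder path
df.groupBy("channel").avg("brightness").show()        # aggregation runs across the nodes
```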
@letsburn00 3 years ago
@@gherbihicham8506 Oh, I know it's not big data. "The Pile" is big data. This is just a tiny corner of the information that's available. But it was an interesting bit of perspective for me.
@thekakan 3 years ago
@@NoNameAtAll2 I switch back and forth between Python, R and Julia. I love all three! Julia is the fastest, but R and Python have far better support and are (usually) easier to develop in. When you need the most compute power, Julia it is! It's quite amazing.
@Jamiered18 3 years ago
It's interesting, because at my company we deal with petabytes of data. Yet I'm not sure you could call that "big data", because it's not complex and it doesn't require multiple nodes to process.
@ktan8 3 years ago
But you'll probably need multiple nodes to store petabytes of data?
@jackgilcrest4632 3 years ago
@@ktan8 Maybe only for redundancy
@Beyondthecustomary 3 years ago
@@ktan8 Large amounts of data are often stored in RAID for speed and redundancy.
@mattcelder 3 years ago
That's why big data isn't the same thing as "large volume". "Large" is subjective and largely dependent on your point in time. 30 years ago you could've said "my company deals with gigabytes of data" and that would've sounded ridiculously huge, like petabytes do today. But today we wouldn't call gigabytes big data. For the same reason, we shouldn't call petabytes "big data" unless there's more to it than sheer volume.
@AyushPoddar 3 years ago
@@Beyondthecustomary Not necessarily; most of the really large data I've seen stored (think PB) sits in a distributed file system like HDFS, which grew out of Google's GFS. RAID would provide redundancy and fault tolerance, but there's no hard drive I know of that can store a single 1 PB file, and it certainly wouldn't be "inexpensive" the way the I in RAID suggests.
@SlackWi 3 years ago
I work in bioinformatics and I would totally agree that "big data" is anything I have to run on our university cluster.
@urooj09 3 years ago
@you-tube Well, you have to study biology a bit. At my college, bioinformatics students take at least one semester of biology and then take further biology courses depending on what they want to code.
@kevinhayes6057 3 years ago
"Big data" is talked about everywhere now. Really great to hear an explanation of its fundamentals.
@AboveEmAllProduction 3 years ago
More like 10 years ago it was talked about a lot
@codycast 3 years ago
@@AboveEmAllProduction no, gender studies
@iammaxhailme 3 years ago
I used to work in computational chemistry... I had to use large GPU-driven compute clusters to run my simulations, but I wouldn't call it big data. I'd call it "a big calculator that crunches molecular dynamics for a week and then pops out a 2 MB results .txt file" lol
@igorsmihailovs52 3 years ago
Did you use network storage for MD? Because I was surprised to hear in this video that it's specific to big data. I'm doing computational chemistry now, but QC rather than MD.
@iammaxhailme 3 years ago
@@igorsmihailovs52 Not really. SSH into the massive GPU compute cluster, start the simulation, then SCP the result files (a few gigs at most) back to my own PC. Rinse and repeat.
@KilgoreTroutAsf 3 years ago
Coordinates are usually saved only once every few hundred steps, with the intermediate configurations being highly redundant and easy to reconstruct from the nearest snapshot. Because of that, MD files are typically not very large.
@mokopa 3 years ago
"If you're using Windows, that's your own mistake" INSTANT LIKE + FAVORITE
@leahshitindi8365 a year ago
We had a three-hour lecture with Isaac last month. It was very interesting.
@gubbin909 3 years ago
Would love to see some future videos on Apache Spark!
@recklessroges 3 years ago
Yes. There is so much more to talk about on this topic. I'd like to hear about Ceph and Tahoe-LAFS.
@isaactriguero3155 3 years ago
I am working on it :-)
@thisisneeraj7133 3 years ago
*Apache Hadoop enters the chat*
@albertosimeoni7215 3 years ago
It would be worth spending some words on Apache Druid too.
@sandraviknander7898 3 years ago
Freaky! I had this exact need for data locality on our cluster for the first time in my work this week.
@NoEgg4u 3 years ago
@3:23 "...the digital universe was estimated to be 44 zettabytes", and half of that is adult videos.
@Sharp931 3 years ago
*doubt*
@Phroggster 3 years ago
@@Sharp931 You're right, it's probably more like two-thirds.
@G5rry 3 years ago
The other half is cats
@lightspiritblix1423 3 years ago
I'm actually studying these concepts at college; this video could not have come at a more convenient time!
@TheMagicToyChest 3 years ago
Stay focused and diligent, friend.
@nandafprado 3 years ago
"If you are using Windows, that is your own mistake" ...well, that is the hard truth for data scientists lol
@sagnikbhattacharya1202 3 years ago
5:10 "If you're using Windows, that's your own mistake": truer words have never been spoken
@evilsqirrel 3 years ago
As someone who works on the more practical side of this field, it really is a huge problem to solve. I work with datasets where we feed in multiple terabytes per day, and making sure the infrastructure stays healthy is a huge undertaking. It's cool to see it broken down in a digestible manner like this.
@shiolei 3 years ago
Awesome simple explanation and diagrams. Loved this breakdown!
@jaffarbh 2 years ago
One handy trick is to reduce the number of "reductions" in a map-reduce task; in other words, more training, less validation. The downside is that this could mean the training converges more slowly.
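For anyone who hasn't met the map/reduce pattern that comment builds on, here is a toy single-machine sketch (word counting, with made-up input). Real frameworks run the same three phases in parallel across many nodes, and the shuffle before each reduce is the expensive, network-bound part that the comment suggests minimizing:

```python
from collections import defaultdict

docs = ["big data is big", "data is data"]

# map phase: emit one (word, 1) pair per word
mapped = [(word, 1) for doc in docs for word in doc.split()]

# shuffle phase: group the emitted values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# reduce phase: combine each key's values into a final result
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```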
@quanta8382 3 years ago
Take a drink every time they say "data" for the ultimate experience
@seancooper8918 3 years ago
We call this approach "Dealing With Big Drinking".
@chsyank 3 years ago
Interesting video. I worked on and designed big data in the early 1980s, building large databases for litigation... that was big at the time. Then, a few years later, building big data for shopping analysis. The key is that big data is big for the years you are working on it and not afterwards, as storage and processing get bigger and faster. I think that while analysis and reporting are important (otherwise there is no value to the data), designing and building proper ingestion and storage is just as important. My two cents from over 30 years of building big data.
@GloriousSimplicity 3 years ago
The industry is moving away from having long-term storage on compute nodes. Since data storage needs grow at a different rate than compute needs, the trend is to have a storage cluster and a compute cluster. This means applications start a bit slower, as the data must be transferred from the storage cluster to the compute cluster, but it allows for more efficient spending on commodity hardware.
@kellymoses8566 5 months ago
If you had the money and the need, you could fill a 1U server with 32 61.44 TB E1.L SSDs and then fill a rack with 40 of them, for a total of 78,643 TB of raw storage. Subtract 10% for redundancy, add 2x for dedupe/compression, and you get about 141,000 TB usable in one rack. Or an exabyte in 7 racks.
@nikhilPUD01 3 years ago
In a few years: "super big data."
@recklessroges 3 years ago
Probably not, as technology expands at a similar rate and the problem space doesn't change now that the cluster has replaced the previous "mainframe" (single computer) approach.
@Abrifq 3 years ago
hungolomghononoloughongous data structures :3
@BlueyMcPhluey 3 years ago
thanks for this; understanding how to deal with big data is one elective I didn't have time for in my degree
@Skrat2k 3 years ago
Big data - any data set that crashes Excel 😂
@godfather7339 3 years ago
Nah, Excel crashes at like 1 million rows; that's not much, actually...
@mathewsjoy8464 3 years ago
@@godfather7339 Actually, it is
@godfather7339 3 years ago
@@mathewsjoy8464 Trust me, it's not. It's not at all.
@mathewsjoy8464 3 years ago
@@godfather7339 Well, you clearly don't know anything; the expert in the video even said we can't define how big or small data needs to be to be big data
@godfather7339 3 years ago
@@mathewsjoy8464 I know what he defined, and I also know, PRACTICALLY, THAT 1 MILLION ROWS IS NOTHING.
@Veptis 3 years ago
At my university there is a master's programme in data science and artificial intelligence. It's something I might go into after finishing my bachelor's in computational linguistics. However, I do need to take additional maths courses, which I haven't looked into yet. Apparently the supercomputer at the university has the largest memory in all of Europe: 8 TB per node.
@kees-janhermans910 3 years ago
"Scale out"? What happened to "parallel processing"?
@malisa71 3 years ago
Didn't the meaning change a few years ago? Parallel processing is when machines work on the same problem, or parts of it, at the same time. Horizontal scaling is when you can add nodes that don't need to work on the same problem at the same time; only the results are merged at the end. But the meaning is probably industry-dependent.
@LupinoArts 3 years ago
"Big Data"? Did you mean games by Paradox Interactive?
@mikejohnstonbob935 3 years ago
Paradox created/published Crysis?
@glieb 3 years ago
VQGAN + CLIP image synthesis video in the works, I hope?? And suggest
@lookinforanick 3 years ago
Never seen a Numberphile video with so much innuendo 🤣
@myothersoul1953 3 years ago
It's not the size of your dataset that matters, nor how many computers you use or the statistics you apply; what matters is how useful the knowledge you extract is.
@Pedritox0953 a year ago
Great video!
@laurendoe168 3 years ago
I think the prefix after "yotta" should be "lotta" LOL
@danceswithdirt7197 3 years ago
So he's just building up to talking about data striping, right? (I'm at 13:30 right now.) Is that it, or am I missing something crucial?
@G5rry 3 years ago
Commenting on a video part-way through to ask a question. Do you expect an answer faster than just watching the video to the end first?
@danceswithdirt7197 3 years ago
@@G5rry No, I was predicting what the video was going to be about. I was mostly correct; I guess the two main concepts of the video were data striping and data locality.
@Goejii 3 years ago
44 ZB in total, so ~5 TB per person?
@busterdafydd3096 3 years ago
Yeah. We will each interact with about 5 TB of data in our lifetime, if you think about it deeply.
@ornessarhithfaeron3576 3 years ago
Me with a 4 TB HDD: 👁️👄👁️
@EmrysCorbin 3 years ago
Yeeeeeah, 15 of those TB are on this current PC and it still seems kinda limited.
@rickysmyth 3 years ago
Have a drink every time he says DATE-AH
@sabriath 3 years ago
Well, you went over scaling up and scaling out, but you missed scaling in. A big file that you're scanning through doesn't require enough memory to load the entire file; you can work through it in chunks, methodically. If you take that process and scale it out across the cluster, you end up with an automated way of manipulating data. Scale the allocation code across the RAID and you have automatic storage containment. Both together mean you don't have to worry about scale in any direction; it's all managed in the background for you.
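A minimal sketch of the "scaling in" idea from that comment: stream a file larger than RAM in fixed-size chunks rather than loading it whole (the function, statistic and chunk size are illustrative assumptions):

```python
def checksum(path, chunk_bytes=64 * 1024 * 1024):
    """Toy whole-file statistic computed one 64 MB chunk at a time."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)  # only one chunk is in memory at once
            if not chunk:                # empty bytes object means EOF
                break
            total = (total + sum(chunk)) % (1 << 64)
    return total
```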
@Georgesbarsukov 3 years ago
I prefer the strategy where I make everything super memory-efficient and then go do something else while it runs for a long time.
@quintrankid8045 3 years ago
How many people miss the days of Fortran overlays? Anyone?
@isaactriguero3155 3 years ago
not me haha
@Ascania 3 years ago
Big Data is the concerted effort to prove "correlation does not imply causation" wrong.
@bluegizmo1983 3 years ago
How many more buzzwords are you gonna cram into this interview? Big Data ✔️, Artificial Intelligence ✔️, Machine Learning ✔️.
@advanceringnewholder 3 years ago
Based on what I've watched up to 2:50, big data is the Tony Stark of data.
@joeeeee8738 3 years ago
I have worked with Redshift and then with Snowflake. Snowflake solved the problems Redshift had by storing all the data efficiently in central storage instead of on each machine. The paradigm is actually backwards now, as storage is cheap (the network is still the bottleneck).
@kellymoses8566 5 months ago
Big data is whatever takes a full rack of servers to store.
@MaksReveck 3 years ago
I think we can all agree that when you have to start using Spark instead of pandas to process your datasets, and save them as partitions rather than plain CSVs, then it's big data.
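That pandas-to-Spark threshold looks roughly like this in practice; a minimal sketch where the paths and the partition column are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# read the raw CSVs once...
df = spark.read.csv("s3://bucket/events/*.csv", header=True, inferSchema=True)

# ...and rewrite them as Parquet, split into one directory per date so later
# queries can skip the partitions they don't need
(df.write
   .partitionBy("event_date")
   .mode("overwrite")
   .parquet("s3://bucket/events_parquet/"))
```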
@kzuridabura8280 3 years ago
Try Dask sometime
@yashsvidixit7169 3 years ago
Didn't know Marc Márquez did big data as well
@phunkym8 3 years ago
The address of the visitation of Concepción Zarzal
@RizwanAli-jy9ub 3 years ago
We should store information and less data
@klaesregis7487 2 years ago
16 GB a lucky guy? That's like the bare minimum for a developer these days. I want 64 GB for my next upgrade in a year or so.
@Sprajt 3 years ago
Who buys more RAM when you can just download it? smh
@_BWKC 3 years ago
SoftRAM logic XD
@unl0ck998 3 years ago
That Spanish accent *swoon*
@forthrightgambitia1032 3 years ago
"Everyone is talking about big data" - was this video recorded 5 years ago?
@malisa71 3 years ago
Why? If you work in this industry you'll hear about it a few times a month.
@forthrightgambitia1032 3 years ago
@@malisa71 I haven't heard anyone where I work say it unironically for years. Maybe you're stuck working in some snake-oil consultancy though.
@malisa71 3 years ago
@@forthrightgambitia1032 This "consultancy" has been around for almost 100 years and is one of the top companies. I will gladly stay with them.
@forthrightgambitia1032 3 years ago
@@malisa71 Defensive, much?
@malisa71 3 years ago
@@forthrightgambitia1032 How is anything I wrote defensive?
@_..--- 3 years ago
44 zettabytes? Seems like the term "big data" doesn't do it justice anymore.
@advanceringnewholder 3 years ago
Weather data is big data, isn't it?
@VACatholic 3 years ago
No, it's tiny. There isn't that much of it (most weather data is highly localized and incredibly recent).
@dAntony1 3 years ago
As an American, I can hear both his Spanish and UK accents when he speaks. Sometimes in the same sentence.
@leovalenzuela8368 3 years ago
Haha, I was just about to post that! It is fascinating hearing him slip back and forth between his native and adopted accents.
@isaactriguero3155 3 years ago
haha, this is very interesting! I don't think anyone here in the UK would hear my 'UK' accent haha
@serversurfer6169 3 years ago
I totally thought this video was gonna be about regulating Google and AWS… 🤓🤔😜
@Thinkingfeed 3 years ago
Apache Spark rulez!
@COPKALA 3 years ago
NICE: "if you use Windows, it's your own mistake"!!
@maschwab63 3 years ago
If you need 200+ servers, just run it on an IBM z server as a plain-Jane compute task all by itself.
@malisa71 3 years ago
Did you look at the pricing of IBM z? My company is actively working on moving to LUW, and we are not small.
@guilherme5094 3 years ago
Nice.
@khronos142 3 years ago
"smart data"
@grainfrizz 3 years ago
Rust is great
@shemmo 3 years ago
He uses the word "data" so much that I only hear data 🤔🤣
@isaactriguero3155 3 years ago
hahah, funny how repetitive one can become when doing this kind of video! hehe, sorry :-)
@shemmo 3 years ago
True, true :) But I like his explanation.
@jeffatturbofish 3 years ago
Here is my biggest problem with all the definitions of "big data" that require multiple computers: what if it only requires multiple computers because the person "analyzing" it doesn't know how to deal with large data efficiently? Quality of data? I'll just use SQL/SSIS to cleanse it. I normally deal with data in the multiple-TB range on either my laptop (not a typical laptop: 64 GB of RAM) or my workstation (again, perhaps not a normal computer, with 7 hard drives, mostly SSDs, 128 GB of RAM and a whole lot of cores) and can build an OLAP cube from the OLTP store in minutes, then run more code doing deeper analysis in a few minutes more. If it takes more than 30 minutes, I know that I screwed something up. If you have to run it on multiple servers, maybe you also messed something up. Python is great for the little stuff (less than 1 GB), and so is R, but for big data you need to work with something that can handle it. I have "data scientist" friends with degrees from MIT who couldn't handle simple SQL and would freak out if they had more than a couple of MB of data to work with. Meanwhile, I'd handle TBs of data in less time with SQL, SSIS, OLAP and MDX. Yeah, those are the dreaded Microsoft words.
@albertosimeoni7215 3 years ago
In an enterprise environment you have other problems to handle: availability, achieved with redundancy of VMs and disks over the network (which adds huge latency). SSIS is considered a toy in big enterprises; others use ODI or BODS (SAP), which are more robust. The natural evolution of SSIS, sold as "cloud" and "big data", is Azure Data Factory, but its cost is the highest of any competitor (you pay for every task you run rather than for the time the machine is on).
@lucaspelegrino1 3 years ago
I want to see some Kafka.
@AboveEmAllProduction 3 years ago
Take a hit every time he says "right"
@yfs9035 3 years ago
Where'd the British guy go? What did you do with him?! Who is this guy?! Sorry, I haven't even watched the video yet.
@KidNickles 3 years ago
Do a video on RAID storage! With all this talk about big data and storage, I would love some videos on RAID 5 and parity drives!
@austinskylines 3 years ago
IPFS
@drdca8263 3 years ago
Do you use it? I think it's cool, but currently it competes a bit with my too-large number of tabs, and since I don't get much use from actively running it, I generally don't keep it running. I guess that's maybe just because I haven't put in the work to find a use that fits my use cases?
@AxeJamie 3 years ago
I want to know what the largest dataset is...
@recklessroges 3 years ago
Depends on how you define the set. The LHC has one of the largest data bursts, but the entire Internet could be considered a single distributed cluster...
@quintrankid8045 3 years ago
Largest amount of data in bits = (number of atoms in the universe - number of atoms required to keep you alive) / number of atoms required to store and process each bit(*)
(*) Assumes that all atoms are equally useful for storing and processing data and for keeping you alive. Also assumes that all the data needs to persist. The number of atoms required to keep you alive may vary by individual and by requirements for food, entertainment and socialization. All calculations require integer results. Please consult with a quantum mechanic before proceeding.
@JimLeonard 3 years ago
Nearly two million subscribers, but still can't afford a tripod.
@treyquattro 3 years ago
So I'm screaming "MapReduce" (well, OK, internally screaming) and at the very end of the video we get there. What a tease!
@isaactriguero3155 3 years ago
there is another video explaining MapReduce! And I am planning to do some live-coding videos in Python.
@DorthLous 3 years ago
"1 gig of data". Look at my job as a dev. Look at my entertainment: games on Steam and videos. Yeaaaahhh...
@llortaton2834 3 years ago
He still misses dad to this day
@Random2 3 years ago
Ehm... it is very weird that scale in/out and scale up/down are discussed in terms of big data, when those concepts are completely independent and predate big data as a whole... Having watched the entire video, this might be one of the least well-delineated videos on the entire channel. It mixes parts of different concepts together as if they all came from big data, or are all related to big data, while failing to address the historical origins of big data and map/reduce. Definitely below average for Computerphile.
@DominicGiles 3 years ago
There's data.... That's it...
@AudioPervert1 3 years ago
Not everyone is talking about big data 😭😭😭😂😂😂 These big data dudes never speak of the pollution, contamination and carbon generated by their marvellous technology. Big data could do nothing about the pandemic, for example...
@isaactriguero3155 3 years ago
well, I briefly mentioned the problem of sustainable big data, and I might be able to put together a video about this. You're right that not many people seem to care much about the amount of resources a big data solution may use! This is where we should be going in research: trying to develop cost-effective AI, which only uses big data technology when strictly needed and when it is useful.
@lowpasslife 3 years ago
Cute accent
@NeThZOR 3 years ago
420 views... I see what you did there
@thekakan 3 years ago
Big data is data we don't know what we can do with _yet_ 😉 ~~lemme have my fun~~ 6:08 when can we buy Computerphile GPUs? 🥺
@pdr. 3 years ago
This video felt more like marketing than education, sorry. Surely you just use whatever solution is appropriate for your problem, right? Get that hammer out of your hand before fixing the squeaky door.
@syntaxerorr 3 years ago
DoN'T UsE WinDOws... Linux: Let me introduce you to the OOM killer.
@kevinbatdorf 3 years ago
What? Buying more memory is cheaper than buying more computers… which just means you're throwing more memory and CPU at it. I think you meant you solve it by writing a slower algorithm that uses less memory as the alternative. Also, buying more memory is often cheaper than the labor cost of refactoring, especially when it comes to distributed systems. And why the Windows hate? I don't use Windows, but I still cringed a bit there.
@malisa71 3 years ago
Time is money and nobody wants to wait for results. The solution is to make fast and efficient programs with proper memory utilisation. Almost no serious institution uses Windows for such tasks; maybe on the client side, but not on a node or server.
@yukisetsuna1325 3 years ago
first
@darraghtate440 3 years ago
The bards shall sing of this victory in the annals of time.
@vzr314 3 years ago
No. Everyone is talking about COVID. And I listened to him until he mentioned COVID in the first few minutes. Enough of the broken English anyway.