Developer: we use a 3 GB database to plot some dashboards with statistical information about customer behavior. Marketing team: we use big data, machine learning and artificial intelligence to analyze and predict customers' actions at any given time.
@NeinStein 3 years ago
But do you use blockchain?
@Dontcaredidntask-q9m 3 years ago
Hahaha, this is so true
@klontjespap 3 years ago
exactly this.
@laurendoe168 3 years ago
Not quite... "Marketing team: we use big data, machine learning and artificial intelligence to MANIPULATE customers' actions at any given time."
@johnjamesbaldridge867 3 years ago
@@NeinStein With homomorphic encryption, no less!
@RealityIsNot 3 years ago
The problem with the term "big data" is that it went from technical jargon to marketing jargon... and marketing departments don't care what a word means; they create their own meaning 😀. Other examples include AI and ML.
@kilimanjarocruz660 3 years ago
100% this!
@Piktogrammdd1234 3 years ago
Cyber!
@landsgevaer 3 years ago
ML seems pretty well defined to me...? For AI and BD, I agree.
@danceswithdirt7197 3 years ago
Big data is two words. ;)
@nverwer 3 years ago
More examples: exponential, agile, ...
@letsburn00 3 years ago
I never realised just how much information there was to store until I tried downloading half a decade of satellite images from a single satellite at fairly low resolution. It was a quarter of a terabyte per channel, and the satellite was producing over a dozen channels. Then I had to process it all...
@isaactriguero3155 3 years ago
What did you use to process it?
@letsburn00 3 years ago
@@isaactriguero3155 Python. I started with OpenCV (cv2) to convert the images to NumPy arrays, then worked with NumPy. But it was taking forever until I learnt about Numba. Numba plus purely-typed NumPy arrays is astonishingly effective compared to pure Python; I'll never look back now that I'm used to Numba. I also need my work to integrate with TensorFlow, so Python fits well there.
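To make the Numba point concrete, here is a minimal sketch of the pattern being described; the function name, array shapes and random data are purely illustrative stand-ins for real satellite channels, not anything from the comment:

```python
import numpy as np
from numba import njit

@njit  # compiled to machine code on first call
def band_mean(frames):
    # frames: 3-D array (time, height, width) of pixel values
    out = np.empty(frames.shape[0])
    for t in range(frames.shape[0]):
        out[t] = frames[t].mean()  # explicit loops are cheap under Numba
    return out

frames = np.random.rand(100, 512, 512).astype(np.float32)
means = band_mean(frames)  # first call pays the compile cost; later calls are fast
```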
@gherbihicham8506 3 years ago
@@letsburn00 Yeah, that's still not big data, since you're presumably running Python on a single node. If the data is coming in real time and needs to be processed instantly, then you need streaming tools like Apache Kafka. For it to be stored and analysed/mined, it needs special storage and processing engines like Hadoop, NoSQL stores and Spark; you rarely use traditional RDBMSs as stores unless they're special enterprise-level appliances like Teradata, Greenplum or the Oracle appliances. Data processed with traditional methods, i.e. single-node machines and ordinary programming-language libraries, isn't a big data problem. Many people get confused because they think big volumes of data are big data.
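As a rough illustration of the distributed-processing side of that comment, a minimal PySpark sketch; the cluster address, path and column names are placeholders, not anything from the thread:

```python
from pyspark.sql import SparkSession

# placeholder cluster address; for a single machine you could use .master("local[*]")
spark = (SparkSession.builder
         .master("spark://cluster-master:7077")
         .appName("channel-stats")
         .getOrCreate())

df = spark.read.parquet("hdfs:///satellite/frames/")  # placeholder path
df.groupBy("channel").avg("brightness").show()        # aggregation runs across the nodes
```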
@letsburn00 3 years ago
@@gherbihicham8506 Oh, I know it's not big data. "The Pile" is big data. This is just a tiny corner of the information that's available. But it was an interesting bit of perspective for me.
@thekakan 3 years ago
@@NoNameAtAll2 I switch back and forth between Python, R and Julia. I love all three! Julia is the fastest, but R and Python have far better support and are (usually) easier to develop in. When you need the most compute power, Julia it is! It's quite amazing.
@Jamiered18 3 years ago
It's interesting, because at my company we deal with petabytes of data. Yet I'm not sure you could call that "big data", because it's not complex and it doesn't require multiple nodes to process.
@ktan8 3 years ago
But you'll probably need multiple nodes to store petabytes of data?
@jackgilcrest4632 3 years ago
@@ktan8 Maybe only for redundancy
@Beyondthecustomary 3 years ago
@@ktan8 Large amounts of data are often stored in RAID for speed and redundancy.
@mattcelder 3 years ago
That's why big data isn't the same thing as "large volume". "Large" is subjective and largely dependent on your point in time. 30 years ago you could've said "my company deals with gigabytes of data" and that would've sounded ridiculously huge, like petabytes do today. But today we wouldn't call gigabytes big data. For the same reason, we shouldn't call petabytes "big data" unless there's more to it than sheer volume.
@AyushPoddar 3 years ago
@@Beyondthecustomary Not necessarily; most of the really large data I've seen stored (think PB) sits in a distributed file system like HDFS, which grew out of Google's GFS. RAID would provide redundancy and fault tolerance, but there's no hard drive I know of that can store a single 1 PB file, and it certainly wouldn't be "inexpensive" the way the I in RAID suggests.
@SlackWi 3 years ago
I work in bioinformatics and I would totally agree that "big data" is anything I have to run on our university cluster.
@urooj09 3 years ago
@you-tube Well, you have to study biology a bit. At my college, bioinformatics students take at least one semester of biology and then take further biology courses depending on what they want to code.
@kevinhayes6057 3 years ago
"Big data" is talked about everywhere now. Really great to hear an explanation of its fundamentals.
@AboveEmAllProduction 3 years ago
More like 10 years ago it was talked about a lot
@codycast 3 years ago
@@AboveEmAllProduction no, gender studies
@iammaxhailme 3 years ago
I used to work in computational chemistry... I had to use large GPU-driven compute clusters to run my simulations, but I wouldn't call it big data. I'd call it "a big calculator that crunches molecular dynamics for a week and then pops out a 2 MB results .txt file" lol
@igorsmihailovs52 3 years ago
Did you use network storage for MD? Because I was surprised to hear in this video that it's specific to big data. I'm doing computational chemistry now, but QC rather than MD.
@iammaxhailme 3 years ago
@@igorsmihailovs52 Not really. SSH into the massive GPU compute cluster, start the simulation, then SCP the result files (a few gigs at most) back to my own PC. Rinse and repeat.
@KilgoreTroutAsf 3 years ago
Coordinates are usually saved only once every few hundred steps, with the intermediate configurations being highly redundant and easy to reconstruct from the nearest snapshot. Because of that, MD files are typically not very large.
@mokopa 3 years ago
"If you're using Windows, that's your own mistake" INSTANT LIKE + FAVORITE
@leahshitindi8365 a year ago
We had a three-hour lecture with Isaac last month. It was very interesting.
@gubbin909 3 years ago
Would love to see some future videos on Apache Spark!
@recklessroges 3 years ago
Yes. There is so much more to talk about on this topic. I'd like to hear about Ceph and Tahoe-LAFS.
@isaactriguero3155 3 years ago
I am working on it :-)
@thisisneeraj7133 3 years ago
*Apache Hadoop enters the chat*
@albertosimeoni7215 3 years ago
It would be worth spending some words on Apache Druid too.
@sandraviknander7898 3 years ago
Freaky! I had this exact need for data locality on our cluster for the first time in my work this week.
@NoEgg4u 3 years ago
@3:23 "...the digital universe was estimated to be 44 zettabytes", and half of that is adult videos.
@Sharp931 3 years ago
*doubt*
@Phroggster 3 years ago
@@Sharp931 You're right, it's probably more like two-thirds.
@G5rry 3 years ago
The other half is cats
@lightspiritblix1423 3 years ago
I'm actually studying these concepts at college; this video could not have come at a more convenient time!
@TheMagicToyChest 3 years ago
Stay focused and diligent, friend.
@nandafprado 3 years ago
"If you are using Windows, that is your own mistake" ...well, that is the hard truth for data scientists lol
@sagnikbhattacharya1202 3 years ago
5:10 "If you're using Windows, that's your own mistake": truer words have never been spoken
@evilsqirrel 3 years ago
As someone who works on the more practical side of this field, it really is a huge problem to solve. I work with datasets where we feed in multiple terabytes per day, and making sure the infrastructure stays healthy is a huge undertaking. It's cool to see it broken down in a digestible manner like this.
@shiolei 3 years ago
Awesome simple explanation and diagrams. Loved this breakdown!
@jaffarbh 2 years ago
One handy trick is to reduce the number of "reductions" in a map-reduce task; in other words, more training, less validation. The downside is that this could mean the training converges more slowly.
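For anyone who hasn't met the map/reduce pattern that comment builds on, here is a toy single-machine sketch (word counting, with made-up input). Real frameworks run the same three phases in parallel across many nodes, and the shuffle before each reduce is the expensive, network-bound part that the comment suggests minimizing:

```python
from collections import defaultdict

docs = ["big data is big", "data is data"]

# map phase: emit one (word, 1) pair per word
mapped = [(word, 1) for doc in docs for word in doc.split()]

# shuffle phase: group the emitted values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# reduce phase: combine each key's values into a final result
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```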
@quanta8382 3 years ago
Take a drink every time they say "data" for the ultimate experience
@seancooper8918 3 years ago
We call this approach "Dealing With Big Drinking".
@chsyank 3 years ago
Interesting video. I worked on and designed big data in the early 1980s, building large databases for litigation... that was big at the time. Then, a few years later, building big data for shopping analysis. The key is that big data is big for the years you are working on it and not afterwards, as storage and processing get bigger and faster. I think that while analysis and reporting are important (otherwise there is no value to the data), designing and building proper ingestion and storage is just as important. My two cents from over 30 years of building big data.
@GloriousSimplicity 3 years ago
The industry is moving away from having long-term storage on compute nodes. Since data storage needs grow at a different rate than compute needs, the trend is to have a storage cluster and a compute cluster. This means applications start a bit slower, as the data must be transferred from the storage cluster to the compute cluster, but it allows for more efficient spending on commodity hardware.
@kellymoses8566 5 months ago
If you had the money and the need, you could fill a 1U server with 32 61.44 TB E1.L SSDs and then fill a rack with 40 of them, for a total of 78,643 TB of raw storage. Subtract 10% for redundancy, add 2x for dedupe/compression, and you get about 141,000 TB usable in one rack. Or an exabyte in 7 racks.
@nikhilPUD01 3 years ago
In a few years: "super big data."
@recklessroges 3 years ago
Probably not, as technology expands at a similar rate and the problem space doesn't change now that the cluster has replaced the previous "mainframe" (single computer) approach.
@Abrifq 3 years ago
hungolomghononoloughongous data structures :3
@BlueyMcPhluey 3 years ago
thanks for this; understanding how to deal with big data is one elective I didn't have time for in my degree
@Skrat2k 3 years ago
Big data - any data set that crashes Excel 😂
@godfather7339 3 years ago
Nah, Excel crashes at like 1 million rows; that's not much, actually...
@mathewsjoy8464 3 years ago
@@godfather7339 Actually, it is
@godfather7339 3 years ago
@@mathewsjoy8464 Trust me, it's not. It's not at all.
@mathewsjoy8464 3 years ago
@@godfather7339 Well, you clearly don't know anything; the expert in the video even said we can't define how big or small data needs to be to be big data
@godfather7339 3 years ago
@@mathewsjoy8464 I know what he defined, and I also know, PRACTICALLY, THAT 1 MILLION ROWS IS NOTHING.
@Veptis 3 years ago
At my university there is a master's programme in data science and artificial intelligence. It's something I might go into after finishing my bachelor's in computational linguistics. However, I do need to take additional maths courses, which I haven't looked into yet. Apparently the supercomputer at the university has the largest memory in all of Europe: 8 TB per node.
@kees-janhermans910 3 years ago
"Scale out"? What happened to "parallel processing"?
@malisa71 3 years ago
Didn't the meaning change a few years ago? Parallel processing is when machines work on the same problem, or parts of it, at the same time. Horizontal scaling is when you can add nodes that don't need to work on the same problem at the same time; only the results are merged at the end. But the meaning is probably industry-dependent.
@LupinoArts 3 years ago
"Big Data"? Did you mean games by Paradox Interactive?
@mikejohnstonbob935 3 years ago
Paradox created/published Crysis?
@glieb 3 years ago
VQGAN + CLIP image synthesis video in the works, I hope?? And suggest
@lookinforanick 3 years ago
Never seen a Numberphile video with so much innuendo 🤣
@myothersoul1953 3 years ago
It's not the size of your dataset that matters, nor how many computers you use or the statistics you apply; what matters is how useful the knowledge you extract is.
@Pedritox0953 a year ago
Great video!
@laurendoe168 3 years ago
I think the prefix after "yotta" should be "lotta" LOL
@danceswithdirt7197 3 years ago
So he's just building up to talking about data striping, right? (I'm at 13:30 right now.) Is that it, or am I missing something crucial?
@G5rry 3 years ago
Commenting on a video part-way through to ask a question. Do you expect an answer faster than just watching the video to the end first?
@danceswithdirt7197 3 years ago
@@G5rry No, I was predicting what the video was going to be about. I was mostly correct; I guess the two main concepts of the video were data striping and data locality.
@Goejii 3 years ago
44 ZB in total, so ~5 TB per person?
@busterdafydd3096 3 years ago
Yeah. We will each interact with about 5 TB of data in our lifetime, if you think about it deeply.
@ornessarhithfaeron3576 3 years ago
Me with a 4 TB HDD: 👁️👄👁️
@EmrysCorbin 3 years ago
Yeeeeeah, 15 of those TB are on this current PC and it still seems kinda limited.
@rickysmyth 3 years ago
Have a drink every time he says DATE-AH
@sabriath 3 years ago
Well, you went over scaling up and scaling out, but you missed scaling in. A big file that you're scanning through doesn't require enough memory to load the entire file; you can work through it in chunks, methodically. If you take that process and scale it out across the cluster, you end up with an automated way of manipulating data. Scale the allocation code across the RAID and you have automatic storage containment. Both together mean you don't have to worry about scale in any direction; it's all managed in the background for you.
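A minimal sketch of the "scaling in" idea from that comment: stream a file larger than RAM in fixed-size chunks rather than loading it whole (the function, statistic and chunk size are illustrative assumptions):

```python
def checksum(path, chunk_bytes=64 * 1024 * 1024):
    """Toy whole-file statistic computed one 64 MB chunk at a time."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)  # only one chunk is in memory at once
            if not chunk:                # empty bytes object means EOF
                break
            total = (total + sum(chunk)) % (1 << 64)
    return total
```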
@Georgesbarsukov 3 years ago
I prefer the strategy where I make everything super memory-efficient and then go do something else while it runs for a long time.
@quintrankid8045 3 years ago
How many people miss the days of Fortran overlays? Anyone?
@isaactriguero3155 3 years ago
not me haha
@Ascania 3 years ago
Big Data is the concerted effort to prove "correlation does not imply causation" wrong.
@bluegizmo1983 3 years ago
How many more buzzwords are you gonna cram into this interview? Big Data ✔️, Artificial Intelligence ✔️, Machine Learning ✔️.
@advanceringnewholder 3 years ago
Based on what I've watched up to 2:50, big data is the Tony Stark of data.
@joeeeee8738 3 years ago
I have worked with Redshift and then with Snowflake. Snowflake solved the problems Redshift had by storing all the data efficiently in central storage instead of on each machine. The paradigm is actually backwards now, as storage is cheap (the network is still the bottleneck).
@kellymoses8566 5 months ago
Big data is whatever takes a full rack of servers to store.
@MaksReveck 3 years ago
I think we can all agree that when you have to start using Spark instead of pandas to process your datasets, and save them as partitions rather than plain CSVs, then it's big data.
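That pandas-to-Spark threshold looks roughly like this in practice; a minimal sketch where the paths and the partition column are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# read the raw CSVs once...
df = spark.read.csv("s3://bucket/events/*.csv", header=True, inferSchema=True)

# ...and rewrite them as Parquet, split into one directory per date so later
# queries can skip the partitions they don't need
(df.write
   .partitionBy("event_date")
   .mode("overwrite")
   .parquet("s3://bucket/events_parquet/"))
```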
@kzuridabura8280 3 years ago
Try Dask sometime
@yashsvidixit7169 3 years ago
Didn't know Marc Márquez did big data as well
@phunkym8 3 years ago
The address of the visitation of Concepción Zarzal
@RizwanAli-jy9ub 3 years ago
We should store information and less data
@klaesregis7487 2 years ago
16 GB a lucky guy? That's like the bare minimum for a developer these days. I want 64 GB for my next upgrade in a year or so.
@Sprajt 3 years ago
Who buys more RAM when you can just download it? smh
@_BWKC 3 years ago
SoftRAM logic XD
@unl0ck998 3 years ago
That Spanish accent *swoon*
@forthrightgambitia1032 3 years ago
"Everyone is talking about big data" - was this video recorded 5 years ago?
@malisa71 3 years ago
Why? If you work in this industry you'll hear about it a few times a month.
@forthrightgambitia1032 3 years ago
@@malisa71 I haven't heard anyone where I work say it unironically for years. Maybe you're stuck working in some snake-oil consultancy though.
@malisa71 3 years ago
@@forthrightgambitia1032 This "consultancy" has been around for almost 100 years and is one of the top companies. I will gladly stay with them.
@forthrightgambitia1032 3 years ago
@@malisa71 Defensive, much?
@malisa71 3 years ago
@@forthrightgambitia1032 How is anything I wrote defensive?
@_..--- 3 years ago
44 zettabytes? Seems like the term "big data" doesn't do it justice anymore.
@advanceringnewholder 3 years ago
Weather data is big data, isn't it?
@VACatholic 3 years ago
No, it's tiny. There isn't that much of it (most weather data is highly localized and incredibly recent).
@dAntony1 3 years ago
As an American, I can hear both his Spanish and UK accents when he speaks. Sometimes in the same sentence.
@leovalenzuela8368 3 years ago
Haha, I was just about to post that! It is fascinating hearing him slip back and forth between his native and adopted accents.
@isaactriguero3155 3 years ago
haha, this is very interesting! I don't think anyone here in the UK would hear my 'UK' accent haha
@serversurfer6169 3 years ago
I totally thought this video was gonna be about regulating Google and AWS… 🤓🤔😜
@Thinkingfeed 3 years ago
Apache Spark rulez!
@COPKALA 3 years ago
NICE: "if you use Windows, it's your own mistake"!!
@maschwab63 3 years ago
If you need 200+ servers, just run it on an IBM z server as a plain-Jane compute task all by itself.
@malisa71 3 years ago
Did you look at the pricing of IBM z? My company is actively working on moving to LUW, and we are not small.
@guilherme5094 3 years ago
Nice.
@khronos142 3 years ago
"smart data"
@grainfrizz 3 years ago
Rust is great
@shemmo 3 years ago
He uses the word "data" so much that I only hear data 🤔🤣
@isaactriguero3155 3 years ago
hahah, funny how repetitive one can become when doing this kind of video! hehe, sorry :-)
@shemmo 3 years ago
True, true :) But I like his explanation.
@jeffatturbofish 3 years ago
Here is my biggest problem with all the definitions of "big data" that require multiple computers: what if it only requires multiple computers because the person "analyzing" it doesn't know how to deal with large data efficiently? Quality of data? I'll just use SQL/SSIS to cleanse it. I normally deal with data in the multiple-TB range on either my laptop (not a typical laptop: 64 GB of RAM) or my workstation (again, perhaps not a normal computer, with 7 hard drives, mostly SSDs, 128 GB of RAM and a whole lot of cores) and can build an OLAP cube from the OLTP store in minutes, then run more code doing deeper analysis in a few minutes more. If it takes more than 30 minutes, I know that I screwed something up. If you have to run it on multiple servers, maybe you also messed something up. Python is great for the little stuff (less than 1 GB), and so is R, but for big data you need to work with something that can handle it. I have "data scientist" friends with degrees from MIT who couldn't handle simple SQL and would freak out if they had more than a couple of MB of data to work with. Meanwhile, I'd handle TBs of data in less time with SQL, SSIS, OLAP and MDX. Yeah, those are the dreaded Microsoft words.
@albertosimeoni7215 3 years ago
In an enterprise environment you have other problems to handle: availability, achieved with redundancy of VMs and disks over the network (which adds huge latency). SSIS is considered a toy in big enterprises; others use ODI or BODS (SAP), which are more robust. The natural evolution of SSIS, sold as "cloud" and "big data", is Azure Data Factory, but its cost is the highest of any competitor (you pay for every task you run rather than for the time the machine is on).
@lucaspelegrino1 3 years ago
I want to see some Kafka.
@AboveEmAllProduction 3 years ago
Take a hit every time he says "right"
@yfs9035 3 years ago
Where'd the British guy go? What did you do with him?! Who is this guy?! Sorry, I haven't even watched the video yet.
@KidNickles 3 years ago
Do a video on RAID storage! With all this talk about big data and storage, I would love some videos on RAID 5 and parity drives!
@austinskylines 3 years ago
IPFS
@drdca8263 3 years ago
Do you use it? I think it's cool, but currently it competes a bit with my too-large number of tabs, and since I don't get much use from actively running it, I generally don't keep it running. I guess that's maybe just because I haven't put in the work to find a use that fits my use cases?
@AxeJamie 3 years ago
I want to know what the largest dataset is...
@recklessroges 3 years ago
Depends on how you define the set. The LHC has one of the largest data bursts, but the entire Internet could be considered a single distributed cluster...
@quintrankid8045 3 years ago
Largest amount of data in bits = (number of atoms in the universe - number of atoms required to keep you alive) / number of atoms required to store and process each bit(*)
(*) Assumes that all atoms are equally useful for storing and processing data and for keeping you alive. Also assumes that all the data needs to persist. The number of atoms required to keep you alive may vary by individual and by requirements for food, entertainment and socialization. All calculations require integer results. Please consult with a quantum mechanic before proceeding.
@JimLeonard 3 years ago
Nearly two million subscribers, but still can't afford a tripod.
@treyquattro 3 years ago
So I'm screaming "MapReduce" (well, OK, internally screaming) and at the very end of the video we get there. What a tease!
@isaactriguero3155 3 years ago
there is another video explaining MapReduce! And I am planning to do some live-coding videos in Python.
@DorthLous 3 years ago
"1 gig of data". Look at my job as a dev. Look at my entertainment: games on Steam and videos. Yeaaaahhh...
@llortaton2834 3 years ago
He still misses dad to this day
@Random2 3 years ago
Ehm... it is very weird that scale in/out and scale up/down are discussed in terms of big data, when those concepts are completely independent and predate big data as a whole... Having watched the entire video, this might be one of the least well-delineated videos on the entire channel. It mixes parts of different concepts together as if they all came from big data, or are all related to big data, while failing to address the historical origins of big data and map/reduce. Definitely below average for Computerphile.
@DominicGiles 3 years ago
There's data.... That's it...
@AudioPervert1 3 years ago
Not everyone is talking about big data 😭😭😭😂😂😂 These big data dudes never speak of the pollution, contamination and carbon generated by their marvellous technology. Big data could do nothing about the pandemic, for example...
@isaactriguero3155 3 years ago
well, I briefly mentioned the problem of sustainable big data, and I might be able to put together a video about this. You're right that not many people seem to care much about the amount of resources a big data solution may use! This is where we should be going in research: trying to develop cost-effective AI, which only uses big data technology when strictly needed and when it is useful.
@lowpasslife 3 years ago
Cute accent
@NeThZOR 3 years ago
420 views... I see what you did there
@thekakan 3 years ago
Big data is data we don't know what we can do with _yet_ 😉 ~~lemme have my fun~~ 6:08 when can we buy Computerphile GPUs? 🥺
@pdr. 3 years ago
This video felt more like marketing than education, sorry. Surely you just use whatever solution is appropriate for your problem, right? Get that hammer out of your hand before fixing the squeaky door.
@syntaxerorr 3 years ago
DoN'T UsE WinDOws... Linux: Let me introduce you to the OOM killer.
@kevinbatdorf 3 years ago
What? Buying more memory is cheaper than buying more computers… which just means you're throwing more memory and CPU at it. I think you meant you solve it by writing a slower algorithm that uses less memory as the alternative. Also, buying more memory is often cheaper than the labor cost of refactoring, especially when it comes to distributed systems. And why the Windows hate? I don't use Windows, but I still cringed a bit there.
@malisa71 3 years ago
Time is money and nobody wants to wait for results. The solution is to make fast and efficient programs with proper memory utilisation. Almost no serious institution uses Windows for such tasks; maybe on the client side, but not on a node or server.
@yukisetsuna1325 3 years ago
first
@darraghtate440 3 years ago
The bards shall sing of this victory in the annals of time.
@vzr314 3 years ago
No. Everyone is talking about COVID. And I listened to him until he mentioned COVID in the first few minutes. Enough of the broken English anyway.