From reading the comments, I think what some people are missing is the following: the value of MapReduce as it was implemented at Google (and in Hadoop) is that it is a framework for doing these sorts of jobs at scale. Furthermore, the framework lets you plug in your map and reduce functions independently of everything else. That is why the shuffle is important, for example. Often, people see the word count example and want to just do the counting right at that point. That might be better for this particular example, but the idea of the framework/algorithm is that it does not "know" at the shuffle phase that you are going to count. I think more examples would really help people understand it better.
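For instance, here is a minimal single-machine sketch in Python (the function names are made up for illustration, not Hadoop's actual API) of how the framework can stay completely ignorant of what the reduce step does:

from collections import defaultdict

# User-supplied map function: emit (key, value) pairs for one record.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# User-supplied reduce function: combine all values seen for one key.
def reduce_fn(key, values):
    return (key, sum(values))

# "Framework" code: it only groups pairs by key; it knows nothing about counting.
def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)               # the shuffle: group values by key
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

print(map_reduce(["a b a", "c a b"], map_fn, reduce_fn))
# [('a', 3), ('b', 2), ('c', 1)]

On a real cluster the map calls run on different nodes and the shuffle moves the pairs over the network, but the contract you program against is the same: you only ever supply map_fn and reduce_fn.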
@szhangster 6 years ago
Professor Baclawski developed the search engine technology now called MapReduce in 1993. Northeastern University patented this technology, and the patent has not been successfully challenged. Northeastern University sued Google for patent infringement, and Google paid a settlement to Northeastern University.
@TheDrLecter 6 years ago
Thanks for the info
@NikolajLepka 6 years ago
map and reduce also just happen to be the two most commonly used functions in functional programming, and have been for decades. This tech isn't new; it's just being applied at a large scale.
@BarbarianGod 6 years ago
Glad I'm not the only one bothered by the opening of this video
@No0utlet 6 years ago
I remember learning about Hadoop Map Reduce in 2013 and being completely confused about it. Thanks!
@philippk7554 6 years ago
You can actually use k-means with MapReduce. It basically works by splitting the data into sets and running k-means on different nodes with the same cluster values. After that, the new means are calculated and broadcasted to all nodes for the next iteration.
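A rough sketch of what one such iteration could look like, in plain Python/NumPy (just to show the shape of it, not any particular framework's API): the map step assigns each point to its nearest centroid, and the reduce step averages the assigned points per centroid to get the new means.

import numpy as np

def map_points(points, centroids):
    # Map phase, run independently on each node's chunk of the data:
    # emit (nearest_centroid_index, (point, 1)) for every point.
    for p in points:
        idx = int(np.argmin(np.linalg.norm(centroids - p, axis=1)))
        yield idx, (p, 1)

def reduce_means(pairs, k, dim):
    # Reduce phase: sum points and counts per centroid, then divide.
    sums, counts = np.zeros((k, dim)), np.zeros(k)
    for idx, (p, n) in pairs:
        sums[idx] += p
        counts[idx] += n
    return sums / np.maximum(counts, 1)[:, None]

chunks = [np.random.rand(100, 2), np.random.rand(100, 2)]   # two "nodes" of data
centroids = np.random.rand(3, 2)
pairs = [kv for chunk in chunks for kv in map_points(chunk, centroids)]
centroids = reduce_means(pairs, k=3, dim=2)   # broadcast to all nodes and repeat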
@cthudo 6 years ago
I admit I'm an old fogy programmer... I have a massive problem with crediting this to Google. They *may* have been the ones to put a name on it, but that doesn't mean this isn't an exceedingly basic strategy of dividing work and collating the results that was being used by a lot of people for decades.
@Kachkavalj 4 years ago
It's really refreshing to hear someone explain this in a straightforward way... kudos to that young lady!
@cl0udbear 6 years ago
I don't get it. So you shuffle the keys between the nodes so that each node is counting all the instances of the same key? Wouldn't that shuffling have been the perfect time to just do the count and be done with it?
@STDrepository 6 years ago
In this example, why wouldn't you just have each server count the words in its data and then send the results back to the operator's machine, which adds them up?
@tomburns5231 6 years ago
I don't understand what a, b, and c are in the example. I thought they were words, but then the presenter calls them letters. In any case, given the example of distributing a word count over parallel processors, why isn't the answer simply splitting the whole document/data into chunks and asking the processors to count the words of one chunk each? Afterwards we can just sum the results. What am I missing?
@unvergebeneid 6 years ago
Is it just me or was that not a particularly good example?
@viorelanghel5532 6 years ago
It's the classic MapReduce example, but you need to imagine 10 TB of text, not four words a b c d.
@KylePiira 6 years ago
Perhaps not, but word counting is the typical Hello World project for MapReduce so it makes sense to use it.
@Ceelvain 6 years ago
The word count is a good example. (Actually the best I've come across.) She just should have chosen sentences with actual words and put it in the context of a search engine.
@mrtoastyman07 6 years ago
This feels like the most impenetrable way of explaining this...
@MokhlesBouzaien 5 years ago
It was confusing until I found a Computerphile video.
@IceMetalPunk 6 years ago
Why does this allow duplicate keys on the same node instead of just one key per word with the value being the total count of the word on that node? Seems like that would speed up both the shuffle and reduce phases?
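To make the question concrete, here is a rough Python sketch of the two options (Counter standing in for the per-node table):

from collections import Counter

def map_chunk(text):
    # Plain map output: one (word, 1) pair per occurrence.
    return [(word, 1) for word in text.split()]

def combine(pairs):
    # Pre-sum on the node before the shuffle, so fewer pairs cross the network.
    local = Counter()
    for word, n in pairs:
        local[word] += n
    return list(local.items())

print(map_chunk("a b a a c b"))            # [('a', 1), ('b', 1), ('a', 1), ...]
print(combine(map_chunk("a b a a c b")))   # [('a', 3), ('b', 2), ('c', 1)]

As far as I know, Hadoop does expose exactly this as an optional "combiner" step; presumably it can't be the default because it is only correct when the reduce function is associative and commutative, and the framework doesn't know that in general.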
@Myziikhavus 11 months ago
This is a good video, no idea wtf the rest are yapping about. She actually goes through the main bit of how data shuffles and moves to different nodes
@brookead 6 years ago
Hadoop as an approach to MapReduce is so over-engineered and complicated that most people avoid it. Also, unless you're at really big scale, the overhead of Hadoop is so big that any decent laptop can outperform it. That's specifically a Hadoop criticism, not a criticism of MR in general. Alternatives like Dask (simplistically: distributed Python pandas) are much simpler to manage for most purposes... Not quite the same thing, but ultimately you pick your poison... And as noted by another commenter, the thing Google "pioneered" was _distributed_ MR. MR itself has been a part of many languages for many years... And the good explanation given in this video works just as well for those as it does for distributed versions thereof. :)
@theguitarhero649 6 years ago
I just had to do a MapReduce implementation with gRPC for my Advanced Operating Systems class that was due last night, wish this had come out earlier -_-
@steven1671 6 years ago
Is "reduce" the same as the fold function in Haskell?
@Yotanido 6 years ago
It is
@Earthcomputer 6 years ago
Yes
@klegsy 6 years ago
Yes! Fold with the added benefit of directionality.
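In Python terms (sticking with the word-count example), the reduce phase is basically a fold over the list of values the shuffle collected for one key; a minimal sketch:

from functools import reduce   # Python's left fold, roughly foldl

values = [1, 1, 1]             # everything the shuffle gathered for the key "a"
total = reduce(lambda acc, v: acc + v, values, 0)
print(("a", total))            # ('a', 3)

One caveat: the values for a key typically arrive in no particular order, so the reduce function is usually expected to be associative and commutative, whereas foldl/foldr fix a direction.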
@233kosta 6 years ago
Can you do a segment on the MPI protocol?
@MrTridac 6 years ago
When will computer scientists stop putting new fancy names on decade old concepts?
@luciengrondin5802 6 years ago
When I hear about map/reduce I think about the corresponding commands in Perl, and that confuses me because they don't seem to be much related.
@monthlyduc 6 years ago
Could you do a video on concurrency and possible solutions/history video? I'm surprised you haven't yet touched on sequential and concurrent execution. Great vid!
@mytech6779 6 years ago
They have covered concurrency within a CPU core, unless maybe you mean parallel execution.
@codewithstephen6576 5 months ago
Is it just me, or is everything complex just "break it down -> bring it back together"?
@worlddaves 6 years ago
Why do you always write on that particular type of paper in your videos?
@samsharma8621 6 years ago
They get that for free
@rich1051414 6 years ago
Consistency. Computerphile uses dot-matrix paper while Numberphile uses brown paper towels.
@feldinho 6 years ago
It's called "continuous form" and was historically used with daisy wheel-type printers (fixed typeface, size and line height) to make reading easier.
@SteelSkin667 6 years ago
Because computers.
@saschar.8736 6 years ago
@@feldinho Yeah, I guess they have a bunch of it lying around from that era. At home we also have (or had; I don't know if my father has thrown it away yet) a stack of similar paper from back when we had a dot matrix printer. And I imagine that, at a university, they have tons of that stuff left.
@Gunbudder 6 years ago
Map reduce gives me google interview flashbacks...
@man-alive 6 years ago
If the mapping creates a single record for each occurrence (e.g. a 1, b 1, c 1), why not just use an array (e.g. a, b, a), since the 1 is implied?
@ignacio560 6 years ago
Thank you, amazing video! Any chance you’ll talk about Spark next? 😄
@TheAstronomyDude 6 years ago
How would you implement this if you want to leave your data untouched? Mirror your data?
@Juan-Hdez 2 years ago
Very useful. Thank you very much!
@sabriath 6 years ago
I'm not seeing the practicality of this; shuffling seems to make it pointless. It's almost like seeing a Huffman tree being formed, but then introducing it to some meth.
@allmhuran 6 years ago
select count(...) group by ... Query plan: parallelism, stream aggregate. But sure, tell me more about how SQL is outdated.
@lionp696 6 years ago
Needed this a week ago for my exam :(
@alanp9936 2 months ago
Thanks!
@mro2352 4 years ago
My previous project was with Apache Spark, which uses Hadoop and MapReduce, and we used it for batch jobs.
@MayankArora 6 years ago
I love computerphile ❤️
@WilliamAndrea 6 years ago
Why are there duplicate keys? Maybe it's because I only know Python, but I'm really confused. Like for the first line of text, I would write the output of the map phase as a list of tuples, [('a', 1), ('b', 1), ('a', 1)], but wouldn't it make more sense as a dict, {'a': 2, 'b': 1}?
@cediddi 6 years ago
no love for filter :( it's map-reduce-filter (at least for me)
@MarioWenzel 6 years ago
filter can be part of either phase, depending on how you see it. filter p l = foldl (++) [] $ map (\x -> if p x then [x] else []) l
@Master1906 6 years ago
She didn't explain it very well...
@Metruzanca 6 years ago
Does the way this is done have any similarity to how array.prototype.map() and array.prototype.reduce() work in ES6 JS? Edit: I mean at the level of dividing the map process across different processes/threads/tasks to speed up computation.
@belst_ 6 years ago
map is not computed concurrently but sequentially in JS.
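The MapReduce-style map, by contrast, is data-parallel: the same function runs on independent chunks on different workers. A rough single-machine analogy in Python (multiprocessing.Pool is from the standard library; the chunks here are just made up for the example):

from multiprocessing import Pool
from collections import Counter

def count_words(chunk):
    # The "map" work for one chunk, runnable on any worker independently.
    return Counter(chunk.split())

if __name__ == "__main__":
    chunks = ["a b a", "c a b", "b b c"]
    with Pool() as pool:                       # hands chunks to worker processes
        partials = pool.map(count_words, chunks)
    print(sum(partials, Counter()))            # "reduce": merge the partial counts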
@geekworthy7938 6 years ago
Yeah, that was a pretty bad example and quite boring.
@coder001 6 years ago
too much talking and not enough visual examples
@PilotPlater 6 years ago
Forget the nasty comments; I found this fascinating, and so did 94% of the people who rated.
@ninocraft1 6 years ago
brilliant example of a practical application
@FahadKiani1 3 years ago
Anyone know who the genius is?
@__-to3hq 6 years ago
the sound of felt pens makes me cringe
@tenseikenzx-3559 6 years ago
What's better MapReduce or CUDA?
@CJSwitz 6 years ago
That's like comparing Apache Webserver to Fortran. They are completely different things.
@SteelSkin667 6 years ago
I don't think these two are directly comparable. I'm pretty sure there are GPGPU implementations of MapReduce that make use of CUDA.
@IsYitzach 6 years ago
MapReduce is a generic algorithm paradigm. There's a CUDA implementation for numerical purposes. This is a description for a cluster-sized problem rather than a GPU-sized problem.