From reading the comments, I think what some people are missing is the following: the value of MapReduce as it was implemented at Google (and in Hadoop) is that it is a framework for doing these sorts of jobs at scale. Furthermore, the framework lets you plug in your map and reduce functions independently of everything else. That is why the shuffle is important, for example. Often, people see the word count example and want to just do the counting right at that point. That might be better for this particular example, but the idea of the framework/algorithm is that it does not "know" at the shuffle phase that you are going to count. I think more examples would really help people understand it better.
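For instance, here is a minimal single-machine sketch in Python (the function names are made up for illustration, not Hadoop's actual API) of how the framework can stay completely ignorant of what the reduce step does:

from collections import defaultdict

# User-supplied map function: emit (key, value) pairs for one record.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# User-supplied reduce function: combine all values seen for one key.
def reduce_fn(key, values):
    return (key, sum(values))

# "Framework" code: it only groups pairs by key; it knows nothing about counting.
def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)               # the shuffle: group values by key
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

print(map_reduce(["a b a", "c a b"], map_fn, reduce_fn))
# [('a', 3), ('b', 2), ('c', 1)]

On a real cluster the map calls run on different nodes and the shuffle moves the pairs over the network, but the contract you program against is the same: you only ever supply map_fn and reduce_fn.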
@szhangster 6 years ago
Professor Baclawski developed the search engine technology now called MapReduce in 1993. Northeastern University patented this technology, and the patent has not been successfully challenged. Northeastern University sued Google for patent infringement, and Google paid a settlement to Northeastern University.
@TheDrLecter 6 years ago
Thanks for the info
@NikolajLepka 6 years ago
map and reduce also just happen to be the two most commonly used functions in functional programming, and have been for decades. This tech isn't new; it's just being applied at a large scale.
@BarbarianGod 6 years ago
Glad I'm not the only one bothered by the opening of this video
@No0utlet 6 years ago
I remember learning about Hadoop Map Reduce in 2013 and being completely confused about it. Thanks!
@philippk7554 6 years ago
You can actually use k-means with MapReduce. It basically works by splitting the data into sets and running k-means on different nodes with the same cluster values. After that, the new means are calculated and broadcasted to all nodes for the next iteration.
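A rough sketch of what one such iteration could look like, in plain Python/NumPy (just to show the shape of it, not any particular framework's API): the map step assigns each point to its nearest centroid, and the reduce step averages the assigned points per centroid to get the new means.

import numpy as np

def map_points(points, centroids):
    # Map phase, run independently on each node's chunk of the data:
    # emit (nearest_centroid_index, (point, 1)) for every point.
    for p in points:
        idx = int(np.argmin(np.linalg.norm(centroids - p, axis=1)))
        yield idx, (p, 1)

def reduce_means(pairs, k, dim):
    # Reduce phase: sum points and counts per centroid, then divide.
    sums, counts = np.zeros((k, dim)), np.zeros(k)
    for idx, (p, n) in pairs:
        sums[idx] += p
        counts[idx] += n
    return sums / np.maximum(counts, 1)[:, None]

chunks = [np.random.rand(100, 2), np.random.rand(100, 2)]   # two "nodes" of data
centroids = np.random.rand(3, 2)
pairs = [kv for chunk in chunks for kv in map_points(chunk, centroids)]
centroids = reduce_means(pairs, k=3, dim=2)   # broadcast to all nodes and repeat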
@cthudo 6 years ago
I admit I'm an old fogy programmer... I have a massive problem with crediting this to Google. They *may* have been the ones to put a name on it, but that doesn't mean this isn't an exceedingly basic strategy of dividing work and collating the results that was being used by a lot of people for decades.
@Kachkavalj 4 years ago
It's really refreshing to hear someone explain this in a straightforward way... kudos to that young lady!
@cl0udbear 6 years ago
I don't get it. So you shuffle the keys between the nodes so that each node is counting all the instances of the same key? Wouldn't that shuffling have been the perfect time to just do the count and be done with it?
@STDrepository 6 years ago
In this example, why wouldn't you just have each server count the words in its data and then send the results back to the operator's machine, which adds them up?
@tomburns5231 6 years ago
I don't understand what a, b, and c are in the example. I thought they were words, but then the presenter calls them letters. In any case, given the example of distributing a word count over parallel processors, why isn't the answer simply splitting the whole document/data into chunks and asking the processors to count the words of one chunk each? Afterwards we can just sum the results. What am I missing?
@unvergebeneid 6 years ago
Is it just me or was that not a particularly good example?
@viorelanghel5532 6 years ago
It's the classic MapReduce example, but you need to imagine 10 TB of text, not four words a b c d.
@KylePiira 6 years ago
Perhaps not, but word counting is the typical Hello World project for MapReduce so it makes sense to use it.
@Ceelvain 6 years ago
The word count is a good example. (Actually the best I've come across.) She just should have chosen sentences with actual words and put it in the context of a search engine.
@mrtoastyman07 6 years ago
This feels like the most impenetrable way of explaining this...
@MokhlesBouzaien 5 years ago
It was confusing until I found a Computerphile video.
@IceMetalPunk 6 years ago
Why does this allow duplicate keys on the same node instead of just one key per word with the value being the total count of the word on that node? Seems like that would speed up both the shuffle and reduce phases?
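To make the question concrete, here is a rough Python sketch of the two options (Counter standing in for the per-node table):

from collections import Counter

def map_chunk(text):
    # Plain map output: one (word, 1) pair per occurrence.
    return [(word, 1) for word in text.split()]

def combine(pairs):
    # Pre-sum on the node before the shuffle, so fewer pairs cross the network.
    local = Counter()
    for word, n in pairs:
        local[word] += n
    return list(local.items())

print(map_chunk("a b a a c b"))            # [('a', 1), ('b', 1), ('a', 1), ...]
print(combine(map_chunk("a b a a c b")))   # [('a', 3), ('b', 2), ('c', 1)]

As far as I know, Hadoop does expose exactly this as an optional "combiner" step; presumably it can't be the default because it is only correct when the reduce function is associative and commutative, and the framework doesn't know that in general.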
@Myziikhavus 11 months ago
This is a good video, no idea wtf the rest are yapping about. She actually goes through the main bit of how data shuffles and moves to different nodes
@brookead 6 years ago
Hadoop as an approach to MapReduce is so over-engineered and complicated that most people avoid it. Also, unless you're at really big scale, the overhead of Hadoop is so big that any decent laptop can outperform it. That's specifically a Hadoop criticism, not a criticism of MR in general. Alternatives like Dask (simplistically: distributed Python pandas) are much simpler to manage for most purposes... Not quite the same thing, but ultimately you pick your poison... And as noted by another commenter, the thing Google "pioneered" was _distributed_ MR. MR itself has been a part of many languages for many years... And the good explanation given in this video works just as well for those as it does for distributed versions thereof. :)
@theguitarhero649 6 years ago
I just had to do a MapReduce implementation with gRPC for my Advanced Operating Systems class that was due last night, wish this had come out earlier -_-
@steven1671 6 years ago
Is "reduce" the same as the fold function in Haskell?
@Yotanido 6 years ago
It is
@Earthcomputer 6 years ago
Yes
@klegsy 6 years ago
Yes! Fold with the added benefit of directionality.
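In Python terms (sticking with the word-count example), the reduce phase is basically a fold over the list of values the shuffle collected for one key; a minimal sketch:

from functools import reduce   # Python's left fold, roughly foldl

values = [1, 1, 1]             # everything the shuffle gathered for the key "a"
total = reduce(lambda acc, v: acc + v, values, 0)
print(("a", total))            # ('a', 3)

One caveat: the values for a key typically arrive in no particular order, so the reduce function is usually expected to be associative and commutative, whereas foldl/foldr fix a direction.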
@233kosta 6 years ago
Can you do a segment on the MPI protocol?
@MrTridac 6 years ago
When will computer scientists stop putting new fancy names on decade old concepts?
@luciengrondin5802 6 years ago
When I hear about map/reduce I think about the corresponding commands in Perl, and that confuses me because they don't seem to be much related.
@monthlyduc 6 years ago
Could you do a video on concurrency and possible solutions/history video? I'm surprised you haven't yet touched on sequential and concurrent execution. Great vid!
@mytech6779 6 years ago
They have covered concurrency within a CPU core, unless maybe you mean parallel execution.
@codewithstephen6576 5 months ago
Is it just me, or is everything complex just "break it down -> bring it back together"?
@worlddaves 6 years ago
Why do you always write on that particular type of paper in your videos?
@samsharma8621 6 years ago
They get that for free
@rich1051414 6 years ago
Consistency. Computerphile uses dot-matrix paper while Numberphile uses brown paper towels.
@feldinho 6 years ago
It's called "continuous form" and was historically used with daisy wheel-type printers (fixed typeface, size and line height) to make reading easier.
@SteelSkin667 6 years ago
Because computers.
@saschar.8736 6 years ago
@@feldinho Yeah, I guess they have a bunch of it lying around from that era. At home we also have (or had; I don't know if my father has thrown it away yet) a stack of similar paper from back when we had a dot matrix printer. And I imagine that, at a university, they have tons of that stuff left.
@Gunbudder 6 years ago
Map reduce gives me google interview flashbacks...
@man-alive 6 years ago
If the mapping creates a single record for each occurrence (e.g. a 1, b 1, c 1), why not just use an array (e.g. a, b, a), since the 1 is implied?
@ignacio560 6 years ago
Thank you, amazing video! Any chance you’ll talk about Spark next? 😄
@TheAstronomyDude 6 years ago
How would you implement this if you want to leave your data untouched? Mirror your data?
@Juan-Hdez 2 years ago
Very useful. Thank you very much!
@sabriath 6 years ago
I'm not seeing the practicality of this; shuffling seems to make it pointless. It's almost like seeing a Huffman tree being formed, but then introducing it to some meth.
@allmhuran 6 years ago
select count(...) group by ... Query plan: parallelism, stream aggregate. But sure, tell me more about how SQL is outdated.
@lionp696 6 years ago
Needed this a week ago for my exam :(
@alanp9936 2 months ago
Thanks!
@mro2352 4 years ago
My previous project was with Apache Spark, which uses Hadoop and MapReduce, and we used it for batch jobs.
@MayankArora 6 years ago
I love computerphile ❤️
@WilliamAndrea 6 years ago
Why are there duplicate keys? Maybe it's because I only know Python, but I'm really confused. Like for the first line of text, I would write the output of the map phase as a list of tuples, [('a', 1), ('b', 1), ('a', 1)], but wouldn't it make more sense as a dict, {'a': 2, 'b': 1}?
@cediddi 6 years ago
no love for filter :( it's map-reduce-filter (at least for me)
@MarioWenzel 6 years ago
filter can be part of either phase, depending on how you see it. filter p l = foldl (++) [] $ map (\x -> if p x then [x] else []) l
@Master1906 6 years ago
She didn't explain it very well...
@Metruzanca 6 years ago
Does the way this is done have any similarity to how array.prototype.map() and array.prototype.reduce() work in ES6 JS? Edit: I mean at the level of dividing the map process across different processes/threads/tasks to speed up computation.
@belst_ 6 years ago
map is not computed concurrently but sequentially in JS.
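The MapReduce-style map, by contrast, is data-parallel: the same function runs on independent chunks on different workers. A rough single-machine analogy in Python (multiprocessing.Pool is from the standard library; the chunks here are just made up for the example):

from multiprocessing import Pool
from collections import Counter

def count_words(chunk):
    # The "map" work for one chunk, runnable on any worker independently.
    return Counter(chunk.split())

if __name__ == "__main__":
    chunks = ["a b a", "c a b", "b b c"]
    with Pool() as pool:                       # hands chunks to worker processes
        partials = pool.map(count_words, chunks)
    print(sum(partials, Counter()))            # "reduce": merge the partial counts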
@geekworthy7938 6 years ago
Yeah, that was a pretty bad example and quite boring.
@coder001 6 years ago
too much talking and not enough visual examples
@PilotPlater 6 years ago
Forget the nasty comments; I found this fascinating, and so did 94% of the people who rated.
@ninocraft1 6 years ago
brilliant example of a practical application
@FahadKiani1 3 years ago
Anyone know who the genius is?
@__-to3hq 6 years ago
the sound of felt pens makes me cringe
@tenseikenzx-3559 6 years ago
What's better MapReduce or CUDA?
@CJSwitz 6 years ago
That's like comparing Apache Webserver to Fortran. They are completely different things.
@SteelSkin667 6 years ago
I don't think these two are directly comparable. I'm pretty sure there are GPGPU implementations of MapReduce that make use of CUDA.
@IsYitzach 6 years ago
MapReduce is a generic algorithm paradigm. There's a CUDA implementation for numerical purposes. This is a description for a cluster-sized problem rather than a GPU-sized problem.