Burrows-Wheeler Transform (Ep 4, Compressor Head) Google

Рет қаралды 31,354

Күн бұрын

Пікірлер: 50

@GoogleDevelopers 9 жыл бұрын

*The new Compressor Head episode is finally here!* We know you've been waiting on pins and needles for the next episode of Compressor Head, and it's finally here! Colt McAnlis returns as the algorithm obsessed geek who loves to share his knowledge with the programmers of the world. He’s joined by fellow Googler Mike Burrows, co-inventor of the Burrows-Wheeler transform, an almost magical algorithm that will turn your head inside out. Stay tuned to the Google Developers KZbin channel (kzbin.info) for the soon to be released episodes 5 and 6, coming your way Friday mornings from the Googleplex in Mountain View CA. #CompressorHead #PerfMatters

@roberttompkins1172 9 жыл бұрын

Nice

@haroldlcarpenter8481 9 жыл бұрын

chrome-dome nerd

@ColtMcAnlis 9 жыл бұрын

Compressor Head is BACK! And I'm going "bananas" over Episode 4! (....you won't get the joke until you watch the episode....)

@ColtMcAnlis 9 жыл бұрын

Karthik Kumar Viswanathan I've ranted about that exact problem before. There's a plethora of existing formats out there to handle this, devs just aren't embracing them (protobuffs, gRPC, captn proto, flatbuffers etc)

@danielschreck9083 4 жыл бұрын

great video, one error: Magnus does the sort wrong, and the final matrix comes out wrong as a result: at 7:58, for example, with 3 characters in each row, Magnus sorts NAN before NAB. The final matrix (8:17) has last column BNNAAA (doesn't match the original last column), should be NNBAAA if sorts are done correctly in the invert phase.

@АдисХоджаяров 4 жыл бұрын

that's why Compressor Head said: "I don't think he got a shot at that one".

@vonkruel 9 жыл бұрын

I remember reading that article in Dr. Dobb's Journal, and implementing this algorithm was a great exercise. It's a very special algorithm. I _loved_ that story about declining to re-submit the paper, and saying "it's my policy not to explain these decisions." Golden!

@amitsubhash1586 Жыл бұрын

This was released 8 years aho and I am still able to learn from it , thank you team google!

@yuvrajgarg3921 5 жыл бұрын

Thanks Google Glad to hear words directly from Sir Michael Burrows .

@jackninja1 9 жыл бұрын

These video's are amazing, really like the amount of detail and the length that they have, keep up the good work!

@mgehad 8 жыл бұрын

I can't find the words to describe this. this such so interesting and fun. You're such an amazing head man! Thanks.

@BryceDoesLife 7 жыл бұрын

My left ear loved that intro.

@asedaaboagye5877 9 жыл бұрын

Another awesome episode!

@silkwurm 4 жыл бұрын

Entropy doesn’t “fail to take into account the order of the symbols”, that’s a problem with character codes, not entropy. Entropy is defined for the joint distribution, order and all

@boushaft3459 5 жыл бұрын

Hey man, great video, really entertaining!

@runlaketahoe 9 жыл бұрын

Will have to put #houseofcards on hold for a bit to watch Compressor Head!

@RakhiDhavale 9 жыл бұрын

Great to know about BWT !

@matthewpobox 3 жыл бұрын

what was the point of 2:55?

@zihuatanejo7741 4 жыл бұрын

from coursera algorithm Ⅱ,the specification of the last assignment is pretty arcane...

@yohanrizk3398 9 жыл бұрын

Woah this stuff is coool! I was following along the delta coding part, and I might be totally be wrong about this , but isn't the delta coding of [9,2,1,3,4,8,0,6,7,5] equal to [9,-7,-1,2,1,4,-8,6,1,-2] ? your calculation has -1 at the end. Great Video, BWT really is something else.

@diegonayalazo 3 жыл бұрын

Thanks

@yohanrizk3398 8 жыл бұрын

Also... "Competetive sorter" is misspelled , the correct spelling is Competitive

@heidimetzler 6 жыл бұрын

Error around 7:50: Linux implementation uses Huffman, not arithmetic coding. Great video.

@stgigamovement 5 жыл бұрын

It originally did use Fenwick arithmetic coding, but the good news is that bwtc was invented which omits the first RLE step of bzip1 and bzip2 and uses arithmetic coding by default.

@simonchou4597 2 жыл бұрын

Why the entropy from 0 to 9 is 4 at 1:28? 3.32 is what I calculated(base 2)

@rustix3 3 жыл бұрын

8:37 The last letter of the first row 'B' should be switched with last letter of the 3rd row 'N'.

@jelouodsa1395 9 жыл бұрын

Colt McAnlis I think there is a better way when doing inverse transforming the BWT output by following the index ... than sorting it over and over again.... It would be great if you share that in the next episode... :)

@ColtMcAnlis 9 жыл бұрын

Jelou odsa Lots of perf optimizations/approaches that we didn't get to talk about in this episode. My favorite has to be RadixZip.

@Duhdad 9 жыл бұрын

Great job!! I Love Compressor Head. Keep feeding my grandkids. .Hey. Wasn't Magus riding a bike last we saw?? I/O Bytes? Is he going for a triathalon thingy?? GO Mag ~(8^(!)~

@josh2731 9 жыл бұрын

One part I did not understand about the sort was what way was he sorting the data. Does it have to be sorted using a specific sorting method(first to last, in clusters, both ends same time) to work. Or as long as it is the same sorting method used in sorting and unsorting

@ColtMcAnlis 9 жыл бұрын

You want a lexicographical sort (A-Z sorting). Doesn't matter what method you use, as long as it lexicographically results in the same matrix. There's other crazy variants of BWT that use different types of sort methods in order to achieve different results (cryptographic, etc) but we didn't have time to go into those ;)

@petergrajciar1504 8 жыл бұрын

In 7:58 I think , that it was wrong sorting. 1, [ABA] 2, [ANA] 3, [ANA] , 4, [BAN] 5, [NAN] 6, [NAB]. It must be: 1, [ABA] 2, [ANA] 3, [ANA] , 4, [BAN] 5, [NAB] 6, [NAN]. So if you will continue in sorting , than result it will be: 1, [NABANA] 2, [NANABA] 3, [BANANA] , 4, [ABANAN] 5, [ANABAN] 6, [ANANAB]. BANANA is in the third line , but when you doing lexical sorting on 4:10 , you have BANANA on fourth line. Sorry for English :)

@MightyPuff 7 жыл бұрын

I agree with you.

@zrouth 9 жыл бұрын

This shit is Bananas

@Chester_Snorflex 7 жыл бұрын

8:09 third row doesnt have a 'B' in it?

@shawn576 7 жыл бұрын

Can't explain how they came up with it = they killed someone and stole it

@SiddharthKulkarniN 9 жыл бұрын

I am 'gelous', and not because it's a pun :) :)

@delucabruno 8 жыл бұрын

Im sory but I don't think that sorting everything back N times is very efficient, cause if I want to compress the whole Harry Potter book, it would take a lot of time to decode..

@blueskyresearch6701 9 жыл бұрын

BWT would be a great way to prep data for RLE. I know RLE am I in kindergarten or what.

@stgigamovement 4 жыл бұрын

Bzip1, bzip2, and bwtc do use rle after bwt, while bzip1 and bzip2 also have rle as the first step

@Wandelaarke 9 жыл бұрын

And again, thank you Google. Go free education about cool stuff that teachers don't understand themselves!

@CasperBang 9 жыл бұрын

Why Move-To-Front rather than RLE after the BW transform? Seems like the clustering of BW makes it a perfect candidate for applying RLE to it, followed by i.e. Huffman encoding?!

@ColtMcAnlis 9 жыл бұрын

Casper Bang RLE works best as an statistical encoder when there's long runs of similar characters, that aren't oddly broken up. Remember that BWT doesn't output homogenous runs of characters. There's no guarantee of that. You may (luckily) end up with a run of 20 of the same character, only to be disrupted by a 20 character run of individual, non-run characters. MTF takes this type of statistical changes into account. If you get 20 "E"s broken up by a single "Z" you don't lose as much encoding efficiency as you would having to restart a run in RLE mode. And to be honest, there's really cool ways to merge MTF and RLE together to make some fancy algorithms. RadixZip introduced something like that : citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.3977

@CasperBang 9 жыл бұрын

Colt McAnlis Ah I see, thanks for the info - just used my ACM subscription to get to the RadixZip paper, it's the second time I hear you mention it so it must be a worthy read. :) Just hope it isn't too encompassed by patents and the like.

@Michael_19056 7 жыл бұрын

The strongest encoding methods for BWT do not use any secondary transform at all. M03, for instance, generally achieves the strongest compression of all the BWT encoding techniques and it uses no secondary transforms whatsoever. Instead, the symbols are encoded as they appear in the BWT but using the statistics gathered from the contexts under which that symbol appeared within the original steam. see: www.michael-maniscalco.com/download.htm www.juergen-abel.info/files/preprints/preprint_post_bwt_stages.pdf