The most versatile compression technique? (Burrows-Wheeler Transform)

19,731 views

Graphicode

A day ago

Comments: 160
@graphicode_ 2 months ago
Hi! Woke up to some comments and thought I should address some very important points here. Thank you for all the discourse! It makes me happy that people are raising concerns/voicing opinions here. More answers to FAQ/important things can be found in the specific video section here: graphicode.simple.ink/
1. BWT, in itself, is not a compression algorithm. It is more of a transformation (as the name implies), and has to be used in combination with other techniques (such as the RLE talked about in the video) for it to be useful.
2. As with all compression techniques, be it BWT+RLE, Lempel-Ziv, arithmetic coding, Huffman coding etc., there is a limit to compression. If not, we would be compressing everything into nothingness, which just isn't possible.
I see the inaccuracies in the title; I guess it comes with this caveat, and point (1) itself as well. I originally wanted to pose "always compressible" as a question, which is why the thumbnail has that re-emphasised, but I see why this is an issue. Not sure of a better way to phrase things as of now though... Edit: The title has been changed after thinking for some time on how to phrase it
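Point (1) can be made concrete in a few lines. A minimal sketch of the BWT+RLE pipeline from the video (naive rotation sort for clarity; real implementations use suffix arrays, and the sentinel `$` is assumed absent from the input):

```python
from itertools import groupby

def bwt(s):
    """Burrows-Wheeler transform: sort all rotations of s + '$'
    and read off the last column."""
    s += "$"  # sentinel, assumed absent from the input
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def rle(s):
    """Run-length encoding: each run becomes <count><char>."""
    return "".join(f"{len(list(run))}{ch}" for ch, run in groupby(s))

print(rle(bwt("ABABAB")))  # transform groups the letters: 3B1$3A
```

On repetitive input the transform clusters identical letters ("ABABAB" becomes "BBB$AAA"), while "BANANA" grows under this naive RLE, matching the counterexample in the video and point (2) above.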
@BrandonDyer64 2 months ago
The title and thumbnail are still clickbait.
@TheWebgecko 2 months ago
How do I watch that video? It doesn't seem to work on mobile.
@xanderplayz3446 a month ago
What song was it in the intro? I can't put my finger on it. Edit: Oh wait, it's Baba Is You.
@warmCabin 2 months ago
A few years ago, a teammate's inept implementation of this very algorithm attempted to load 8 copies of an entire Linux distro into my RAM and hardlocked my computer.
@abdullahajeebi 2 months ago
bruh
@Relkond a month ago
Oofh - did he fail to reduce the data into byte-size chunks?
@warmCabin a month ago
It's even less interesting than that. He just didn't free any memory. I'm not talking about a subtle memory leak, he just... didn't... free any memory. From some commented out slop, I could see that he'd tuned the block size from 1024 down to _8,_ but that workaround failed when I tried to load the aforementioned Linux distro through the testbench. Never seen my computer hardlock quite like that until I started playing Factorio recently... lol To fix it, I just... freed... the memory. When he saw that, he was like, "All that free stuff you did? _Nice_ ." I should mention that he identified C as his "native language." After that, I tried to get him to write a decompressor to prove it worked, but he refused. "We're not doing _decompression_ research, we're doing _compression_ research." 🙄
@abdullahajeebi a month ago
@@warmCabin Is it really compressed if it can't be decompressed?
@Relkond a month ago
@@abdullahajeebi Compressed but can't be decompressed? The proper term is 'hash': what you have is a hash.
@MasterHigure 2 months ago
Looks cool, but it still seems like a total coincidence that the final column is more susceptible to run-length encoding. With a few more examples I'm sure it will switch to looking like magic instead.
@BridgeBum 2 months ago
I can see probabilistic arguments as to why it is more likely to get grouping (as we remove the same letters from the group, what remains will have relatively more repetition of other elements), but that's hardly proof. I'd be interested in seeing one as well.
@furbyfubar 2 months ago
Yeah, the video glossed over *why* it's more likely to have repeated characters after the transform. So much so that I hit up Wikipedia. It works because, in natural languages, the letters that come before a specific letter are more likely to be the same. The letter-shifting moves the first letter in each letter-pair into the rightmost column and sorts alphabetically on the second letters in each pair. The video title being *slightly* clickbaity doesn't help, as this compression algorithm *doesn't* always work; it only works on text where we expect some letters to be more common before specific letters (so it wouldn't work on, say, cipher-text), and it requires a long enough text for it to save characters. I suspect the video title means "always works to decompress", but that should be true of any lossless compression. Of course, no compression algorithm *can* always make the data smaller; if such an algorithm existed we could feed it its own output and somehow infinitely compress data. Here's Wikipedia's explanation of why the Burrows-Wheeler transform works, if that didn't make enough sense above: To understand why this creates more-easily-compressible data, consider transforming a long English text frequently containing the word "the". Sorting the rotations of this text will group rotations starting with "he " together, and the last character of that rotation (which is also the character before the "he ") will usually be "t", so the result of the transform would contain a number of "t" characters along with the perhaps less-common exceptions (such as if it contains "ache ") mixed in. So it can be seen that the success of this transform depends upon one value having a high probability of occurring before a sequence, so that in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).
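The "the" example from the Wikipedia explanation above can be reproduced directly. A toy sketch (my own sample sentence; with four "the"s, the rotations starting "he " sort next to each other, so the preceding "t"s land adjacent in the last column):

```python
def bwt(s):
    # naive Burrows-Wheeler transform with a '$' sentinel
    s += "$"
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

text = "the cat and the dog and the hen and the pig"
out = bwt(text)
# Rotations beginning "he " sort together, so the characters before them
# (all 't' here) end up adjacent in the last column as a run "tttt".
print(out)
```

The output is just a permutation of the input plus the sentinel, which is why the transform is reversible in principle.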
@MasterHigure 2 months ago
@@furbyfubar This raises the question of whether it could be advantageous to reverse the string before encoding. Which is more common? "The particular phrase XYZ is most often preceded by W", which this algorithm seems to like, or "The particular phrase WXY is most often followed by Z", which would favor reversal.
@furbyfubar 2 months ago
@@MasterHigure The algorithm only cares about *pairs* of two adjacent characters. Not sure if this means the input text order doesn't matter as it becomes symmetrical, or if there might be a difference between how often "a character can predict the next character" and how often "a character can predict the preceding character"? OK, so you *might* have just nerd-sniped me into trying to Burrows-Wheeler transform Hamlet, and Hamlet in reverse, using random webpages... The online tools I found were running the transformation server-side, so they either didn't respond to long texts or they enforced a character limit. So I ended up testing it on only the first 1000 characters of Hamlet instead, plus the same 1000 characters but backwards. After the transforms I then replaced all instances of repeated characters of the output with two characters. (Probably not fully realistic for compression, but good enough for a test.) The result was that the compressed original text input was a single character shorter than the compressed text with the input backwards. So it seems like it at least *can* matter, but probably doesn't matter enough to be worth much further research.
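The direction question above can also be seen on a toy input, without any server-side tools. A sketch (this is not the Hamlet experiment, just an illustration that direction can change the run-length-encoded size):

```python
from itertools import groupby

def bwt(s):
    # naive Burrows-Wheeler transform with a '$' sentinel
    s += "$"
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

def rle_len(s):
    """Length after naive <count><char> run-length coding."""
    return sum(1 + len(str(len(list(g)))) for _, g in groupby(s))

word = "BANANA"
print(rle_len(bwt(word)), rle_len(bwt(word[::-1])))  # 10 vs 8 on this input
```

Here the reversed input happens to compress better ("ANANAB" transforms to "BNN$AAA", which has one fewer run), so direction can matter either way depending on the text.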
@emilyyyylime- 2 months ago
​@@furbyfubar The algorithm does consider more than two characters, given that there is enough input data for patterns to arise. This is because the lexicographic sort considers not just the first character of the word but all characters in order. Some back-of-the-napkin maths tells me that out of 26³=17576 three-letter combinations, if just 10% appear commonly in English, your sample size is still about half as big as it should be to observe the most minimal of effects. I don't see any reason to be beholden to server-side 1000-character limits when you could just run a barebones script locally, even in your own browser. I think I understand peer reviews now.
@muskyoxes 2 months ago
The ultimate internet compression algorithm is to just decide not to have any ads or tracking. Boom - you've reduced the bits over the wire by 99%
@MooMooCow-ur1oz 2 months ago
😂
@dinhero21 2 months ago
Why does it "always work"? There was even a counter-example in the video where BANANA$ (7) got transformed (BWT + RLE) into 1A2N1B$2A (9). Also, why does lexicographically sorting all rotations of a string appended with a sentinel character and taking the last character of every rotation make it better at run-length encoding?
@AK-vx4dy 2 months ago
@@dinhero21 I think "always works" was about it being reversible while boosting repetitions at the same time (not always ideally, but still)
@minamagdy4126 2 months ago
It is impossible to make a compression algorithm that is guaranteed to output a shorter encoding than the original while remaining 1-1. What this manages to do is significantly improve on naive run-length encoding by making it far more likely to have duplicate adjacent letters without sacrificing injectivity.
@graphicode_ 2 months ago
Hellos! Yep the "always work?" was meant as a question in the title, and not a direct statement. In the video, the BANANA example is intentionally brought as a counterexample to show that it does fail. If I recall in the script I tried to mention that, but I think it could be emphasised more clearly that EVEN a compression algorithm has its limits. And as with any compression algorithm, there is always a compression limit :) Just added a pinned comment talking about this too. Regarding your second comment, do look at the video block here graphicode.simple.ink/ as I added an explanation there to help answer that. Thank you!
@donno2048 2 months ago
Meant to comment that too, it's literally impossible to create a compression algorithm that shrinks every given input
@idle-hands 2 months ago
@@donno2048 lossy compression
@imabstrong 2 months ago
High quality stuff! Keep up the good work
@graphicode_ 2 months ago
Thank you very much! :)
@clehaxze 2 months ago
Just me or the BGM sounds like Baba Is You?
@ziggyzoggin 2 months ago
it IS baba is you (I think)
@cheriunderscore 2 months ago
it IS (no doubt about it)
@hcn6708 2 months ago
baba is song
@notwithouttext 2 months ago
BANANA IS TEXT AND SHIFT
@xanderplayz3446 a month ago
That took me too long to realize.
@GameJam230 2 months ago
I mean, it can't ALWAYS work; that would violate the pigeonhole principle.
@Matyanson 2 months ago
The pigeonhole principle doesn't rule out compression whose output is always less than or equal to the input size, right? Now I wonder if such an algorithm exists
@Matyanson 2 months ago
Maybe I came up with a "trivial" one?! So every string has an end character "$" at the end, where the data ends. Given an input x of length n, follow the instructions: create an empty list of repetitions, then iterate over every character ch: 1. scan the following characters and count the repetitions of ch. The count is r. 2. create a tuple of the current ch index and r: [index, r]. 3. if sizeof(characters that would be removed) > sizeof(tuple): add the tuple to the list and remove all but one of the repeating characters from the input. Finally, append the list to the end of the input. Examples: "aaaaaaaaah$" -> "ah$09", "aaaah$" -> "ah$04", "aah$" -> "ah$02" is longer -> "aah$", "banana$" -> "banana$". Decoding "ah$09": 1. find the end of the data "$". 2. check every pair of numbers after it: 1. look at index 0: "a". 2. insert "a" 9-1 times at index 0. 3. remove the data after the "$" symbol -> "aaaaaaaaah$"
@Matyanson 2 months ago
Of course you could improve the algorithm, for example by having a list of number triplets. The tuple is: [index of character, repetition window, number of repetitions]. So, for example, repetition with a window of 2 would be "ananananan": "banana$" -> "bana$122" (too long) -> "banana$"; "bananana$" -> "bana$123". Window of 12: "best of the best of the best of the best!$" -> "best of the best!$0 12 3" (there would not be any spaces needed after $)
@Matyanson 2 months ago
Ok, I overlooked that I can't use the character that marks the end $ like this. The end of the data always needs to be marked so the program knows when to stop reading data. Some other character would need to be used in the middle that is not used otherwise... which is a problem. What if the string is structured like this: The program knows how long the data is so it knows when to stop. Now my idea is: If the algorithm decides to include the tuples, add a header before the actual string data like so: The symbol would be the maximum possible value of . This means the algorithm is unable to encode a string of maximum . But is able to encode any other value with the output size smaller or equal to the source.
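Matyanson's tuple idea above can be made concrete. A sketch under my own assumptions (the "index.count;" tuple syntax is invented here, and a real decoder would also need an unambiguous terminator, which is exactly the problem discussed above):

```python
def encode(s):
    """Collapse a run to one character and record an (index, count) tuple
    after the '$' terminator, but only when the tuple costs fewer characters
    than the run characters it removes. Output is never longer than the
    input plus the one-character terminator."""
    out, tuples, pos, i = [], [], 0, 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        run = j - i
        tup = f"{pos}.{run};"          # hypothetical tuple syntax
        if run - 1 > len(tup):          # only collapse when it is a net saving
            out.append(s[i])
            tuples.append(tup)
            pos += 1
        else:
            out.append(s[i] * run)
            pos += run
        i = j
    return "".join(out) + "$" + "".join(tuples)

print(encode("aaaaaaaaah"))  # ah$0.9;
```

Because a run is only collapsed when the tuple is strictly cheaper, the output size is bounded by the input size plus one, which is the "less than or equal" property discussed upthread.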
@GameJam230 2 months ago
@@Matyanson While it's true that the pigeonhole principle doesn't prevent COMPRESSING the data, decreasing the number of bits you represent data with can lead to conflicts where two files compress to the same result. Either that must occur for two or more pieces of input data, OR some input data cannot be compressed; the latter means the algorithm doesn't always work. In the event that a conflict occurs, you need more than one DECOMPRESSION function to return the data to its original state, and if you don't HAVE more than one decompression algorithm, then some conflicted compressed data cannot be returned to its original state. But compressed data that can't be DECOMPRESSED is not useful; you may as well be passing the data through a hashing function instead. So I'd say it's far more likely that the second case I mentioned is true: some input data cannot be reduced in size, and therefore the algorithm doesn't "always" work.
@isle_of_violets 2 months ago
your animation style is amazing ♥️
@xelxen 2 months ago
Great video! I can see your bright future, keep up the good work :D
@Asteria-kc9op 2 months ago
For anyone curious about the music used in the beginning - it's the Baba Is You soundtrack.
@fractalinfinity 2 months ago
Yes, and how do they feel about their music being used here? Did they give permission?
@graphicode_ 2 months ago
Yep they did! Don't worry, I have emailed to check and make sure :)
@fractalinfinity 2 months ago
@@graphicode_ oh good! It's a great track.
@roostydoo2 2 months ago
This is a great video for such a small channel. Subbed!
@graphicode_ 2 months ago
Thank you so much :D
@akam9919 a month ago
"Imagine you want to send a message to your best frie--" Guess I gotta imagine having friends.
@SliceJosiah 3 days ago
me too bro
@alexeydmitrievich5970 2 months ago
Great video. The cat looks very diligent
@panmelnyk 2 months ago
nice vid bud. I wish you good luck getting some more eyes on your content, you deserve it frnc
@CatCollective---- 2 months ago
i love how you explain comprehensively and graphically, step by step, how the Burrows-Wheeler algorithm scrambles and unscrambles the data! im interested in how encryption works, like how two private keys that are never shown to each other can be used to unlock the same locked box of encrypted data. can you make a video of it?
@graphicode_ 2 months ago
Sure! Though I think this YouTuber called @SpanningTree actually has a very good video on it already :)
@CatCollective---- 2 months ago
@@graphicode_ true, understanding that the encryption calculates the keys twice privately is easy enough to understand. though me and some others (as shown in the Spanning Tree comments) have a hard time grasping the concept that the result of each private calculation is halfway of the whole and would end up at the same results. i mean, modulo seems like a destructive calculation that points nowhere in the exponent table.. wondering how doing A then B is equal to doing B then A. its hard for me to explain the confusion myself. so i hope that you can fill in the gaps to clarify where most people get lost in the algorithm.
@graphicode_ 2 months ago
@@CatCollective---- hmm that is interesting! I'll take a look to see if I can make a video out of it
@cc12yt 2 months ago
Here before 100k views, the production quality is insane 😎
@cc12yt 2 months ago
Also, less than 1k subs? Such an underrated channel
@graphicode_ 2 months ago
Thank you! Haha appreciate those kind words
@YTomS 2 months ago
Awesome video, nice job 🙂.
@graphicode_ 2 months ago
Thank you for that! Glad you popped by :)
@papycoima 2 months ago
I've seen channels with more subs and less quality! Great work. I found the explanation just a tiny bit unclear at times, but that's probably cuz it's late and I'm half asleep lol
@graphicode_ 2 months ago
Thanks! Any specific areas which are unclear? Would love the feedback so I could write a better script next time
@papycoima 2 months ago
​​@@graphicode_ Just to let you know, I'm someone who needs to know everything down to the atomic level while learning something new :p That said, the three main doubts I have are: -Why does lexicographical sorting give you an easily-compressible string in the last column? -In the reverse process, how does re-sorting to the left and re-filling the last column to the right give you the original grid? -How is the quicker version of the reverse process equivalent to the longer one? Specifically the "jumping back to the ith occurrence of the letter" was a bit hard for me to make sense of
@Ykulvaarlck 2 months ago
you should do a video about suffix arrays, ideally SA-IS, because i've yet to find a simple explanation of it
@jakeaustria5445 2 months ago
Thank You
@004307ec 2 months ago
❤ thank you.
@isle_of_violets 2 months ago
i love the baba music
@djelilikejam 2 months ago
this music is familiar,,, [2 min later] OH WAIT ITS BABA !!!!!
@jcf_1760 25 days ago
Would it be possible to exclude the number 1 from the compressed string in the run-length encoding by saying/assuming any non-numbered letters have an invisible 1 in front of them after encoding? I'm just thinking that we could reduce any transform penalty by removing non-essential characters in the string that comes out of the transform.
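The suggestion above (an implicit count of 1) is easy to sketch. A minimal version, assuming the input itself contains no digits so counts and data stay unambiguous:

```python
from itertools import groupby

def rle_implicit_ones(s):
    """RLE where a count of 1 is left implicit: a letter with no digit in
    front of it means a run of one."""
    return "".join(ch if (n := len(list(g))) == 1 else f"{n}{ch}"
                   for ch, g in groupby(s))

print(rle_implicit_ones("ANNB$AA"))  # A2NB$2A: 7 chars instead of 10
```

On the video's BANANA example this removes exactly the penalty being asked about: the transformed string "ANNB$AA" encodes to 7 characters instead of 10, i.e. no worse than the input.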
@jaceg810 2 months ago
At 6:14, the bottom-most $ is incorrectly written as s. Unless there is a good reason for this?
@graphicode_ 2 months ago
Good catch! Nice eye - yea it is a mistake haha
@_notch 2 months ago
There is no such thing as a compression algorithm that always works. The title is clickbait. Edit: Proof: Try to compress the output of compressing some other data. For the result to decompress back to the compressed data and not to the original data, the encoding of that data cannot be the same as its own output. If the newly encoded data is smaller in size, repeat the process with that encoded data. Eventually you will find a counterexample where the output is bigger than the input.
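The iterate-until-it-grows argument above is easy to watch in practice. A sketch using zlib (DEFLATE) as a stand-in for any lossless compressor:

```python
import zlib

data = b"a" * 10_000
sizes = [len(data)]
for _ in range(5):
    # feed the compressor its own output, as in the proof above
    data = zlib.compress(data)
    sizes.append(len(data))
print(sizes)  # shrinks dramatically once, then grows every round after
```

Highly repetitive input shrinks enormously on the first pass, but the output is already near-incompressible, so every further pass only adds framing overhead.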
@graphicode_ 2 months ago
With any compression algorithm, there is indeed a limit to compression - be it AC, Lempel-Ziv or Huffman codes. I like your short and intuitive proof for the others watching, and I just woke up to many comments, so I'll be addressing the common caveats/things to note! Appreciate your comment :D
@sdjhgfkshfswdfhskljh3360 2 months ago
I think the explanation can be simplified. If you define "always works" as producing strictly smaller output, then iteratively applying the "always working" algorithm will eventually give zero-size and negative-size results, which makes no sense.
@AK-vx4dy 2 months ago
Nice explanation, but I'm concerned about the practical realisation: if we have a block of n chars we must build n+1 words and then sort them, which seems computationally and/or memory expensive.
@panmelnyk 2 months ago
always works != should be done.
@Zicrus 2 months ago
It can be computed very efficiently (I believe in both linear memory and time). The explanation in this video is just the general idea, to help understand it intuitively. You don't actually have to compute the n*n grid.
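The point above can be sketched concretely: with a sentinel that sorts first, sorting rotations is equivalent to sorting suffixes, so the last column can be read straight off a suffix array and the n-by-n grid never has to exist. (Naive comparison sort here for brevity; linear-time constructions such as SA-IS exist.)

```python
def bwt_via_suffix_array(s):
    """BWT without materialising the rotation grid: the i-th output
    character is the character just before the i-th sorted suffix."""
    s += "$"  # sentinel, sorts before every other character
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in sa)  # s[-1] is the sentinel itself

print(bwt_via_suffix_array("BANANA"))  # ANNB$AA
```

This matches the grid-based construction on the video's example while storing only one copy of the string plus an index array.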
@__Brandon__ 2 months ago
You can also just split the message into blocks of 1024 bytes and then encode the blocks separately
@AK-vx4dy 2 months ago
@@__Brandon__ Yes, but then you lose compression level, depending on the data of course
@AK-vx4dy 2 months ago
@@Zicrus Yes, I understand the purpose of the video, but I wondered if the author knows more ;) My intuition says I don't have to, but to sort them without materialising them some other tricks will be needed.
@DanielLenrd 2 months ago
"always works"? if that was true, you could repeat this algorithm until you get to 1 character
@LiEnby 2 months ago
lol god compression a single bit
@6XeNoN9 2 months ago
really nice video !
@sdjhgfkshfswdfhskljh3360 2 months ago
There is an easy way to see that no compression algorithm is perfect: just try to compress a file filled with pure random noise. *Every* compression algorithm will give output larger than input. If you are getting a smaller file, then your noise is not pure enough.
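The random-noise test above takes three lines to run. A sketch with zlib (DEFLATE) standing in for any general-purpose compressor:

```python
import os
import zlib

noise = os.urandom(100_000)      # "pure" random bytes
packed = zlib.compress(noise, level=9)
print(len(noise), len(packed))   # output is larger: only framing overhead added
```

DEFLATE falls back to stored (uncompressed) blocks on incompressible data, so the output is the input plus a few bytes of headers and checksum, never smaller.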
@YEWCHENGYINMoe 2 months ago
nice
@theunknown4834 2 months ago
Random qn, are you Singaporean?
@graphicode_ 2 months ago
Yes haha it's the accent eh?
@theunknown4834 2 months ago
​​@@graphicode_ too many times have I heard tuition recordings in Popular
@NakamuraSatou 2 months ago
This is so fun
@coolguyflex 2 months ago
What's with the Baba Is You music?
@Hi-gf4ts 2 months ago
I loved the video! I find compression techniques fascinating, and I had never heard of the Burrows-Wheeler Transform before. I made an implementation of it in JavaScript that compresses strings, but I ran into a problem. Even though it is consistently better than run-length compression on its own, it almost always seems to produce a string that is bigger than the original string. It only seems to produce a smaller output if the input is highly repetitive. Is this a known issue? Is there a workaround?
@ironfreak999 2 months ago
You could run into multiple issues here: 1. If you store characters A-Z and $, you won't fill up a byte per char (which is probably the default in JS). So you're effectively wasting like 3 bits per character. But this can be somewhat mitigated by dense bit-packing (or another encoding). 2. Suppose you don't care about those densely packed bytes and just count the characters. Then only certain repetitive patterns will be compressed by this algorithm, as you have observed. This is due to the nature of RLE adding a length to each repeated chunk. If there are fewer repeated chunks, you will have more length tokens in your compressed string. 3. If you try to compress random strings you almost always end up with more data (look up "incompressible string" on wiki, if you want). 4. There is actually no compression algorithm that always yields less data. Because of this, every compression algorithm will most likely fail for random data and heavily favours repetitive data. This can be shown by the number of unique strings per length. Suppose your string is n chars (bytes) long, so there are 256^n possible strings of that length. If every string were compressed to at most n-1 chars, there must be some pairs of strings which look the same when compressed, since 256^(n-1) < 256^n. These collisions make it impossible to recover the original string. So yes, this is a known issue with these kinds of compressions. A workaround would be to use a more suitable compression algorithm for the kind of data you want to compress. Usually there are algorithms for each use-case, for example images - jpg / videos - h264 / text - DEFLATE. If you want to dig more into text compression, Huffman encoding might be another thing to look into.
@Pystro 2 months ago
The problem is that all the Z's in the first column (which are all in a single block at the bottom) match with _all_ the different letters that precede Z's in the last column. Which means that the end of the transformed string (assuming it _has_ Z's in the first place) has, for example, a higher proportion of vowels than usual (because consonants like Z pair with vowels more often than with other consonants). But the *order* of all of those letters that precede Z's is still more 'random' than we'd like. To give the two most extreme examples, the block of U's in the first column is where _all_ of the Q's will inevitably appear, but it won't _only_ be Q's. (And similarly for spaces being preceded by all dots in the text, but not exclusively by dots.) And on the other hand, capital letters are _always_ preceded by space characters, unless you spell any words in ALL CAPS. (Well, one of them will be preceded by the sentinel symbol, and maybe some by quotation marks or parentheses.) It's why bzip doesn't just do a BWT and then RLE. It first uses a *BWT* - in order to turn correlations between di-grams into _higher concentrations_ of letters in certain parts of the transformed string; then it uses the *move-to-front algorithm* - to turn concentrations of similar letters (and also just higher occurrence rates of certain letters like E's) into higher occurrence rates for symbols with low IDs; and then it uses *entropy coding* - in order to turn the most frequent symbols (those with low IDs) into shorter bit codes. If you're playing around with it, I've always thought that using a custom alphabet (symbol order) might group the symbols even better. Even just one that is known ahead of time to sender and receiver could give significant improvements.
If you use a custom alphabet, you'd most definitely want to use it for the sorting (especially one where lower-case vowels and lower-case consonants are in separate groups, and all capital letters appear in an uninterrupted block of the alphabet; and you can probably do _something_ smart with the space character too). And for seeding the move-to-front algorithm in bzip you may want the same custom alphabet, the standard alphabet, or a second custom alphabet. (This one will mainly need to have the symbols in order of their frequency - or actually in the order in which they are most likely to appear in the BWT-transformed string.)
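The move-to-front step mentioned above is small enough to sketch. Each symbol is emitted as its current position in a table, which is then reordered, so clustered symbols (like a BWT output) become runs of small numbers that entropy coding squeezes well:

```python
def move_to_front(data, alphabet):
    """Move-to-front encoding: emit each symbol's current table position,
    then move that symbol to the front of the table. The initial table
    order is exactly the 'custom alphabet' seeding discussed above."""
    table = list(alphabet)
    out = []
    for ch in data:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

print(move_to_front("bbbaaa", "ab"))  # [1, 0, 0, 1, 0, 0]
```

Note how two runs of three symbols become mostly zeros regardless of which symbol is running, which is what makes the subsequent entropy-coding stage effective.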
@Serizon_ 2 months ago
baba is you song!!!!!!!!!!!!!!!!!!!!!!!
@luketurner314 2 months ago
About to comment the same thing. It's a great game
@edems131 2 months ago
I'm the 491st sub, great video btw
@xcoder1122 2 months ago
BWT is NOT a compression algorithm at all. The video explains that correctly, but the title claims it is; by itself BWT will not make data any smaller. And when combined with RLE and Huffman coding, as in bzip2, it will not always reduce data size. RLE needs one character to signal that an RLE sequence will follow (the indicator). The problem is that if every character can be a valid character of the data stream, the indicator must be encoded somehow when you really want that character and not an RLE sequence, and this encoding increases the data size. By using move-to-front encoding, you can ensure that the indicator is always the least frequently used character, but if all characters are used about equally often in the data stream, that buys you nothing. And if, on top of that, the data stream has no significant runs, not even after BWT, then the encoding overhead for the indicator will produce an even bigger data stream than your input stream. And as all characters are used about equally often, Huffman coding cannot reduce the size either but would only increase it even further. So it's easy to construct a data stream that is not compressible with BWT at all, that will even grow in size after RLE, and that cannot be further reduced by Huffman coding either.
@emurphy42 2 months ago
IIRC it's impossible for any lossless algorithm to compress 100% of inputs, because there are 2^N possible inputs with N bits, but only 2^N - 1 possible outputs with at most N - 1 bits, so something won't be able to fit no matter how clever the mapping is. (And no sneaky "add the extra bits to the filename", because you need room to store the filename too.) Of course many algorithms do a good job on the vast majority of inputs, and lossy algorithms can compress further because it's okay for multiple inputs to map to the same output.
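The counting in the comment above checks out and fits in a few lines:

```python
# For N-bit inputs there are 2**N distinct strings, but only 2**N - 1
# strings of length at most N-1 bits, so no lossless scheme can shorten
# every input: at least two inputs would have to share an output.
N = 8
inputs = 2 ** N
shorter_outputs = sum(2 ** k for k in range(N))  # lengths 0 .. N-1
print(inputs, shorter_outputs)  # 256 255
```

The shortfall is always exactly one string, but one is enough: some input must map to an output at least as long as itself.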
@xcoder1122 2 months ago
@@emurphy42 Many compression tools will at least not increase data sizes, as they detect that the data is not compressible by their compression method and then just store the data uncompressed. That way at least the data size will not grow, but it will not shrink either. Some also implement more than one method and will try to switch to another method, which sometimes can help. But your calculation is correct; of course there will always be an input that cannot be mapped to a smaller output with any given set of methods.
@reddragonflyxx657 2 months ago
​@@xcoder1122 Even if a compression tool doesn't compress the data, it has to indicate that the data wasn't compressed, so the decompression tool doesn't interpret it as a compressed data stream.
@graphicode_ 2 months ago
Yep, you are right. I guess BWT in itself is not very useful; it has to be coupled with other things. BWT is more of a... pre-processing/transformation technically, but I was struggling to find the right word to encapsulate "block-sorting compression/transform + RLE" in the title. I did feel quite iffy and expected comments like these to point it out, so I am happy that there are people who notice! I'll add a pinned comment explaining things in more detail. As with compression, analyzing it from an info theory perspective would be extremely cool and insightful. I'm very glad you brought up HC in this too!
@Jake28 2 months ago
"The compression algorithm that always works?". I'm writing this in my comment in case the title of the video is changed later. I'm assuming that by 'always works', we mean that it always compresses things to be smaller than they currently are. I hold that this is a reasonable assumption to make from the title of the video. Here's the issue: that's impossible. If a hypothetical compression algorithm that acted in that way existed, we would be able to compress any message into any arbitrary amount of space. We can't. So, the situation proposed in the title of the video is impossible. I noticed the pinned comment after I finished writing this, but I'll still post it anyways. I might as well be a part of the record of people pointing out the inaccuracy, especially considering that despite a correction being posted in the comments, the title remains unchanged. At least add an asterisk or something in there... Congrats on making a video geared towards computer science audiences while simultaneously making an error obvious enough that most of them would probably click the video out of spite? I guess that's one way to game the algorithm!
@Jake28 2 months ago
Oh nice, they changed the title. Comment retracted...
@graphicode_ 2 months ago
Sorry for the delay! Took a while to think of a good substitute and was caught up with work... Anyhow, appreciate your thoughts, as with many others :)) All forms of opinions, good or bad, are important to me
@NeunEinser 2 months ago
Neat!
@MatthewFearnley 2 months ago
Wow, the BWT is like nothing I've ever seen before. But I'm struggling to understand what's going on in the reverse algorithm...
@graphicode_
@graphicode_ 2 ай бұрын
Which section is confusing? If you are interested - feel free to join the discord channel and perhaps we could discuss as a community too!
@MatthewFearnley
@MatthewFearnley 2 months ago
@@graphicode_ Hey, thanks for your reply. I understand how we can get the first column from lexicographically sorting the given last column. But for me things get kind of hazy from there. I'm not hugely into Discord communities, but I appreciate the invitation. If you think it's going to be a better place to ask and explore, maybe I'll take a look.
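In case a concrete version helps with the hazy part: the naive inverse builds on exactly that observation. You repeatedly prepend the known last column to the table and re-sort; after n rounds each row is a full sorted rotation, and the row ending in the sentinel is the original string. A minimal sketch, assuming the "$" sentinel convention (the faster LF-mapping method avoids rebuilding the table, but this version is easier to follow):

```python
def inverse_bwt(last: str) -> str:
    """Rebuild the original string from the BWT last column (naive O(n^2 log n))."""
    n = len(last)
    table = [""] * n
    for _ in range(n):
        # Prepend the last column to every row, then sort; sorting the
        # single characters alone already recovers the first column.
        table = sorted(last[i] + table[i] for i in range(n))
    row = next(r for r in table if r.endswith("$"))  # rotation ending in the sentinel
    return row[:-1]  # drop the sentinel

print(inverse_bwt("annb$aa"))  # "annb$aa" is the BWT of "banana"; this recovers it
```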
@graphicode_
@graphicode_ a month ago
Hey sorry for the late reply - kinda missed this. It's hard to type out explanations here - if you email me I can insert some diagrams to help as well, if you want of course!
@MatthewFearnley
@MatthewFearnley a month ago
@@graphicode_ Hey, no worries, thanks for getting back to me. If you think it will be helpful for others, I think I might actually join the Discord. Is there a particular channel I should post in?
@graphicode_
@graphicode_ a month ago
@@MatthewFearnley Heyya - started a channel called questions for vids which I felt could be helpful in the long run since my answers can be grouped topically too!
@edhyjoxenbyl1409
@edhyjoxenbyl1409 2 months ago
BABA music!!
@tdubmorris5757
@tdubmorris5757 24 days ago
baba is video background music
@grahamcracker1234
@grahamcracker1234 2 months ago
commenting for engagement
@StefanReich
@StefanReich 2 months ago
"i-the" should read "i-th" Good work otherwise, nice explanation
@graphicode_
@graphicode_ 2 months ago
Thanks for the catch :D and appreciate it
@sdspivey
@sdspivey 2 months ago
Ok, now compress "abcdef".
@BridgeBum
@BridgeBum 2 months ago
Yup, RLE cannot work on all data blocks. That's often why real-world implementations use multiple algorithms. Having said that, bytes can only have 256 possible values, so longer data blocks can guarantee repetition. It's also why textual data tends to compress better than binary blocks: higher frequencies of individual characters.
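The failure mode is easy to see directly: on data with no runs, count-prefixed RLE doubles the size, which is exactly why practical formats chain several stages rather than relying on RLE alone. A quick sketch of the blow-up:

```python
def rle(s: str) -> str:
    """Run-length encode as <count><char> pairs."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1  # extend the current run
        out.append(f"{j - i}{s[i]}")
        i = j
    return "".join(out)

print(rle("aaaaaaaa"))  # 8a  -- a long run shrinks to 2 characters
print(rle("abcdefgh"))  # 1a1b1c1d1e1f1g1h -- no runs, output doubles in size
```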
@wanfuse
@wanfuse 2 months ago
f
@nisonatic
@nisonatic 2 months ago
@@wanfuse I had heard rumors of the Pay Respect algorithm, but never seen it.
@wanfuse
@wanfuse 2 months ago
110
@evnnxi
@evnnxi 2 months ago
abcdef&
&abcdef
f&abcde
ef&abcd
def&abc
cdef&ab
bcdef&a

&abcdef
abcdef&
bcdef&a
cdef&ab
def&abc
ef&abcd
f&abcde

f&abcde
1f1&1a1b1c1d1e

just don't compress this, I guess.
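The rotation table in the comment above can be checked mechanically. A tiny sketch, using "&" as the sentinel exactly as in that example: the transform of "abcdef&" comes out run-free, so RLE only adds overhead here.

```python
def bwt(s: str) -> str:
    """Sort all rotations of s (sentinel already appended), take the last column."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("abcdef&"))  # f&abcde -- no repeated characters, nothing for RLE to exploit
```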
@coderized
@coderized 2 months ago
1A69N1B1$69A
@graphicode_
@graphicode_ 2 months ago
👀
@micmal2967
@micmal2967 2 months ago
There is no compression algorithm that always works. An algorithm that always worked would have to map every string of length 1 to n independently to a new string, with the total length of the mapped strings less than the total length of the originals. Since there are fewer shorter strings than inputs, at least two strings would be mapped to the same output string, and the mapping couldn't be reversed.
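The counting behind this pigeonhole argument can be checked directly. A sketch over binary strings: a compressor that strictly shrinks every input of length 1..n can only emit strings of length 0..n-1, and there are always fewer of those, so some two inputs must collide.

```python
def count_strings(max_len: int):
    """Count binary strings of length 1..max_len vs. all strictly shorter outputs
    (length 0..max_len-1) available to an 'always shrinks' compressor."""
    inputs = sum(2 ** k for k in range(1, max_len + 1))
    shorter = sum(2 ** k for k in range(max_len))
    return inputs, shorter

ins, outs = count_strings(10)
print(ins, outs)  # 2046 1023 -- strictly fewer possible outputs than inputs
```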
@logitech4873
@logitech4873 2 months ago
Here before 100K views :)
@notwithouttext
@notwithouttext 2 months ago
wheelbarrow transform
@Khusyasy
@Khusyasy 2 months ago
ah yes baba soundtrack
@dingfengwong1769
@dingfengwong1769 a month ago
Singaporean
@TheMasonX23
@TheMasonX23 2 months ago
"Evil Telecommunications company" is a bit redundant, isn't it?
@graphicode_
@graphicode_ 2 months ago
:P perhaps perhaps
@alexdefoc6919
@alexdefoc6919 2 months ago
Do the dab 😂❤
@Stztic
@Stztic 2 months ago
Cat, :3
@StefaanHimpe
@StefaanHimpe 2 months ago
if it always works you can compress anything to 0 bits - good luck restoring :d. Not worth my time watching....
@Kurushimi1729
@Kurushimi1729 2 months ago
What's with the stupid title? A compression algorithm that always works is obviously impossible. Is this for clickbait?