How do you actually construct the suffix arrays where colors interlace together ?
@litianyi4763 жыл бұрын
Thanks for your amazing video. A min-heap is also good to solve when sliding.
@hanzhoutang92353 жыл бұрын
Thanks for uploading the video. It's really clear and smart!
@gaurishgangwar2 ай бұрын
Why do we need to expand and shrink the window? Can we do it by sliding window of constant size (=K)?
@user-or7ji5hv8y5 жыл бұрын
this is a really good explanation!
@patryk_492 жыл бұрын
In C the last sentinel value is already provided for you.
@ts-ny6mx4 жыл бұрын
I love this video, which is very easy to follow. I have a question, though: how do you actually compare the 'colours' at 4:31? I guess it would be to find the 'nearest senitel' (in this case #, % and $, respectively), but I'm not sure.
@tenthlegionstudios13433 жыл бұрын
Yes you can do this, there are a lot of other options as well. I would suggest watching MITs lecture on suffix trees, its in depth and ends with Suffix arrays, suffix array construction, and LCP arrays. Goes in depth about how a SA and a LCP array are just a compressed trie. MIT lecture 16. Strings, I think.
@ts-ny6mx3 жыл бұрын
@@tenthlegionstudios1343 Great, thank you!
@JobvanderZwan3 жыл бұрын
Isn't this the approach used by femtozip to build an effective "dictionary" string to prepend to the input, so that the actual strings in the text compress really well?
@uknow29082 жыл бұрын
Sorry, but I lost you on colors. I don't understand how colors are determined.
@charaiveticharaiveti84144 жыл бұрын
Correction 8:08 ..more of these to be greater than *zero.
@user-ov5nd1fb7s6 жыл бұрын
Awesome, thanks.
@user-ov5nd1fb7s5 жыл бұрын
The sentinels must be unique,. What happens if you have a lot of strings. Surely, eventually you are going to ran out of unique ones. Maybe i am missing something?
@WilliamFiset-videos5 жыл бұрын
Correct, the solution is to map your alphabet to the natural numbers and shift up by the number of sentinels you need. This allows you to always have sentinels between the values say [1,N] and your alphabet above that. This trick makes the suffix array scaleable, but you need to undo the shift the decode the true value stored in the suffix array
@user-ov5nd1fb7s5 жыл бұрын
@@WilliamFiset-videos thanks
@androiddl38405 жыл бұрын
@@WilliamFiset-videos Is there any real need to use unique sentinels though? I am guessing the only reason we do so is so that when we we construct the LCP array (by comparing how many characters adjacent suffixes have in common) we don't count the sentinel value in the case where two sentinels happen to be at the same index in both the suffixes we are comparing. EG ABC# ABC$ this would yield 3, as opposed to (at least) 4, which would be the case if the sentinels were all the same: ABC# ABC# However, if we always use the same sentinel, which cannot appear in any of the strings, why not just compare each character we encounter in each string, and when it is the (known) sentinel, we know to stop counting? So in this case, we would get to 3, then stop because we have hit the sentinel... Or have I missed something?
@victordeluca73603 жыл бұрын
@@androiddl3840 I may have misunderstood what you said, but if your idea is to iterate over every character of every suffix until you find a sentinel character, that approach is too slow and kind of kills the purpose of using a suffix array. We use a suffix array precisely to avoid iterating over the characters of our strings too many times.
@androiddl38403 жыл бұрын
@@victordeluca7360 thanks for the reply :) When building the LCP array, we calculate all the suffixes, then sort them lexicographically, then we deduce how many characters each pair has in common. So say I have concatenated multiple strings and separated them with the sentinel '$'. We have the LCP array constructed and sorted, and we are now deducing the number of common characters. Say that we have ABC$E ABC$F as part of the sorted LCP array. I want to deduce the number of characters in common between these; clearly it is 3, because we don't count the sentinel character, $. Lets call these strings s1 and s2. We can then say this to calculate the common character count: int len = min(s1.length(), s2.length()); int count = 0, index = 0; while (index < len - 1 && (s1[index] == s2[index] && s1[index] != '$')) { ++index; ++count; } If the sentinels were unique, say ABC$E ABC%F ...then the algorithm would be simpler, since we don't need to check each character against the sentinel: int len = min(s1.length(), s2.length()); int count = 0, index = 0; while (index < len - 1 && (s1[index] == s2[index]) { ++index; ++count; } So, you are right in that using the same sentinel is slightly less efficient as a consequence of this extra check - there will indeed be many branch mispredictions involved - but I'd argue that the time spent generating unique sentinels may be greater than the cost of these, hence it would be more inefficient. What do you think?