Longest common substring problem suffix array

Рет қаралды 40,484

WilliamFiset

Күн бұрын

Пікірлер: 28

@MauriPastorini 6 жыл бұрын

Great video man! Please keep doing these videos!

@FireEx-br8mq 3 жыл бұрын

How do you actually construct the suffix arrays where colors interlace together ?

@litianyi476 3 жыл бұрын

Thanks for your amazing video. A min-heap is also good to solve when sliding.

@hanzhoutang9235 3 жыл бұрын

Thanks for uploading the video. It's really clear and smart!

@gaurishgangwar 2 ай бұрын

Why do we need to expand and shrink the window? Can we do it by sliding window of constant size (=K)?

@user-or7ji5hv8y 5 жыл бұрын

this is a really good explanation!

@patryk_49 2 жыл бұрын

In C the last sentinel value is already provided for you.

@ts-ny6mx 4 жыл бұрын

I love this video, which is very easy to follow. I have a question, though: how do you actually compare the 'colours' at 4:31? I guess it would be to find the 'nearest senitel' (in this case #, % and $, respectively), but I'm not sure.

@tenthlegionstudios1343 3 жыл бұрын

Yes you can do this, there are a lot of other options as well. I would suggest watching MITs lecture on suffix trees, its in depth and ends with Suffix arrays, suffix array construction, and LCP arrays. Goes in depth about how a SA and a LCP array are just a compressed trie. MIT lecture 16. Strings, I think.

@ts-ny6mx 3 жыл бұрын

@@tenthlegionstudios1343 Great, thank you!

@JobvanderZwan 3 жыл бұрын

Isn't this the approach used by femtozip to build an effective "dictionary" string to prepend to the input, so that the actual strings in the text compress really well?

@uknow2908 2 жыл бұрын

Sorry, but I lost you on colors. I don't understand how colors are determined.

@charaiveticharaiveti8414 4 жыл бұрын

Correction 8:08 ..more of these to be greater than *zero.

@user-ov5nd1fb7s 6 жыл бұрын

Awesome, thanks.

@user-ov5nd1fb7s 5 жыл бұрын

The sentinels must be unique,. What happens if you have a lot of strings. Surely, eventually you are going to ran out of unique ones. Maybe i am missing something?

@WilliamFiset-videos 5 жыл бұрын

Correct, the solution is to map your alphabet to the natural numbers and shift up by the number of sentinels you need. This allows you to always have sentinels between the values say [1,N] and your alphabet above that. This trick makes the suffix array scaleable, but you need to undo the shift the decode the true value stored in the suffix array

@user-ov5nd1fb7s 5 жыл бұрын

@@WilliamFiset-videos thanks

@androiddl3840 5 жыл бұрын

@@WilliamFiset-videos Is there any real need to use unique sentinels though? I am guessing the only reason we do so is so that when we we construct the LCP array (by comparing how many characters adjacent suffixes have in common) we don't count the sentinel value in the case where two sentinels happen to be at the same index in both the suffixes we are comparing. EG ABC# ABC$ this would yield 3, as opposed to (at least) 4, which would be the case if the sentinels were all the same: ABC# ABC# However, if we always use the same sentinel, which cannot appear in any of the strings, why not just compare each character we encounter in each string, and when it is the (known) sentinel, we know to stop counting? So in this case, we would get to 3, then stop because we have hit the sentinel... Or have I missed something?

@victordeluca7360 3 жыл бұрын

@@androiddl3840 I may have misunderstood what you said, but if your idea is to iterate over every character of every suffix until you find a sentinel character, that approach is too slow and kind of kills the purpose of using a suffix array. We use a suffix array precisely to avoid iterating over the characters of our strings too many times.

@androiddl3840 3 жыл бұрын

@@victordeluca7360 thanks for the reply :) When building the LCP array, we calculate all the suffixes, then sort them lexicographically, then we deduce how many characters each pair has in common. So say I have concatenated multiple strings and separated them with the sentinel '$'. We have the LCP array constructed and sorted, and we are now deducing the number of common characters. Say that we have ABC$E ABC$F as part of the sorted LCP array. I want to deduce the number of characters in common between these; clearly it is 3, because we don't count the sentinel character, $. Lets call these strings s1 and s2. We can then say this to calculate the common character count: int len = min(s1.length(), s2.length()); int count = 0, index = 0; while (index < len - 1 && (s1[index] == s2[index] && s1[index] != '$')) { ++index; ++count; } If the sentinels were unique, say ABC$E ABC%F ...then the algorithm would be simpler, since we don't need to check each character against the sentinel: int len = min(s1.length(), s2.length()); int count = 0, index = 0; while (index < len - 1 && (s1[index] == s2[index]) { ++index; ++count; } So, you are right in that using the same sentinel is slightly less efficient as a consequence of this extra check - there will indeed be many branch mispredictions involved - but I'd argue that the time spent generating unique sentinels may be greater than the cost of these, hence it would be more inefficient. What do you think?