i spent an entire evening looking for aho-corasick resources, but this video really made it click for me. especially at 6:40 when you run through the whole structure. thanks so much!!
@niemasd2 жыл бұрын
No problem! I'm glad you enjoyed the video :-)
@juan7354 Жыл бұрын
I had the same exact experience! Having someone walk through the process really made things clear to me!
@tarsala19953 жыл бұрын
That one is a god damn good explanation. Big props for Aho and Corasick
@cemtunaboylu84212 жыл бұрын
There actually is a catch here. Although the string matching machine still performs O(n), you may have forgotten to mention there are failure links to account for too. In the paper, the number of failure link transitions is said to be bound by the depth of the state, i.e. the length of the shortest path from initial state to current state. So in a worst case scenario, even though each iteration only has one successful state transition, it may have d-1 fail link transitions. In a worst case scenario, the longest sequence can be as big as the query string, thus d is n. Finally since there are d-1 failure link transitions at most, and exactly one successful state transition for each symbol in the query string, total bound is < 2n. O(2n) -> O(n). I think it is important notice that. Thank you for the video.
@niemasd2 жыл бұрын
Thank you for the additional context! This is super helpful
@shikharsav2 жыл бұрын
Great video, very clearly explained. Thanks! I think arguing the running time is more involved than this. One could spend up to k time on a failure (as someone pointed out) but with that the running time would be O(n*k) since we could potentially fail at every character. To arrive at O(n) running time, we need to make another observation. Every time we stay on the same input character and follow some failure links let's say f, we know we had at least f successes in the past (otherwise we wouldn't be able to make our way down the trie). That means #failures for the entire duration is proportional to n, giving us the O(n) bound.
@dulanjanaliyanagama38233 жыл бұрын
Great and simple explanation in a nut shell. Thank you very much.
You really explained the algorithm clearly. Well done, bro. Big thanks
@jonathanbryanschmalhofer98894 жыл бұрын
Best explanation I could find so far, thanks.
@taivinh19863 жыл бұрын
the best explanation for this complicated algorithm
@neutt46304 жыл бұрын
This is the quality content that youtube needs more
@zhendongtian-d7t Жыл бұрын
Such an amazing teacher! Thanks!
@Metaloid-wv4kz3 жыл бұрын
QUESTION!: So the whole idea is to make searching faster by not starting at the start of the tree root anymore, and instead use dashed line links to start farther along a branch?
@niemasd3 жыл бұрын
Precisely!
@ttttully3 жыл бұрын
Great explanation, thank you!
@jay5929 Жыл бұрын
cool
@katze2775 Жыл бұрын
OMG Very clear!!!! thank you for sharing
@mskiptr3 жыл бұрын
Great explanation! Clear and interesting. I have one question however - what should happen to a (sub)string GCT? Since the root has no outgoing T arc, we will fail indefinitely… I guess, it would work if we created some self-loops (between the root and the root) for every otherwise undefined character. Ok, it seems I kind-of already answered it myself. ; ) Still, if someone knows an alternative approach I would gladly read about it!
@niemasd3 жыл бұрын
You are correct: once you go from root to root via the self-pointing failure link, you just move on to the next character
@elisey20282 жыл бұрын
Awesome explanation! Thank you!
@Squirrelschaser3 жыл бұрын
One confusing thing about actually implementing this, I realized, is that you don't "traverse" the suffix links at all when you're querying. The suffix links help you build additional blue links during the linking process.
@niemasd3 жыл бұрын
Precisely! We discuss "dictionary links" a couple videos later in this playlist
@Gentleman217 Жыл бұрын
Hey Niema, Thank you for the video. Which pen and board are you using? I need to buy one.
@niemasd Жыл бұрын
No problem! And I used a Wacom One tablet. Here's a write-up on my process for making the videos: github.com/niemasd/teaching/blob/master/Tutorials/VideosKhanAcademy.md
@Gentleman217 Жыл бұрын
@@niemasd Thank you!
@Gentleman217 Жыл бұрын
@@niemasd oww that repo is exactly what I'm looking for 😁
@niemasd Жыл бұрын
@@Gentleman217 Awesome, glad to hear it! 🙂
@shiroclown61312 жыл бұрын
What's the worst time for building the structure?
@shelleyfan99813 жыл бұрын
very intuitive explanation. Appreciate it!
@ulq71593 жыл бұрын
Over a year later... still a good explanation for useful info, thanks
@niemasd3 жыл бұрын
I'm glad you're enjoying it! :-)
@steniorj Жыл бұрын
Great video. Thank you! I have one question, though. When building a failure link for the character "c" you said we must choose the longest suffix. Why is the longest suffix in "CAT" and not in "GCG"? Does that mean that "AT" is a suffix longer than "G"?
@niemasd Жыл бұрын
Can you specify the timestamp? I'm not 100% sure which step you are referring to. In general, we draw a failure link from node u to whatever prefix in the tree (i.e., whatever path starting at the root) matches the longest possible suffix of u (i.e., the longest possible path ending at u).
@steniorj Жыл бұрын
@@niemasd I get it now. Thank you my friend!
@widezenico2 жыл бұрын
great!, this video is very fun :D
@BlastinRope3 жыл бұрын
Good video and succint presentation
@rc_woshimao957Ай бұрын
PROFESSOR NIEMA LE GOAT
@zhongyuanchen8424 Жыл бұрын
what is the time complexity for building the automaton?
@niemasd Жыл бұрын
Using a clever algorithm, we can construct an Aho-Corasick Automaton in O(mk). These slides might be helpful (starting from Slide 60 for the efficient Aho-Corasick Automaton building algorithm): web.stanford.edu/class/archive/cs/cs166/cs166.1166/lectures/02/Small02.pdf Note that "m" in the slides equates to our "n" (the length of the long string), and "n" in the slides equates to our "mk" (the total number of characters in all of our short strings)
@zhongyuanchen8424 Жыл бұрын
@@niemasd Thank you so much for taking your time to find me the reference. You are truly incredible. Much appreciated.😀
@utkarshkathuria29313 жыл бұрын
Is there a rule like only the patterns whose last letter is the leaf node, can be said to be detected from the given input string?
@niemasd3 жыл бұрын
No, it is not necessary for the last letter of every word in our database to correspond to a leaf node: it is possible for words to end at internal nodes, and we will still be able to detect them (there will be examples of this in the later videos on this topic). It just so happens that this specific example only has words that end at leaves
@human-011 Жыл бұрын
I cracked it, its om 🕉 😂
@anushalihala64043 жыл бұрын
Great explanation! What happens with the database contains words of different lengths? for instance if 'a' was in our database but after 'cat' we took the failure link to 'at' skipping 'a'
@niemasd3 жыл бұрын
Great insight! There would be a "dictionary link" from the node ROOT->C->A to the node ROOT->A, so as you scan the word "CAT", you would start at the root, traverse the C edge, traverse the A edge, see there's a dictionary link (and output the word corresponding to the dictionary link), and then continue down the T edge
@manpreetsingh41064 жыл бұрын
This is very helpful
@ayushbachan61133 жыл бұрын
great explanation
@shahrozsaleem347111 ай бұрын
How will this automaton help in the following scenario? I have keywords to search are: - GATEWAYS - ATE I want to search them in the following text: “The gateway is long” Here when we will be at vertex from an edge ‘y’ then we will not have any child so we will go through failure link to root. But during this we missed that ATE existed.
@niemasd11 ай бұрын
Excellent question! This is actually described in the next video of the series. Specifically, we will be able to catch those cases via "dictionary links": kzbin.info/www/bejne/hXeuqYp8mtySgpI
@KanagaveluSugumar4 жыл бұрын
Excellent!! Thank you!!!
@richardharris97083 жыл бұрын
Good explanation but would be good to mention the dictionary links too!
@niemasd3 жыл бұрын
Great insight! They're discussed in the next video in the series :-) kzbin.info/www/bejne/hXeuqYp8mtySgpI
@idan3073 жыл бұрын
How would you cover a case like a string of "mnopqr", and strings to find: "mnop" and "no"? what are the failure links gonna look like?
@niemasd3 жыл бұрын
If the only words in the Aho-Corasick Automaton are "mnop" and "no", there is a failure link from "mn" to "n" and another failure link from "mno" to "no"
@calebdecoux26763 жыл бұрын
@@niemasd With your video example, we will just match "mnop" and miss "no". Do we just have to check the failure node every time to see if it's also a word node? In video example it doesn't cover this situation where a word is matched while you're traversing in a different path.
@drunken_parrot Жыл бұрын
thank you so much
@julianferres4 жыл бұрын
Amazing and clear explanation. Good Job!
@amitavamozumder733 жыл бұрын
what if there's no g in teh words, but i got a g in the stream, should i keep checking failure links till i hit the root and discard that letter?
@niemasd3 жыл бұрын
Exactly: you would keep following failure links until you hit the root, then you would follow the root's failure link, and just discard that letter
@amitavamozumder733 жыл бұрын
@@niemasd thanks, one more question, if i am finding subsequece instead of substrings and i find a non existent character, i should just stay in last failure link right?
@rahulbawa39694 жыл бұрын
Great explanation. Can you please tell what is this black board that you used in this video ?
@niemasd4 жыл бұрын
Sure! Here's a complete guide of how I created these videos: github.com/niemasd/teaching/blob/master/Tutorials/VideosKhanAcademy.md
@rahulbawa39694 жыл бұрын
@@niemasd Wow! Thanks for a quick reply. Really appreciate it.
@codejunky9831 Жыл бұрын
wow so neat :)
@Jacked_Nerd4 жыл бұрын
Thank you
@sovanraksa21124 жыл бұрын
Absolutely awesome👏 keep it up bro
@chahatagrawal14414 жыл бұрын
your explanation is awesome !!!
@shameekagarwal48724 жыл бұрын
that was really good... a 3 minute pseudo code walkthrough for failure links would have been good though
@somiljain76704 жыл бұрын
Keep up the good work.
@alamjim61174 жыл бұрын
Thanks It was great explanation
@ShubhamKumar-qz6yz4 жыл бұрын
I am confused, ACC and GCG also exist right? But we didn't find those while going through the main word
@niemasd4 жыл бұрын
I'm not sure what you mean by "also exist": ACC and GCG exist in our database of short sequences, but remember that we *only* care about short sequences that occur in our long query sequence. Because our long query sequence didn't contain ACC and GCG, neither were found during our search (which is the behavior we want)
@ShubhamKumar-qz6yz4 жыл бұрын
@@niemasd ooh I see. Thank you for the explanation. Makes a lot more sense now. 🙏🙏