Advanced Data Structures: Aho-Corasick Automaton

  Рет қаралды 62,267

Niema Moshiri

Niema Moshiri

Күн бұрын

Пікірлер: 82
@Zashxq
@Zashxq 2 жыл бұрын
i spent an entire evening looking for aho-corasick resources, but this video really made it click for me. especially at 6:40 when you run through the whole structure. thanks so much!!
@niemasd
@niemasd 2 жыл бұрын
No problem! I'm glad you enjoyed the video :-)
@juan7354
@juan7354 Жыл бұрын
I had the same exact experience! Having someone walk through the process really made things clear to me!
@tarsala1995
@tarsala1995 3 жыл бұрын
That one is a god damn good explanation. Big props for Aho and Corasick
@cemtunaboylu8421
@cemtunaboylu8421 2 жыл бұрын
There actually is a catch here. Although the string matching machine still performs O(n), you may have forgotten to mention there are failure links to account for too. In the paper, the number of failure link transitions is said to be bound by the depth of the state, i.e. the length of the shortest path from initial state to current state. So in a worst case scenario, even though each iteration only has one successful state transition, it may have d-1 fail link transitions. In a worst case scenario, the longest sequence can be as big as the query string, thus d is n. Finally since there are d-1 failure link transitions at most, and exactly one successful state transition for each symbol in the query string, total bound is < 2n. O(2n) -> O(n). I think it is important notice that. Thank you for the video.
@niemasd
@niemasd 2 жыл бұрын
Thank you for the additional context! This is super helpful
@shikharsav
@shikharsav 2 жыл бұрын
Great video, very clearly explained. Thanks! I think arguing the running time is more involved than this. One could spend up to k time on a failure (as someone pointed out) but with that the running time would be O(n*k) since we could potentially fail at every character. To arrive at O(n) running time, we need to make another observation. Every time we stay on the same input character and follow some failure links let's say f, we know we had at least f successes in the past (otherwise we wouldn't be able to make our way down the trie). That means #failures for the entire duration is proportional to n, giving us the O(n) bound.
@dulanjanaliyanagama3823
@dulanjanaliyanagama3823 3 жыл бұрын
Great and simple explanation in a nut shell. Thank you very much.
@impavid8125
@impavid8125 2 жыл бұрын
Absolutely incredible explanation. Easily understandable.DN
@aria5389
@aria5389 Жыл бұрын
You really explained the algorithm clearly. Well done, bro. Big thanks
@jonathanbryanschmalhofer9889
@jonathanbryanschmalhofer9889 4 жыл бұрын
Best explanation I could find so far, thanks.
@taivinh1986
@taivinh1986 3 жыл бұрын
the best explanation for this complicated algorithm
@neutt4630
@neutt4630 4 жыл бұрын
This is the quality content that youtube needs more
@zhendongtian-d7t
@zhendongtian-d7t Жыл бұрын
Such an amazing teacher! Thanks!
@Metaloid-wv4kz
@Metaloid-wv4kz 3 жыл бұрын
QUESTION!: So the whole idea is to make searching faster by not starting at the start of the tree root anymore, and instead use dashed line links to start farther along a branch?
@niemasd
@niemasd 3 жыл бұрын
Precisely!
@ttttully
@ttttully 3 жыл бұрын
Great explanation, thank you!
@jay5929
@jay5929 Жыл бұрын
cool
@katze2775
@katze2775 Жыл бұрын
OMG Very clear!!!! thank you for sharing
@mskiptr
@mskiptr 3 жыл бұрын
Great explanation! Clear and interesting. I have one question however - what should happen to a (sub)string GCT? Since the root has no outgoing T arc, we will fail indefinitely… I guess, it would work if we created some self-loops (between the root and the root) for every otherwise undefined character. Ok, it seems I kind-of already answered it myself. ; ) Still, if someone knows an alternative approach I would gladly read about it!
@niemasd
@niemasd 3 жыл бұрын
You are correct: once you go from root to root via the self-pointing failure link, you just move on to the next character
@elisey2028
@elisey2028 2 жыл бұрын
Awesome explanation! Thank you!
@Squirrelschaser
@Squirrelschaser 3 жыл бұрын
One confusing thing about actually implementing this, I realized, is that you don't "traverse" the suffix links at all when you're querying. The suffix links help you build additional blue links during the linking process.
@niemasd
@niemasd 3 жыл бұрын
Precisely! We discuss "dictionary links" a couple videos later in this playlist
@Gentleman217
@Gentleman217 Жыл бұрын
Hey Niema, Thank you for the video. Which pen and board are you using? I need to buy one.
@niemasd
@niemasd Жыл бұрын
No problem! And I used a Wacom One tablet. Here's a write-up on my process for making the videos: github.com/niemasd/teaching/blob/master/Tutorials/VideosKhanAcademy.md
@Gentleman217
@Gentleman217 Жыл бұрын
@@niemasd Thank you!
@Gentleman217
@Gentleman217 Жыл бұрын
@@niemasd oww that repo is exactly what I'm looking for 😁
@niemasd
@niemasd Жыл бұрын
@@Gentleman217 Awesome, glad to hear it! 🙂
@shiroclown6131
@shiroclown6131 2 жыл бұрын
What's the worst time for building the structure?
@shelleyfan9981
@shelleyfan9981 3 жыл бұрын
very intuitive explanation. Appreciate it!
@ulq7159
@ulq7159 3 жыл бұрын
Over a year later... still a good explanation for useful info, thanks
@niemasd
@niemasd 3 жыл бұрын
I'm glad you're enjoying it! :-)
@steniorj
@steniorj Жыл бұрын
Great video. Thank you! I have one question, though. When building a failure link for the character "c" you said we must choose the longest suffix. Why is the longest suffix in "CAT" and not in "GCG"? Does that mean that "AT" is a suffix longer than "G"?
@niemasd
@niemasd Жыл бұрын
Can you specify the timestamp? I'm not 100% sure which step you are referring to. In general, we draw a failure link from node u to whatever prefix in the tree (i.e., whatever path starting at the root) matches the longest possible suffix of u (i.e., the longest possible path ending at u).
@steniorj
@steniorj Жыл бұрын
@@niemasd I get it now. Thank you my friend!
@widezenico
@widezenico 2 жыл бұрын
great!, this video is very fun :D
@BlastinRope
@BlastinRope 3 жыл бұрын
Good video and succint presentation
@rc_woshimao957
@rc_woshimao957 Ай бұрын
PROFESSOR NIEMA LE GOAT
@zhongyuanchen8424
@zhongyuanchen8424 Жыл бұрын
what is the time complexity for building the automaton?
@niemasd
@niemasd Жыл бұрын
Using a clever algorithm, we can construct an Aho-Corasick Automaton in O(mk). These slides might be helpful (starting from Slide 60 for the efficient Aho-Corasick Automaton building algorithm): web.stanford.edu/class/archive/cs/cs166/cs166.1166/lectures/02/Small02.pdf Note that "m" in the slides equates to our "n" (the length of the long string), and "n" in the slides equates to our "mk" (the total number of characters in all of our short strings)
@zhongyuanchen8424
@zhongyuanchen8424 Жыл бұрын
@@niemasd Thank you so much for taking your time to find me the reference. You are truly incredible. Much appreciated.😀
@utkarshkathuria2931
@utkarshkathuria2931 3 жыл бұрын
Is there a rule like only the patterns whose last letter is the leaf node, can be said to be detected from the given input string?
@niemasd
@niemasd 3 жыл бұрын
No, it is not necessary for the last letter of every word in our database to correspond to a leaf node: it is possible for words to end at internal nodes, and we will still be able to detect them (there will be examples of this in the later videos on this topic). It just so happens that this specific example only has words that end at leaves
@human-011
@human-011 Жыл бұрын
I cracked it, its om 🕉 😂
@anushalihala6404
@anushalihala6404 3 жыл бұрын
Great explanation! What happens with the database contains words of different lengths? for instance if 'a' was in our database but after 'cat' we took the failure link to 'at' skipping 'a'
@niemasd
@niemasd 3 жыл бұрын
Great insight! There would be a "dictionary link" from the node ROOT->C->A to the node ROOT->A, so as you scan the word "CAT", you would start at the root, traverse the C edge, traverse the A edge, see there's a dictionary link (and output the word corresponding to the dictionary link), and then continue down the T edge
@manpreetsingh4106
@manpreetsingh4106 4 жыл бұрын
This is very helpful
@ayushbachan6113
@ayushbachan6113 3 жыл бұрын
great explanation
@shahrozsaleem3471
@shahrozsaleem3471 11 ай бұрын
How will this automaton help in the following scenario? I have keywords to search are: - GATEWAYS - ATE I want to search them in the following text: “The gateway is long” Here when we will be at vertex from an edge ‘y’ then we will not have any child so we will go through failure link to root. But during this we missed that ATE existed.
@niemasd
@niemasd 11 ай бұрын
Excellent question! This is actually described in the next video of the series. Specifically, we will be able to catch those cases via "dictionary links": kzbin.info/www/bejne/hXeuqYp8mtySgpI
@KanagaveluSugumar
@KanagaveluSugumar 4 жыл бұрын
Excellent!! Thank you!!!
@richardharris9708
@richardharris9708 3 жыл бұрын
Good explanation but would be good to mention the dictionary links too!
@niemasd
@niemasd 3 жыл бұрын
Great insight! They're discussed in the next video in the series :-) kzbin.info/www/bejne/hXeuqYp8mtySgpI
@idan307
@idan307 3 жыл бұрын
How would you cover a case like a string of "mnopqr", and strings to find: "mnop" and "no"? what are the failure links gonna look like?
@niemasd
@niemasd 3 жыл бұрын
If the only words in the Aho-Corasick Automaton are "mnop" and "no", there is a failure link from "mn" to "n" and another failure link from "mno" to "no"
@calebdecoux2676
@calebdecoux2676 3 жыл бұрын
@@niemasd With your video example, we will just match "mnop" and miss "no". Do we just have to check the failure node every time to see if it's also a word node? In video example it doesn't cover this situation where a word is matched while you're traversing in a different path.
@drunken_parrot
@drunken_parrot Жыл бұрын
thank you so much
@julianferres
@julianferres 4 жыл бұрын
Amazing and clear explanation. Good Job!
@amitavamozumder73
@amitavamozumder73 3 жыл бұрын
what if there's no g in teh words, but i got a g in the stream, should i keep checking failure links till i hit the root and discard that letter?
@niemasd
@niemasd 3 жыл бұрын
Exactly: you would keep following failure links until you hit the root, then you would follow the root's failure link, and just discard that letter
@amitavamozumder73
@amitavamozumder73 3 жыл бұрын
@@niemasd thanks, one more question, if i am finding subsequece instead of substrings and i find a non existent character, i should just stay in last failure link right?
@rahulbawa3969
@rahulbawa3969 4 жыл бұрын
Great explanation. Can you please tell what is this black board that you used in this video ?
@niemasd
@niemasd 4 жыл бұрын
Sure! Here's a complete guide of how I created these videos: github.com/niemasd/teaching/blob/master/Tutorials/VideosKhanAcademy.md
@rahulbawa3969
@rahulbawa3969 4 жыл бұрын
@@niemasd Wow! Thanks for a quick reply. Really appreciate it.
@codejunky9831
@codejunky9831 Жыл бұрын
wow so neat :)
@Jacked_Nerd
@Jacked_Nerd 4 жыл бұрын
Thank you
@sovanraksa2112
@sovanraksa2112 4 жыл бұрын
Absolutely awesome👏 keep it up bro
@chahatagrawal1441
@chahatagrawal1441 4 жыл бұрын
your explanation is awesome !!!
@shameekagarwal4872
@shameekagarwal4872 4 жыл бұрын
that was really good... a 3 minute pseudo code walkthrough for failure links would have been good though
@somiljain7670
@somiljain7670 4 жыл бұрын
Keep up the good work.
@alamjim6117
@alamjim6117 4 жыл бұрын
Thanks It was great explanation
@ShubhamKumar-qz6yz
@ShubhamKumar-qz6yz 4 жыл бұрын
I am confused, ACC and GCG also exist right? But we didn't find those while going through the main word
@niemasd
@niemasd 4 жыл бұрын
I'm not sure what you mean by "also exist": ACC and GCG exist in our database of short sequences, but remember that we *only* care about short sequences that occur in our long query sequence. Because our long query sequence didn't contain ACC and GCG, neither were found during our search (which is the behavior we want)
@ShubhamKumar-qz6yz
@ShubhamKumar-qz6yz 4 жыл бұрын
@@niemasd ooh I see. Thank you for the explanation. Makes a lot more sense now. 🙏🙏
@ketan_sahu
@ketan_sahu 4 жыл бұрын
Amazingly explained. Thankyou.
@ИванЛитвинов-н8ц
@ИванЛитвинов-н8ц 4 жыл бұрын
Thank you very much man
@jessenguyen5305
@jessenguyen5305 4 жыл бұрын
very understandable
Advanced Data Structures: Aho-Corasick Dictionary Links
5:02
Niema Moshiri
Рет қаралды 27 М.
16. Strings
1:24:30
MIT OpenCourseWare
Рет қаралды 126 М.
If people acted like cats 🙀😹 LeoNata family #shorts
00:22
LeoNata Family
Рет қаралды 41 МЛН
Smart Sigma Kid #funny #sigma
00:33
CRAZY GREAPA
Рет қаралды 36 МЛН
Ctrl+F on steroids - Aho-Corasick Algorithm (pt. 1)
13:54
ComputerBread
Рет қаралды 984
Advanced Data Structures: Suffix Array Search
12:33
Niema Moshiri
Рет қаралды 16 М.
8 Data Structures Every Programmer Should Know
17:09
ForrestKnight
Рет қаралды 198 М.
Advanced Data Structures: Burrows-Wheeler Transform (BWT)
7:26
Niema Moshiri
Рет қаралды 35 М.
Advanced Data Structures: K-D Trees
8:47
Niema Moshiri
Рет қаралды 42 М.
Understanding B-Trees: The Data Structure Behind Modern Databases
12:39
Knuth-Morris-Pratt algorithm
10:00
SpookyAlgorithms
Рет қаралды 153 М.
АиСД. Ахо-Корасик от Никиты.
45:19
Дискреточка и алгосики
Рет қаралды 822
2. Алгоритм Ахо-Корасик
1:27:20
Computer Science Center
Рет қаралды 3,3 М.
If people acted like cats 🙀😹 LeoNata family #shorts
00:22
LeoNata Family
Рет қаралды 41 МЛН