What Are Bloom Filters?

  121,576 views

Spanning Tree

1 day ago

Why are Bloom filters such useful data structures? How do they work, and what do they do? This video is an introduction to the Bloom filter data structure: we'll explore what Bloom filters are, how they work, and build an understanding of why they're so efficient.
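For readers who want to see the idea from the video in code, here is a minimal Python sketch (hypothetical class and parameter names, not code from the video): a Bloom filter is just a bit array plus k hash functions, giving no false negatives and a tunable false-positive rate.

```python
# Minimal Bloom filter sketch. The k hash functions are simulated
# by salting a single SHA-256 digest with an index.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive k positions per key by salting the hash input.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means only *probably* present.
        return all(self.bits[pos] for pos in self._positions(key))
```

A `might_contain` result of False guarantees the key was never added; True can occasionally be a false positive when other keys happen to have set the same bits.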
***
Spanning Tree is an educational video series about computer science and mathematics. See more at spanningtree.me
To be notified when a new video is released, sign up for the Spanning Tree mailing list at spanningtree.s...
Spanning Tree is created by Brian Yu. brianyu.me/
Email me at brian@spanningtree.me to suggest a future topic.

Comments: 236
@randomyoutuber4500 • 1 year ago
For some people, this may look crazy and inefficient, but the bookstore is just an example. The Bloom filter algorithm is often used for anti-virus scanning, malicious-URL detection, etc., where speed is an important factor. The real situation is like having billions and billions of books in the bookshop with a new customer coming in every millisecond: storing and retrieving data is the process that slows computers down, while running the algorithm costs 'almost' no time.
@travcollier • 1 year ago
Bloom filters are also used a lot in genomics. The "bookstore" is all the different sequences of a given length (we call them kmers) in the sequencing data.
@jamesdaus • 4 years ago
If you re-hashed the final copy and flipped all the bits off, you'd risk marking a book that is in stock as not in stock because of an overlap. My first thought is to hold a counter on each number (so, for example, #221 would count how many times it's been flipped on), and when you take out a book, you drop it down by 1. This allows two books to overlap and one to be removed without a problem, but maybe this is a bit memory-excessive? Is there a better known solution?
@SpanningTree • 4 years ago
Exactly the right idea! This is what's known as a "Counting Bloom filter."
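A sketch of the counting Bloom filter idea confirmed in this thread (hypothetical class, not code from the video): each cell holds a small counter instead of a single bit, so removing the last copy of a book leaves overlapping books intact.

```python
# Counting Bloom filter sketch: increment counters on add,
# decrement on remove. A shared cell stays nonzero as long as
# any remaining key maps to it.
import hashlib

class CountingBloomFilter:
    def __init__(self, num_cells=1024, num_hashes=3):
        self.num_cells = num_cells
        self.num_hashes = num_hashes
        self.counts = [0] * num_cells

    def _positions(self, key):
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_cells

    def add(self, key):
        for pos in self._positions(key):
            self.counts[pos] += 1

    def remove(self, key):
        # Caller must only remove keys that were actually added,
        # or counters can go negative.
        for pos in self._positions(key):
            self.counts[pos] -= 1

    def might_contain(self, key):
        return all(self.counts[pos] > 0 for pos in self._positions(key))
```

The trade-off raised in the replies below is real: each cell now needs several bits instead of one, so the structure is larger than a plain Bloom filter.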
@ashwinalagiri-rajan1180 • 2 years ago
This is called a Count Min Sketch
@jeremycohn9992 • 2 years ago
What is meant by overlap? Do you just mean a collision?
@jamesdaus • 2 years ago
@@jeremycohn9992 Two books sharing at least one bit, so yes a collision
@leonm8906 • 1 year ago
@@SpanningTree How would you handle an overflow? If you only use 8 bits per "bit", that might overflow rather quickly, but using 32 bits would be less space-efficient, so you'd have way more cache misses. On a database server I think this would matter a lot, right? Since you could have thousands of search queries per minute, if not many more.
@mazharmumbaiwala7451 • 1 year ago
It's a shame you're not making videos any more; you are an amazing creator. Hope to see more new content someday soon.
@AidenC. • 1 year ago
New video 2 months ago
@rhysbaker2595 • 1 year ago
Looks like the channel is starting to pick up some attention. If he keeps posting semi-regularly, I can see this being a pretty successful channel
@jaywebster624 • 1 year ago
is that like 5 euros in EU
@vishal-shinde • 4 months ago
@@jaywebster624 It's 0.44 euros, I think. He donated in Indian rupees.
@pratikkarangle • 2 years ago
Why did you stop making videos? You are just too good at explaining; please keep posting. I love your narration, animation, basically everything about your videos.
@Galakyllz • 1 year ago
I wish he would keep making these excellent videos. They're great!
@RandomPerson-id5rb • 1 year ago
well
@rhysbaker2595 • 1 year ago
I don't think he has stopped, these sorts of videos can take a very long time to produce, especially if you have another job or other commitments
@erner_wisal • 1 year ago
@@LOLWHATBRO It's been 3 months since the last vid
@ananttiwari1337 • 1 year ago
@@LOLWHATBRO His video about Propositional logic is 3 months old
@grapetoad6595 • 1 year ago
I came into this video thinking it was going to be about the graphical processing of bloom, I was then pleasantly surprised by the actual content.
@carlosdelamorenacoco8715 • 6 months ago
This was the clearest and simplest explanation of such a complicated concept. Hats off!
@Lehander100 • 3 years ago
You've just seen the most underrated explanation.
@_goldfish • 1 year ago
what
@mikezhao1838 • 2 years ago
This is such a fantastic illustration of the Bloom filter. I instantly got the idea after watching the nice robot moving around (or not). Also, the follow-up question is interesting. Sad that there have been no new videos recently, but keep up the good work!
@themannyzaur • 1 month ago
Went back to the Wikipedia article and now things are clicking. This is such a fantastic explanation!
@firebraingames • 3 years ago
Great channel, good animations, good narration. Simply explained.
@sanjaymatsuda4504 • 1 year ago
My first thought is to make a "negative" bloom filter. That is, keep the normal bloom filter, but also have a second filter whose bits you'll only turn on when the last copy of a book is removed. When someone asks for a book, you'd first check the negative filter: if all the bits corresponding to that book are turned on, that means you don't have the book. Otherwise, proceed to look at the normal filter: if all the bits corresponding to that book are turned on, that means you have the book. Though of course, if selling out a book is a frequent occurrence at the store, there will be collisions in the negative filter, causing false negatives, where you end up claiming to not have a book that you in fact do have.
@psychopompous489 • 1 year ago
But what if you want to re-add a book that's previously been removed?
@sanjaymatsuda4504 • 1 year ago
@@psychopompous489 That would be the time to completely reset both filters and repopulate the normal bloom filter with the hashes of all the books you have, plus the re-added book. Either that, or have a "double-negative" bloom filter where you turn on only the bits corresponding to books that are back in stock after having been sold out previously.
@benjackson7690 • 1 year ago
@@sanjaymatsuda4504 That wouldn't be a good idea. We can't just keep adding Bloom filters, because at a certain point we do need a way to remove an item from the Bloom filter. Also, we can't repopulate the Bloom filter, because it would be incredibly inefficient to do that every time we sold out of a book. We need some way to reset the bits only if no other books are triggering them, but without having to search through our entire catalogue of books.
@user-jn4sw3iw4h • 1 year ago
@@benjackson7690 Usually true, and indeed not the solution hinted at in the video. However, it's not universal; there's a trade-off in play here. For example: when will you be re-adding a book? How big is the collection? Which errors are acceptable? If we take the bookstore example from the video: there is usually a single, specific moment when books are added to the collection, and it usually isn't during opening hours. If your collection is small enough (or the hash fast enough, which it should be; speed is the third thing a hash is optimized for, after 'the same input _must_ give the same output' and 'different inputs _should_ give different outputs'), then scheduling a re-index session after taking in new inventory is an option. It's not faster than the 'counting Bloom filter' solution that was the intended answer, but if the count can exceed 3 it is smaller. As for which errors are acceptable: with this solution, a returned book won't show up in the system until tomorrow's re-indexing (likely a rare enough occurrence that keeping a full list of those at the desk is viable: check a list of 3 books while waiting for the computer to run 5 hash functions). With the original, errors caused by stolen books or other administrative mistakes linger forever, while this one resolves them at every delivery.
@nguyenduy-sb4ue • 2 years ago
OMG, arguably the best explanation of a Bloom filter I have ever come across.
@wissemaljazairi • 1 year ago
If we replace the true/false values in the Bloom filter structure with a counter that is incremented whenever a new entry is added to the filter and decremented whenever an existing element is removed, the counter returns to 0 when we remove the last book. 0 is equivalent to false, and > 0 is equivalent to true.
@jolie1543 • 1 year ago
that takes much more space tho
@clayupton7439 • 1 year ago
@@jolie1543 Use a smaller data type, or build your own custom data type by stacking Bloom filters until the stack represents the maximum count in base 2. E.g., 3 stacked Bloom filters can represent a maximum of 7 duplicates: 001 010 011 100 101 110 111
@AntonioNoack • 1 year ago
@@jolie1543 You could use saturating addition, with maybe 2-4 bits for each entry. If an entry is deleted while its counter is saturated (the hope would be that this occurs very rarely), you'd have to rebuild the whole index.
@localidiot4078 • 1 year ago
@@jolie1543 Not really. A computer associates bools with 8 bits because that's the smallest piece of data it can work with. Also, off the top of my head, I think the Bloom filter size just needs to grow with the size of the array to the root of however many hash functions you have, which can be arbitrarily large.
@localidiot4078 • 1 year ago
@@AntonioNoack I don't think you can reconstruct this data structure; information is lost.
@starman4572 • 1 year ago
Green Robot: "Can I get this book please?" Blue Robot: "We're all out." Green: "Can you check?" Blue: "...no."
@raz0229 • 3 years ago
Best explanation so far for beginners to understand. Subbed!
@GoddamnAxl • 2 years ago
How about requiring only a percentage of bits to be on for a positive? Then when you delete an item you just turn all of its bits off. Also, when someone comes in and you find a true positive, you turn on all its corresponding bits to make up for any mistakenly turned-off bits. The upside of this method is its minimal memory footprint; the downside is requiring more hash functions (more CPU). Also, it's possible that there can be false negatives. How do you improve this?
@nkugwamarkwilliam8878 • 1 year ago
This is one of the most underrated channels on YouTube. Your content is amazing.
@hamzahal-qadasi1771 • 2 years ago
One of the best explanations for the Bloom filter.
@quemaspana • 2 months ago
This is a really great explanation.
@styrex4261 • 1 year ago
I work at autozone and always wondered how their inventory was so fast, considering the size of it.
@heap-sort • 1 year ago
The best explanation I have ever seen. Thank you.
@sammokkabasi8957 • 3 years ago
I'd use three different arrays, one for each hash function, to make the bloom filter better.
@valoeghese • 2 years ago
Uses more memory, but definitely reduces the chance of a collision
@localidiot4078 • 1 year ago
@@valoeghese There is probably some law in statistics that explains this in math terms, but I think that's equivalent to increasing the array size by 3, or some constant. It's not quite true, but splitting into 3 arrays doesn't actually improve much because of the birthday paradox, which is what you're really fighting.
@zg7897 • 2 years ago
Best video explaining Bloom filters!!!
@iwersonsch5131 • 1 year ago
Instead of a single bit, you can assign each element in your bloom filter a variable ranging from 0 to 255. When you add a book, you add 1 to each associated variable, and when you sell the last copy of a book, you subtract 1. This uses 8x as much data but scales almost infinitely.
@iwersonsch5131 • 1 year ago
If a value is at 255, tough luck, you can't add or subtract from that variable anymore. Let's hope that never ends up leading you to run to the shelf needlessly
@SenoritaTJ • 6 months ago
Thanks for explaining it in this fun and easy way. The funny thing is I just finished my second week in a book store as a sales person.
@dhivyashri6554 • 2 years ago
I have my finals tomorrow, and because of you I'll at least know enough to BS. Thank you so much!
@danielmelville1729 • 11 months ago
Great explanation, I appreciate the help :)
@shreyash245 • 6 months ago
Thank you ☺ it was the clearest explanation.
@JB-ev8lt • 2 years ago
Very well explained and enjoyable animation to watch!
@flargensnirkel • 1 year ago
Yay going over inserting a book title into 5 giant machines that I have to stand on a stool to get to and counting a series of hexagons to make sure all five of the numbers are turned on will be so much faster than pressing ctrl+h on a google doc
@mjcox242 • 1 year ago
The problem I see with one Bloom table and multiple hash functions is that it will eventually increase collisions. Say you had 100 bits and 50 hash functions: the Bloom filter doesn't care about order, so within the space of 5 books there's a good chance you've filled up the filter.
@localidiot4078 • 1 year ago
I agree, however if you have a big enough space this isn't a problem. You could just add a constant multiple to the size of the array to offset this. Perhaps just the number of hash functions + 1
@benjackson7690 • 1 year ago
You're right that more hash functions causes the filter to fill up faster, but that's why it's important for you to build a filter large enough. When calculating how large a filter should be, we need to consider both how many items we are storing, as well as how much space they take up on the filter. This is entirely up to the programmer, although there may be some suggested algorithms to determine size
@GoodLuck-dv2zu • 4 months ago
Amazing explanation. Appreciate your effort!
@damamoot2291 • 1 year ago
The problem would be that removing bits when the last copy of a book is sold means that if any other books shared any of the removed bits, your check would make it seem like you don't have those books, even though you do. My idea was to count how many times each bit was activated: if you get a new book and one of its bits is already activated, you just record that it's the second time activating that bit, and when a book sells out you reduce the counter for each of that book's bits by 1. Then bits shared by multiple books will still be on after one of those books sells out, and you will still keep track of the other book(s). (I checked other comments and they seem to confirm that this is a valid solution.) P.S.: I know my explanations suck, I am autistic, lol.
@arif3012 • 2 years ago
The best explanation so far 1:40 onwards
@KelvinShadewing • 1 year ago
Did anyone else get recommended this and thought it had to do with graphics?
@NathanHedglin • 1 year ago
@deepaksoundappan3244 • 2 years ago
Beautiful work 👍👍👍 great effort went into making this video.
@anthonyto3 • 2 years ago
Fantastic explanation and animation. Thank you so much!
@lastmomenttuitions2.067 • 9 months ago
Amazing Explanation
@zc4060 • 1 year ago
I think increasing the number of hash functions may actually increase the chance of collision? Assume you have an infinite number of hash functions; then adding one "book title" to your bit grid would in fact set all bits to one. The next time you look up another "book title", you will always get a positive, whether or not it's on the shelf.
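The standard false-positive approximation backs up this intuition: for m bits holding n items, the rate p ≈ (1 - e^(-kn/m))^k falls as k grows, reaches a minimum near k ≈ (m/n)·ln 2, and then rises again as the extra hashes fill the array. A small sketch (the parameter values are illustrative, not from the video):

```python
# False-positive rate of an m-bit Bloom filter holding n items
# with k hash functions, using the standard approximation.
import math

def false_positive_rate(m_bits, n_items, k_hashes):
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# 10,000 bits holding 1,000 items: about 10 bits per item.
m, n = 10_000, 1_000
rates = {k: false_positive_rate(m, n, k) for k in (1, 7, 50)}
# The rate drops from k=1 to the optimum near k = (m/n)*ln 2 ≈ 7,
# then rises again at k=50, as this comment predicts at the extreme.
```

So more hash functions help only up to a point; past the optimum they do exactly what the comment describes.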
@derekhu2678 • 1 year ago
Hi Brian, love your videos! Keep it up!
@cmyk8964 • 1 year ago
You would ideally feed in the book's ISBN instead of something prone to human error or variance like a title, because a hash function will return a completely different result for "Hamlet" vs. "hamlet" vs. "Shakespeare's Hamlet", for example.
@benjackson7690 • 1 year ago
The bookstore was more of an example than a real use case. If it were real, you could still probably use the book's title, but you would of course have to implement some cleanup of the user input.
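A tiny sketch of this exact-match point (the normalization step is a hypothetical illustration, not something from the video): hash lookups only work if the key is byte-for-byte identical to what was inserted, so raw titles need cleaning first.

```python
# Hash lookups are exact-match: "Hamlet" and " hamlet " produce
# different digests. A hypothetical normalization step (lowercase,
# collapse whitespace) makes variant spellings hash alike.
import hashlib

def normalize(title):
    return " ".join(title.lower().split())

def bucket(key, num_bits=1024):
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_bits

# Normalized variants are guaranteed to land on the same bit.
same = bucket(normalize("  Hamlet ")) == bucket(normalize("HAMLET"))
```

Using a canonical identifier like an ISBN, as the parent comment suggests, sidesteps the problem entirely.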
@lare290 • 1 year ago
Wouldn't increasing the number of hash functions eventually lead to _more_ false positives? Imagine the extreme case where you have as many hash functions as you have bits in the filter, and each function gives a different output: one title would turn on every bit in the filter, subsequently making _every_ book come up positive.
@ccflan • 1 year ago
@@leeroyjenkins0 I was thinking along the same lines, so this means you should have an idea of how large your library will grow before calculating your filter's bit size, unless you dynamically resize the table when needed.
@user-jn4sw3iw4h • 1 year ago
5:00 This is the point where trade-offs come into play. Which costs more time: performing 5 hashes (instead of 3) for every incoming request, or going into the back to 'fail to get' the book once in approximately every 10,000 requests?
@gregsak707 • 3 years ago
Great video, thank you!
@danielkaczmarczyk2482 • 3 years ago
Great channel - love your videos.
@akulaharshitha1714 • 2 years ago
best explanation
@mohamedadnanealkhiati5006 • 4 years ago
I don't understand why we would actually need this algorithm. I mean, if we were to code this, why wouldn't we just store the books instead of bits? Aren't we just adding more steps? Doesn't this just require a simple hashtable that points to books? I think I am missing something, but I don't know what exactly. Thank you for these amazing videos, Brian. I am a fan! edit: Or is it for when we already have a data structure not implemented as a hash table, but we still want to retrieve data efficiently?
@SpanningTree • 4 years ago
The motivation here is memory efficiency - you certainly could just store all the books themselves, but that would take far more memory than storing the bits needed for the bloom filter. For storing large amounts of data, this memory efficiency can matter quite a lot!
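A rough sizing sketch behind this memory-efficiency point, using the standard Bloom filter sizing rule m ≈ -n·ln(p)/(ln 2)², with illustrative numbers (not from the video):

```python
# Bits needed to hold n items at a target false-positive rate p.
import math

def bloom_bits(n_items, fp_rate):
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))

# A million titles at a 1% false-positive rate need ~9.6 bits per
# item: roughly 1.2 MB in total, versus tens of bytes per title
# (plus pointer overhead) to store the strings themselves.
bits = bloom_bits(1_000_000, 0.01)
```

This is why the memory savings matter at scale even though storing the raw data is conceptually simpler.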
@misiekeloo6114 • 4 years ago
@@SpanningTree medium.com/analytics-vidhya/cbfs-44c66b1b4a78 Also this article helped me to understand the idea of avoiding collisions during removing (the task at the end of the video).
@arpit-jain • 1 year ago
Even I have the same question. Memory is so cheap and compact nowadays. Even if we save 1 million books in a hashtable or other data structure, the memory is still affordable, and searching through a million records with today's processors' computing power won't take more than a second.
@randomyoutuber4500 • 1 year ago
@@arpit-jain The bookstore is just an example. The Bloom filter algorithm is often used for anti-virus scanning, malicious-URL detection, etc., where speed is an important factor. The real situation is like having billions and billions of books in the bookshop with a new customer coming in every millisecond: storing and retrieving data is the process that slows computers down, while running the algorithm costs 'almost' no time.
@Llorx • 1 year ago
Great video. Still this increases the amount of branching, which is also slow. Benchmarks are needed for each situation.
@youtubeuniversity3638 • 1 year ago
5:00 Could also make separate blooms for each hash perhaps.
@alex_zetsu • 3 months ago
Is there a difference between using 1 hash function that maps to a space N^3 (say 1,000,000,000) and using 3 hash functions that map to space N (say 1,000)?
@Tooley641 • 3 years ago
Well done!
@anuragbisht1200 • 1 year ago
Important point: the Bloom filter is a probabilistic data structure. There can be false positives but no false negatives.
@sebbes333 • 1 year ago
4:20 What if instead of treating those values as 3 different numbers, instead you treat them as coordinates in a 3D space?
@denzelchukwuebukaachonu • 1 year ago
Yes! That would further decrease the chances of a collision, exponentially. It still wouldn't eliminate them, but my would it come close! This is brilliant.
@localidiot4078 • 1 year ago
@@denzelchukwuebukaachonu I disagree; the core tradeoff between the hash functions and collisions is space. If you have enough space you can get a 0% chance of a collision; if you have no space, every function can collide. The encoding of the data doesn't matter: you can interpret any 3D coordinate system of finite size as a single number, X + Y*maxX + Z*maxX*maxY. In effect you're just cubing the space requirements of the data structure.
@benjackson7690 • 1 year ago
This doesn't really solve anything, though. It's the same as just using a larger filter. In the end, you're still treating them as three different numbers.
@DeannaEarley • 1 year ago
How does this differ from, say, a Perl hash, a PHP associative array, a dictionary, etc.? Is it the same lookup, but with no actual data beyond the yes/no?
@noyz-anything • 1 year ago
I'd use one bloom filter for each hash function, rather than just having the singular grid. that way, there's even less of an overlap risk
@ultralowspekken • 1 year ago
Wouldn't using multiple hashes become redundant at some point? Say, for example, you have book A that uses bits 1 and 2, book B that uses bits 1 and 3, and book C that uses bits 2 and 3. If books B and C are present, then book A would give a false positive, something that would hardly have happened if only one hash were used. So there is a correct ratio of hashes per book somewhere. I wonder if a solution might be using hashes for different aspects of a book rather than the title only: a hash for the book, a hash for the author, a hash for the publisher, a hash for the language, and so on. It would reduce the risk of same-titled books, as there are fewer authors than books, fewer publishers than authors, fewer languages than publishers, etc. It would also require different bit tables, but we're trying to reduce the risk of a false positive here.
@toshevislombek • 3 months ago
5:55 I'd create a new Bloom filter for deleted items and check that deleted-items BF on lookup; if it is negative, I'd assume the item has not been deleted yet.
@theresapan6128 • 1 year ago
Thanks!
@SuperMichael996 • 1 year ago
Excel + Ctrl+F, problem solved ;) Really though, why resort to oversimplifying it into bits when you can just store a table (or even a text document) with the names of the books, the number of the shelf they're on, the number of the row, and the position of the book on that shelf (i.e. 3 for the third book from left to right)? Given our technology nowadays, I don't think the cost difference is noticeable.
@rokaq5163 • 1 year ago
Seeing English speakers call it "Don Quixote" always cracks me up. It's way more convoluted than the original name for no apparent reason other than translation errors. It's "Don Quijote", and it's pronounced like so: "don key hoh teh". The first "h" in "hoh" should be exaggerated, as if you had something stuck in your throat and were trying to spit it out. That's the Spanish sound associated with the letter "j".
@noviian3092 • 1 year ago
without audio, it seems as if two people are playing a board game but one player is trying to cheat with an increasingly ridiculous amount of machines
@porple4430 • 1 year ago
How would this be more efficient than simply looking up the book? Isn’t the bloom filter just another way of organizing the books, because you have one hash value for every book (disregarding overlaps)?
@lennarthorney3466 • 3 years ago
best video :infinity:)
@argothiel • 1 year ago
Does using multiple hash functions really improve the situation? It's true that you have three times as many checks as in the original example to prevent a collision, but at the same time you fill up three times as many bits.
@pies765 • 1 year ago
You already allocated a block of size n to fit the entire set of possible output values of the hash function. So as long as that size doesn't change as you add functions, you haven't taken up any more space
@nightfox6738 • 1 year ago
It would take a bit more memory, but changing each bit in the Bloom filter to a count would solve the problem: each time you add a book you increment all the indices given by the hash functions, and you decrement them when the book is removed. This would, however, produce a higher count when multiple copies of the same book are added, which could cause scalability issues: now you're not just counting collisions (which ideally are very minimal) but every copy of any book whose hash functions produce that index. This number could get quite large, requiring more and more bits to store, so while it's better than the list of books at the front desk, it's still not ideal. Another option would be to recalculate the filter every time you sell a book, but as you can imagine that has quite some overhead. Ideally you'd combine both: only increment the filter if you don't already have the book, so you're tracking collisions rather than copies of the same book; then, when you sell a book, check how many copies you have (they should all be in the same place), and if you sell the last one, decrement the indices in the Bloom filter. I think this is the best solution, as in most cases it should not require more than one byte per index, and, much as in many hash map implementations, you can dynamically resize the Bloom filter to minimize collisions.
@benjackson7690 • 1 year ago
I had essentially the same thought about using integers rather than bits, but you seem to have already seen the downside. I believe in C++ at least, ints are 8 bits, which would then increase the size of your bloom filter 8x, which could be incredibly expensive depending on how large it is. This also has a downside of introducing a maximum number of books, because a int can only count so high before it runs out of space. You also mention recalculating the filter when you sell a book, but that would be incredibly expensive, as every time you removed a book, you would have to search through your entire storage again. If you had to do that, you might as well have not had a filter.
@nightfox6738 • 1 year ago
@@benjackson7690 In C++ ints are usually 32 bits, but you can use char for 8 bits. It really depends on your expectations, though. Do you foresee ever having over 255 copies of the same book in your store? If not, this would work. You mention that it would require 8x the size, and that's true, but multiplying by a constant isn't a huge issue (similar to runtime complexity: O(n) is basically equivalent to O(8n)). If you're running out of space with n books, it's conceivable that it won't be long before you run out of space with 8n.
@cupcakeapplesauce • 1 year ago
Would it be useless to have a one-to-one hash function? I'm guessing having as many bits as there are books is equivalent to having a list with all the names.
@larakayaalp1676 • 1 year ago
Isn't this just equality-optimized indexing? It's still O(n) to determine non-existence; you could instead categorize the elements to make it faster. If you have text you "sort" it (sorting is expensive, so you'd just have a different list for each text's first character), which I think is just a b-tree.
@epimolophant • 1 year ago
I have the feeling that searching through an index would be a lot faster than using multiple hash functions for every query
@gamekyuubi42069 • 1 year ago
I thought this too. He should have explained why this is faster than a list of books, because for a human it's not that much better, but for a computer it might be a world of difference.
@benjackson7690 • 1 year ago
A hash function is a very quick, usually one-step math operation; it takes almost no time to compute. On the other hand, searching a list takes time that depends on the size of the list and how it is sorted: the fastest possible search takes log(n) time, where n is the number of items in the list, so it would take log(n) time to confirm a book is not in our library. Computing hashes, by contrast, takes constant time regardless of the number of items in our library, so we can rule out a book without searching for it.
@Lampe2020 • 1 year ago
If the set of books isn't TOO large, I would simply not remove the out-of-stock book from the filter, accept the few false positives, and at regular intervals (provided the set of books has changed since the last reindex) simply reindex them.
@hwstar9416 • 11 months ago
how is this different from a bitmap index?
@fishum6483 • 1 year ago
"Look at this, your piece of paper is spanning the entire desk, not very space efficient, better stick 3 1x1x3 metre machines in the room and fill half your desk with a hex grid"
@ThatJay283 • 1 year ago
If you used counters instead of booleans, could removing be done? Like, when the first copy is added, add 1 to each counter; when the last copy is removed, subtract 1 from each counter.
@lancemckenzie1074 • 1 year ago
Numbers are bits, mate, just fancier bits. At the end of the day, it has to be expressed as a number of bits. If you want to count up to eight uses of each array bit (i.e. eight book titles result in hash codes that use that bit), then you need four bits to represent "0", "1", "2", "3"... and "8".
@lancemckenzie1074 • 1 year ago
Suppose that'd be zero through seven...
@ThatJay283 • 1 year ago
@@lancemckenzie1074 well yeah lol, numbers are just groups of bits. i should've just said "counters instead of booleans".
@benjackson7690 • 1 year ago
That would be a solution, but it has 2 main problems. The first is that it introduces a maximum number of times each bit can be added to (the integer limit). This problem is unlikely to occur unless you are using extremely massive arrays, so we can probably ignore it. The second problem is that you now have an array 8 times the size of the original. It retains relatively the same speed, but it is now a lot larger, which can be an issue.
@naveenda2064 • 3 years ago
Great, keep it up
@Jellyfishhie • 1 year ago
Wouldn't this approach require that your search term be exactly the same as what was used to generate the bloom filter hashes?
@1vader • 1 year ago
Yes. Ultimately, the bookstore is only an analogy. A real-world use case could be something like checking whether a person has liked a video, where the "search value" would be the user ID, which is of course exact. Or you could, for example, assume we search by ISBN instead of book title.
@xxqq5719 • 4 months ago
Before Amazon, bookshops were a nightmare to deal with...
@weirong1159 • 1 year ago
When we use multiple hashing methods, wouldn't that increase the odds of collision if we already have a lot of books in our library? I was thinking that we could have multiple Bloom filters and check them instead? That would be less space-efficient, though.
@alessandrobalducci4961 • 1 year ago
When you have multiple hash functions, it's true that each hash function could have a collision. However, for the bloom filter to give a false positive, every hash function would need to collide, which is much less likely. If there was a collision on only a single hash function, you would notice (not all the bits corresponding to your value would be flipped). I saw this data structure (or actually a similar one) in one of my uni courses but didn't remember what it was called; I guess now I know it was called a bloom filter 😂
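The "every hash function must collide" point can be seen in a minimal sketch. The salted-hash construction, array size, and titles are illustrative assumptions, not from the video:

```python
import hashlib

SIZE, NUM_HASHES = 100, 3
bits = [False] * SIZE

def indexes(title):
    # Simulate several hash functions by salting one hash with an index.
    return [int(hashlib.sha256(f"{i}:{title}".encode()).hexdigest(), 16) % SIZE
            for i in range(NUM_HASHES)]

def add(title):
    for idx in indexes(title):
        bits[idx] = True

def maybe_in_stock(title):
    # A single unset bit is a definite "no"; a false positive requires
    # ALL of the title's positions to have been set by other titles.
    return all(bits[idx] for idx in indexes(title))

add("Moby Dick")
add("Dracula")
print(maybe_in_stock("Moby Dick"))  # True
```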
@weirong1159
@weirong1159 Жыл бұрын
@@alessandrobalducci4961 thank you!
@estrawitch
@estrawitch Жыл бұрын
1. Turn off the bits.
2. Other books might share the same bits, so if you turn the bits off there will be false negatives.
3. Use one hash function that has a lower chance of collisions (quality over quantity, basically).
@travcollier
@travcollier Жыл бұрын
I'm not sure if it is part of the formal definition or not, but I've never seen Bloom filters discussed where they weren't guaranteed to never give a false negative. That's often quite important to the applications where they get used.
@ThePCguy17
@ThePCguy17 Жыл бұрын
Marking bits by which hash function they came from would help, though that would require multiple lookup tables. And it wouldn't really make removing an entry easier, just less likely to completely remove an item from your tables. And since even one missing bit is enough for the filter to report a book as absent, "less likely" doesn't help much. Honestly, the only kludge I can figure out is re-entering every book into the system after removing any book to make sure they're still recorded correctly, but that would be cumbersome enough to be impractical at best.
@coding3438
@coding3438 2 жыл бұрын
Excellent video but still wondering how we went from writing down book names on a piece of paper to suddenly having arrays of memory cells on the table!
@lancemckenzie1074
@lancemckenzie1074 Жыл бұрын
You only have to search one entry on the array - - the green bot asks for a book and identifies the hash code you need to look at. With a long paper list, you don't know what number entry has your book on it, so you need to search every one.
@theNeuo13
@theNeuo13 3 жыл бұрын
What about using the output of the hash functions as coordinates for the bit instead of the number of the bit?
@softpaw6234
@softpaw6234 Жыл бұрын
The problem with that is that, like most of this channel's content, this is supposed to be an analogue for a computer science concept, so a computer would need a separate function to find the coordinates whereas it's already built to be able to quickly find a numbered position in memory. Your suggestion would be easier for most humans though.
@USSOR-media
@USSOR-media Жыл бұрын
You could also solve this problem with a computer. A computer is an electronic device that embodies a complex and intricately interconnected network of hardware and software components, capable of processing vast amounts of data and performing complex calculations with stunning speed and accuracy. It leverages cutting-edge technologies, including microprocessors, memory chips, storage devices, and input/output interfaces, to execute a wide variety of operations, ranging from arithmetic and logical computations to multimedia processing and artificial intelligence. Its operation is governed by a sophisticated set of protocols and algorithms, which are designed to ensure optimal performance, reliability, and security. The computer's remarkable capabilities are a testament to the ingenuity of human engineering and the power of modern science, and its impact on our daily lives is nothing short of revolutionary.
@benjackson7690
@benjackson7690 Жыл бұрын
I assume your remark is being sarcastic, but just in case, this video is saying how you would do it with a computer
@peculiar-coding-endeavours
@peculiar-coding-endeavours Жыл бұрын
That's a ChatGPT-level comment if ever I saw one :-D
@kasturikakatkar
@kasturikakatkar 3 жыл бұрын
Very well explained!
@torazis3286
@torazis3286 Жыл бұрын
So it's a list of all books with extra steps
@mrmastaofdesasta6994
@mrmastaofdesasta6994 Жыл бұрын
First, I would use a 2D or 3D bloom filter, depending on the number of hash functions and then use the results of these functions as coordinates for each book. Since it is just a bit array, it shouldn't require that much space. Then you would probably get a more reliable way of uniquely identifying a book than marking multiple bits.
@benjackson7690
@benjackson7690 Жыл бұрын
All this is doing is increasing the size of the filter. It does decrease the number of false positives, but you're still going to get them. Actually, adding more hash functions is a much better way of preventing this (to a certain extent). Consider a 100 bit array. If we add 1 book using 1 hash function, another book has a 1/100 chance to land on the same spot (1%). If we add 1 book with 3 hash functions though, another book has a 3/100 chance of the first hash landing in a lit spot, a 2/99 chance of the second hash landing in a lit spot, and a 1/98 chance of the third hash landing in a lit spot. If we go ahead and multiply all of these, we get a 6.18e-6 chance of matching all 3 spots (0.0006%)
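The arithmetic in the comment above checks out under its without-replacement model (the 100-bit array and three hash functions are the comment's own numbers):

```python
# Probability that a second book's three hash positions all land on the
# three bits already set by the first book, drawing distinct positions.
p = (3 / 100) * (2 / 99) * (1 / 98)
print(f"{p:.2e}")  # 6.18e-06, matching the comment's figure
```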
@samirrimas
@samirrimas 2 жыл бұрын
Very well explained mate thank you
@lightweight1889
@lightweight1889 Жыл бұрын
You'd just be indexing your table. How's looking up hashes any more efficient than looking up strings of book titles?
@surlo69
@surlo69 Жыл бұрын
why not just have all the books in one list and their value would just be their index number instead of generating a value for them that might have already been used by another book?
@dojelnotmyrealname4018
@dojelnotmyrealname4018 Жыл бұрын
I feel like this is just using a hash function to convert human-readable titles to database IDs.

As for removing items: there are a couple of ways, with varying levels of scaling issues. The main problem is that if your own stock causes a hash collision, turning off a bit can cause you to mistakenly believe the hash is unrepresented.

The first solution is to make sure your own stock never causes a hash collision. While calculating your hash table, input an extra variable, and start over with the variable incremented every time a collision occurs. This works, but it scales horribly with set size. If your hash table is .01% filled, every datapoint has a .01% chance to cause a collision (technically 1/9999, but let's not get overprecise). The advantage of this method is that it only has to be done when a new member is added to the set. So if new members are relatively infrequent, this can be an option. But if you update the set frequently, you'll spend more time recalculating your hash table than you did fetching the data.

To reduce the amount of recalculating massively, you can commit more memory to the hash table than a single bit per member. For expedience's sake you want fixed-length blocks. Then, when calculating the hash table, instead of simply flipping the bit on, you increment the location. This way, your hash table tells you how many members use each single hash. When you run out of stock on a member, simply decrement the location again. This method uses more memory, but needs to recalculate much less frequently. Even two bits instead of one allow 3 members to share a hash.

You can even combine both solutions if your system has downtime and you don't mind burning energy. While the system is not actively looking up information, it can recalculate the hash table to fit it into a smaller memory footprint. If a particular spot has 5 members and the system has idle time, it can attempt to recalculate the hash table to reduce the count back down to 2, and then reorganize the data to fit in 2 bits instead of 4. My suggested rules would be: if a counter is at its max, reorganize the filter to use the next power of 2 bits per spot. If the highest count is above 2 and there are no pending requests, recalculate the hash filter until all counters are less than 3. If no counter is filled up to half minus one (countMax => 0b011...1), reorganize the filter to use the previous power of 2 bits per spot.
@lancemckenzie1074
@lancemckenzie1074 Жыл бұрын
Didn't you have to do that back in the 90s - trigger memory compression, that is? An automatic reorganisation routine sounds like garbage for disk longevity. I think at small numbers of entries, most of the array is wasted space. Can we make the existing array do more work by using data which is not a count number? Perhaps a two bit addition to each array bit considers "rotation" of the bit, so that each array bit is more expressive.
@eitantal726
@eitantal726 Жыл бұрын
It IS a hash table, until the point where you combine multiple hashes and logically OR them into the same map. THEN it becomes a bloom filter.
@weckar
@weckar Жыл бұрын
So how is an array of bits you have to check with possible overlaps better than an ID list of all books in the system? Lookup time would be the same or better with the ID list?
@denzelchukwuebukaachonu
@denzelchukwuebukaachonu Жыл бұрын
Imagine you have thousands of books in the library. If someone requests a book that is not present, the only way to confirm that with a lookup table would be to check every single entry, from the first one through thousands of entries to the very last, before you can be sure it isn't in the library. With an array of bits, you simply hash the title and check the output bit locations for true or false.
@weckar
@weckar Жыл бұрын
@@denzelchukwuebukaachonu I don't get how that's not the same...
@lare290
@lare290 Жыл бұрын
@@weckar the size of the bloom filter scales with something like the logarithm of the dataset, so when the dataset grows to billions upon billions of entries, the bloom filter only grows a little bit. it'll be much faster to search because it's much smaller than the dataset.
@DerMarkus1982
@DerMarkus1982 Жыл бұрын
You missed a good opportunity here to put a certain book in spot #42 of your Bloom Filter Array...
@lancemckenzie1074
@lancemckenzie1074 Жыл бұрын
Yeah, Do androids dream of electric sheep?
@nagasatisha1
@nagasatisha1 2 жыл бұрын
nice
@rayoflight62
@rayoflight62 Жыл бұрын
I leave you with this question: how can your subscribers and other viewers convince you to make new videos? All the best in any case, & greetings...
@noodlish
@noodlish 10 ай бұрын
Great explanation but what's going on with that wall?! I feel like it's encoding something I'm not smart enough to spot.
@amiriqbal1689
@amiriqbal1689 Жыл бұрын
This seems like it would not work. Suppose:
Book A - hash values: 1, 2, 3
Book B - hash values: 4, 5, 6
Bits 1, 2, 3, 4, 5, 6 will be turned on. So now, any book with hash values (2, 3, 4) or (4, 5, 1) will seem like one we have, but we do not.
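The scenario described above can be reproduced directly, with hard-coded positions standing in for the hash outputs. This false positive is exactly the trade-off the video describes, which is why a positive answer still has to be verified against the real stock:

```python
bits = [False] * 10

for idx in (1, 2, 3):   # Book A's hash positions
    bits[idx] = True
for idx in (4, 5, 6):   # Book B's hash positions
    bits[idx] = True

# A third book that happens to hash to (2, 3, 4) looks in stock:
print(all(bits[i] for i in (2, 3, 4)))  # True -- a false positive

# But any book with an unset position is a guaranteed negative:
print(all(bits[i] for i in (4, 5, 7)))  # False
```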
@alessioolivieri5460
@alessioolivieri5460 Жыл бұрын
So, to delete a book we might use counters instead of bits. When a book is removed you decrease counters by 1
@anjarwicaksono229
@anjarwicaksono229 Жыл бұрын
If you remove it, just create another bloom filter to record the things that have been removed. If a book is found in both the first and the second bloom filter, it means it was there but is no longer there. If it is found only in the first filter, it is still there. Limitation and challenge: when a book is restocked.
@1vader
@1vader Жыл бұрын
This doesn't really work, because now there's a chance that the second bloom filter will by chance indicate that an unrelated book, one that just happens to hash to the same bits, was removed.

Bloom filters are only useful because they tell you for certain that you don't have a book when its bits aren't on; the fact that they can sometimes incorrectly indicate that you do have a book isn't a problem, because you can just check to make sure.

But in your example, assuming the first bloom filter indicated you do have a book: if the second bloom filter doesn't indicate the book, you definitely have it, but if it does indicate the book, you might still have it, so you have to check either way, and the bloom filter provides no value. You'll also continue to think you still have all the books you used to have but sold. Alternatively, if you just assume you don't have a book anymore whenever the second bloom filter indicates it, you may now tell customers you don't have a book even though you do, which is a property the original doesn't have. The original is always correct in that direction.
@adituta8660
@adituta8660 3 ай бұрын
If you want to delete an element from the bloom filter, you have to remake the filter
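Rebuilding is the simplest correct deletion strategy: recompute the filter from the titles still in stock. A rough sketch, with the hash construction and sizes as illustrative assumptions:

```python
import hashlib

def build_filter(titles, size=100, num_hashes=3):
    """Recompute the whole bit array from scratch. O(n) work per
    deletion, but the resulting filter is always correct."""
    bits = [False] * size
    for title in titles:
        for i in range(num_hashes):
            digest = hashlib.sha256(f"{i}:{title}".encode()).hexdigest()
            bits[int(digest, 16) % size] = True
    return bits

stock = ["Moby Dick", "Dracula", "Emma"]
bits = build_filter(stock)

stock.remove("Moby Dick")    # a book sells out...
bits = build_filter(stock)   # ...so rebuild from what remains
```

The counting-filter approach discussed elsewhere in the thread avoids the O(n) rebuild at the cost of a larger array.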
@syedqasim81
@syedqasim81 Жыл бұрын
Or make the list in an Excel sheet and just Ctrl+F
@kucingoyen1
@kucingoyen1 Жыл бұрын
Lol
@NathanHedglin
@NathanHedglin Жыл бұрын
O(n) time oofda
@garrylove8955
@garrylove8955 Жыл бұрын
This is similar to CRC error detection
@Ggdivhjkjl
@Ggdivhjkjl Жыл бұрын
1:33 Looks like an Aussie senate ballot paper.