What are Bloom Filters?

What are Bloom Filters? - Hashing

Рет қаралды 106,247

Күн бұрын

Пікірлер: 77

@othmanee.h4738 5 жыл бұрын

Straight to the point, excellent content ! Allow me to add a quick reminder for myself in case I forget and for others in case they want to look into the details, probability requires a bit of mental warm up sometimes to fully understand : A bloom filter can be seen as F = {0,1}^m For each word, hashing means setting k indexes in F to 1, such as F[ h_i(w) ] = 1 where h_i is the i-th hash function in { h_1, ... , h_k } We do this for all words in the dictionary/list to input Initially we put all values of F to 0, so F={0}^m Then we apply one hash like h_1(w) = some_index So now F[ some_index ] = 1 Then the probability of finding a bit at one, is literally one in m since there is only a single 1 in F: p1 = 1/m So that the probability of finding a bit at zero would be the inverse : p0 = 1 - p1 = 1 - 1/m (**) So after applying all k hash functions to the first word the probability of finding a bit at zero becomes : p' =(1-1/m)^k Then after hashing all words (or until the iteration i of the list of words), the probability of finding a bit at zero will be : p=(1-1/m)^(k*i) So now the probability of getting a one for a random word w, p(F[h_i(w)]=1) is the inverse of p which is : q = 1 - p = 1 - (1-1/m)^(k*i) Therefore if we want the probability of getting a false positive = an error, we want the probability of getting k times the value 1 in succession. So that F[h_1(w)] = 1 AND F[h_2(w)] = 1 AND F[h_3(w)] = 1 ....AND F[h_k(w)] =1 But we know from previous that p(F[h_i(w)]=1) = q = 1 - (1-1/m)^(k*i) Finally the probability of an error is e = p(F[h_1(w)]=1)*p(F[h_2(w)]=1)*...*p(F[h_k(w)]=1) = q*q*q...*q = q^k = ( 1 - (1-1/m)^(k*i) )^k ** In this calculation one could argue that after completing the first word, why wouldn't the probability of finding a zero be p' = 1 - k/m It's a valid remarque, we could say that after applying each h_1(w) to h_k(w) we've been adding a new 1 to F, so p1 = k/m While that is highly possible, since h is built so that each hash gives a different result, it wouldn't be mathematically correct , because that would violate the case where h_1(w) = h_2(w) for instance. (If you assume all other h_i(w) are distinct, then p' = 1 - (k-1)/m in that case) Then what if three hashes are the same, what if any random number n of hashes are the same, this formula would lead to incoherence. => My intuition to solve this question is that when you don't know how many ones are in your filter the experience of drawing a zero can be simulated as not drawing one for all possible values of h_i(w). So the probability of not drawing one (or drawing 0) for the first index h_1(w) is (1-1/m), then for h_2(w) it's the same (1-1/m) etc... until h_k(w) And so drawing a zero when you don't know exactly how many ones there are in your Filter is like not drawing one for the first hashed index, and not drawing one for the second etc... until not drawing one for the last hashed index. In fact the more I think about this the more I find it counterintuitive, I'm not 100% sure how to interpret this formula probabilistically, because when you say the probability of getting a 0 for a single hash, that's one operation, but when you say p = (1-1/m)^k that's the probability of multiple events happening in succession.

@Md_sadiq_Md 7 ай бұрын

I was writing blog on bloom Filters and thanks for this man 🔥

@fiveyearclub6024 5 жыл бұрын

i got bloom filters in an interview. i had read a wikipedia article on it, but i just didn't remember enough to say anything about it. this video was clear! you're a rock star.

@gkcs 5 жыл бұрын

Thank you 😁

@salookie8000 6 жыл бұрын

Excellent comment you made that it doesn't guarantee string has been searched but it does guarantee it has not been searched.

@gkcs 6 жыл бұрын

It takes some time to roll of the toungue :)

@shivas9911 6 жыл бұрын

Are you saying that with the right number of hash functions, the "cc" search should respond as not been searched?

@PiyushSingh-vx7bx 4 жыл бұрын

U r one of the best computer science KZbinr bro

@abhaypatil2000 4 жыл бұрын

This is the best channel for a cs student. We have many youtube channels which solve competitive programming problems. But who will make up for rest of the concepts. Especially when the IIT professors are not so great😅😅

@KshitijSharma-v7b 2 ай бұрын

best video on bloom filter. Gone through multiple videos but my concept got cleared in this

@mairasultan884 6 жыл бұрын

A simple and easy to understand talk. thumbs up

@ThePujjwal 3 жыл бұрын

17:32 Small tip. If sum of digits of a number is divisible by 9 it will be divisible by 9.

@rajeevanand6870 5 жыл бұрын

Nice simple explanation of Bloom Filter. Only he got bit confused during 23:12-23:24. It's simply "1 divided by 1 is 1, and a minus 1 is 0... so on". Anyways good talk.

@rajashekarreddy007 3 жыл бұрын

Yeah, I was looking in the comments section to see if anyone has caught that.

@TheGenerationGapPodcast 2 жыл бұрын

Why point out the negative 0.000000001%?

@sujitkumar-xs2wy 6 жыл бұрын

one of ur''s best lecture ...@gaurav

@sagnikbilltrim4716 4 жыл бұрын

I liked the video very much, except the fact u forgot "p" comes after "o" not "q" :P

@jairajkhushalani99 5 жыл бұрын

Great stuff Gaurav!!

@karanmjanthe2521 Ай бұрын

really well explained

@BRAJAGOP 6 жыл бұрын

Good talk on bloom filter and liked the math part, would have loved if you solved the optimal part as well, that is something for me to look up ;) .

@gkcs 6 жыл бұрын

That math was too complicated to talk about here :) Thanks!

@abhishek_sengupta 4 жыл бұрын

Very nicely explained!! Thanks.

@priyankagrawal9333 3 жыл бұрын

Amazing!! Thanks for sharing!!

@randomizing-life7016 5 жыл бұрын

great video, great explanation on topic that im just starting to get familiar with. but i couldnt lay my eyes off you. first YT engineer lecturer that is handsome like hell.I know purpose of the video is educational and all comments are about your lecture but damn i had to comment your good looks too :)

@gkcs 5 жыл бұрын

Hahaha, thanks!

@shreyageek 3 ай бұрын

no one explains a single question with bloom filter , but its actually asked in microsoft interview

@saaqibz 5 жыл бұрын

Great video and clear explanations, thanks! small note: wouldn't comparing two strings be a complexity of O(min(m,n))? since you don't really need to compare past the end of one string after the end of the other is finished.

@gkcs 5 жыл бұрын

Yes, that's correct. I am assuming n to be the length of the smaller one :)

@Atpugtihsrah 6 жыл бұрын

1/1 is 0 :D Bdw amazing video! Thanks!

@gkcs 6 жыл бұрын

😁

@ayushverma3863 6 жыл бұрын

Well explained Gaurav. I was curious how companies like Facebook handle the false positives while using Bloom filters.

@gkcs 6 жыл бұрын

Thanks Ayush!

@muffinproject 6 жыл бұрын

In the case presented here (caching a search) they probably wouldn't do anything about it. It's just used to cut down on unnecesary caching, NOT to definitively prove a string has already been searched. Let's say that implementing this generates a 'not in set' result 90% of the time, and has a 50/50 chance of being a false positive. Instead of caching 100% of cases, we now toss away 90% of them - saving equally as much storage space. Yes, we do save twice as much as is ideal, but we're saving a lot. Ideally we'd only save 5% but then we'd have to compare the actual string to a very very large array of previous searches which takes longer to compare (and, ironically, use up some storage space in the form of a larger string-array than the bloom array) Compare it a bit to the game 'guess who?' where you have a bunch of people and have to find out whoever your opponent picked. You could ask naïvely; is it Adam? Is it Bertha? Is it Cedric? Is it Diane? - but that takes a long time. You could instead ask; Do they have blue eyes? Are they male? Etcetera. Which reduces the possibilities a lot faster.

@himalayan006 5 жыл бұрын

Very informative, really enjoyed.

@jeffreyqiu5128 3 жыл бұрын

absolute fire

@PrateekMehtaABDFAN 2 жыл бұрын

Great video sir . Huge fan of your content . Quick question : Rather than Multiple bloom filter wouldn't it be better to implement count min sketch or do we have an advantage of using multiple bloom filters over count min sketch ?

@krishnapatni 5 жыл бұрын

very informative and you made it simple to understand. can the bloom filter indicate that the string being searched is a substring of another string that was searched before ? for e.g. if matthew was searched before mat, then m a and t bits have already been set and this shows mat is a substring for some another string already searched..

@satviknema8629 7 ай бұрын

having some trouble understanding this: Probability of finding that a bit is set after 'i' iterations of 'k' hash functions is 1 - (1-1/m)^(k*i) Why is this value not simply (1/m)^(k*i)

@gudduagrawal8523 10 ай бұрын

when i heard 1/1 = 0😶‍🌫

@ruhinapatel6530 5 жыл бұрын

My quest is that basically from what I understood..bloom filter tells u whether the string is searched or not..n if yes then it stores it in the cache..but isn't there a lot of overhead for the string search..and performance issues..

@lakshyaagarwal4044 4 жыл бұрын

The video was great but if you had used Blue instead of Red it would have been better.

@Kavishkhullar 11 ай бұрын

@gaurav (1- 1/m)^(k*i) is the probability of us not not setting a bit after running i string through k hash functions. subtracting that from 1 gives us the probability of us setting the bit to 1. Could you explain "the bymistake" part? how it that the probability of us by mistakly setting that bit?

@vivekmit 3 жыл бұрын

I am not sure about your derived probability formula for error rate in bloom filter but not sure whether the ordering of bloom hashing bit set.... like hashing 1,2 ,3 for any key is (1,1,3 ; 3,1,1; 3 3,1; 333, 111.....any combination of 1,3) all falls under same consideration and treated as key already being searched...I don't think in practice it is a justified in any system design component.

@parvathysuresh2821 4 жыл бұрын

ur expressions are cute and ur explanations also good bro...

@siarez 6 жыл бұрын

Thanks, that's a great explanation! What does 'n' mean @23:55? Is that the size of our vocabulary? Why do we need it here?

@gkcs 6 жыл бұрын

Thanks! N is the number of words we will be searching in the bloom filter. 😁

@7301G 9 ай бұрын

@11:42 I have 1 question if the string is "anas" and the other string is "sana" so sana name string is new but when we make it go through hash function we will get all its bits set to 1 but its never have been searched ?

@mattb9916 6 жыл бұрын

Excellent!

@gkcs 6 жыл бұрын

Thank you!

@aravindm9687 6 жыл бұрын

great explanation! @13:10 how about incrementing the array index value from 1 to 2 to later find out whether the string is matched twice ? Instead of handling multiple arrays

@gkcs 6 жыл бұрын

aravind kumar That's a good idea and while work most cases. However, the second filter could use a different hash function, making collisions even less likely.

@prarabdhgupta1117 4 жыл бұрын

@Gaurav Sen, That link for Cache in the Description is hitting to a dead end, If you can update it, it'll be helpful.

@gkcs 4 жыл бұрын

Done. It's the wikipedia page on caching.

@rembautimes8808 3 жыл бұрын

Interesting idea, and perhaps can be paired with BitSet in Python

@vijayantpawar3137 Жыл бұрын

What is the need of using multiple hash functions?

@bite076_tania3 7 ай бұрын

to avoid probability of collision (two strings having same hash value)

@snehakavinkar2240 4 жыл бұрын

What if we have to check for millions of such words, do we just need to increase the size of bit array?

@gkcs 4 жыл бұрын

To reduce false positives, yes.

@pratyushranjan277 5 жыл бұрын

Hi gaurav bhai..i am a developer but i want to make my programming skills better...is there any algo book or programming book you would suggest..or any other way to improve my logic and programming?? Thanks

@shuvopaul3089 6 жыл бұрын

Expression😘😘

@gkcs 6 жыл бұрын

😛

@shuvopaul3089 6 жыл бұрын

Sir ...you communicate really well...which makes it more interesting

@GelinLuo 5 жыл бұрын

Curious to ask when do you call the `reset()` function to reset your bloom filter that determines whether a string has been searched or not in facebook

@gkcs 5 жыл бұрын

A threshold on false positives is a good idea. Over time, the bits in a bloom filter are all set. Here is a thread discussing some resets: github.com/ben-manes/caffeine/issues/85