Okay, so I'm going to talk about Bloom filters on Redis. I think the earlier introduction covered what Bloom filters are, so this talk is probably 20% code and 80% theory: what this stuff is, how you can use it, and how we've used it — which is probably how you can use it as well. I'm shrieks on Twitter.

The first thing I want to get to is the intuition for what this thing is. People treat HyperLogLog and Bloom filters as magic stuff that somehow compresses data. So start here: if you want a really fast lookup, an O(1) lookup, what do you do? The easiest answer is a hash table. But a hash table is probably not the right thing when you have gigs of data — a billion elements, a billion videos to index — because it takes that much memory. So what can we do? One easy option, if data keeps coming through, is to drop stuff: expire the older entries and just keep some sort of count. The problem is you've lost data; you no longer know whether a thing ever existed. So maybe there's a technique to mix things up in the same place — take two IDs, for example, and stick them into the same slot. Would that work? Would that do some magic?

One way to think about a Bloom filter is as a holograph. Have you seen those holographs where you see two different things as you twist the image around? That's the intuition. Bloom filters are probabilistic: you'll never know for certain whether an element existed, but you'll get a good idea of whether it's there. And they're really space efficient — I'll give some numbers later about how efficient.
They're used for lookups only. This is not a general-purpose store — you can't say "give me all the data about this key." All it answers is whether an element exists in a set or not. It's as simple as that; that's all it does. Was this video, was this page, in the cache? Well, actually it can't even say for sure that it was in the cache — it's more like the opposite of that, and I'll come to it. Basically, it's used to look up elements. Simple.

So let me set this up. Say there's a data structure with 8 bits, and I use the same hash method we talked about: modulo 8. Initially every bit is 0. Now I want to store 10. Since this is an 8-bit structure, the easiest hash is modulo 8: 10 mod 8 is 2, so I set bit 2 to 1 (I'm using zero-based indexing). Before I go further — this is all bit magic; you're setting certain bits up or down based on some logic. If I now ask "was 10 inserted?", all I do is hash again, find bit 2 set, and say yes, 10 exists. Simple.

Say I insert another element, 3. 3 mod 8 is 3, so I set bit 3. All good. If a lookup for 3 comes through, I do the modulo, bit 3 is 1, so I know it's there. Now say 2 comes in. 2 mod 8 is 2, but that bit is already set — a collision. What do we do? We don't overwrite anything; we just set it to 1 again. Lookups stay simple: for 6, 6 mod 8 is 6, nothing is set there, so it doesn't exist — and we know that for sure, because that bit was never set. What about 18? 18 mod 8 is 2, and bit 2 is set. So it seems that 18 is there, right?
Still on the same slide: the bit is there, even though 18 never came in and was never inserted — yet we think it's there. That's what a false positive is. With 6, I know for sure it's not there; with 18 (or 2), you can't be sure. You think it's there, but at bit position 2 you could have 2, 10, 18, 26 — whatever your range of data allows. Still making sense? Cool.

So that's the intuition: you're storing multiple things in one place and answering "something is there" or "something is not there." We can be really sure there are no false negatives — if the filter says 2 is not there, it's definitely not there. But as we just saw, there can be false positives.

This was invented by a guy called Burton Bloom, around 1970, and it's been used in a lot of applications since; I'll come to that. One thing you see with this structure is that it uses a fixed amount of memory — it doesn't have to grow. If you can afford false positives, you can keep your memory fixed. Two things affect the false positive rate: the amount and range of the data you're sticking in (if my values run only 0 to 100, obviously the collisions pile up and false positives increase), and the size — the number of bits you have.

There's one other thing before we get to code. We were using just one hash function, but you can use multiple hashes. It's easy: the first one stays the same, and the second is modulo some other number — it works better with odd numbers, I guess. With two hashes, 18 no longer appears to exist. Does that make sense?
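A minimal sketch of that walk-through in Ruby — the class name and the choice of 7 as the second modulus are mine, just for illustration:

```ruby
# Toy filter over 8 bits, parameterized by a list of hash functions.
class ToyFilter
  def initialize(hashes)
    @hashes = hashes          # array of lambdas mapping a number to a bit index
    @bits = Array.new(8, 0)   # everything starts at 0
  end

  def add(n)
    @hashes.each { |h| @bits[h.call(n)] = 1 }  # collisions just set the bit again
  end

  def include?(n)
    @hashes.all? { |h| @bits[h.call(n)] == 1 } # any unset bit => definitely absent
  end
end

mod8 = ->(n) { n % 8 }

# One hash: 18 collides with 10 at bit 2, so it looks present.
one = ToyFilter.new([mod8])
one.add(10); one.add(3); one.add(2)
one.include?(6)   # => false, definitely never added
one.include?(18)  # => true, a false positive

# Two hashes: 18 now has to match on both bit positions, and it doesn't.
mod7 = ->(n) { n % 7 }   # second hash with an odd modulus, as mentioned
two = ToyFilter.new([mod8, mod7])
two.add(10); two.add(3)
two.include?(18)  # => false
```

The key property shows up in `include?`: a single unset bit is enough to say "definitely not there," while all bits set only means "probably there."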
Sorry, let me go back. I stick 3 in, same example as earlier, and I ask about 18. In the previous case 18 was a false positive — we "found" it. With multiple hashing, I hashed twice and stored both bits, and now 18 doesn't exist, because not all of its bits are set. So in general, the more hashing you do, the lower the probability of false positives. Except that's not exactly true, because there's an optimal number of hashes for a given false positive rate, and there's a formula for it. What Bloom did is work out the math. I won't go deep into it, but there are parameters you can tune.

You know the number of items you have — in my case around 100 million — and you pick the false positive probability you want. You keep that a really small number based on how often you can afford to go back to the database: if database access is, say, 100 times slower, you tune based on how many of those trips you can afford. I'll give you real numbers in a minute; that's more useful. That's the formula on the slide.

Let me just confirm the number — yes, it's 100 million, that's my real number. The reason we had to build this sort of thing is that we have a whole lot of videos, and for each incoming one we check whether it's a duplicate or not. The initial seed for that is more than 100 million items. And we decided we were happy to go to the database once for every 100 requests — basically we can afford a 0.01 false positive rate. So eventually — I mean, this is the theoretical memory.
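The formulas behind that slide are the standard Bloom filter sizing equations; plugging in the talk's own numbers (100 million items, a 1-in-100 false positive rate) reproduces the figures that follow:

```ruby
# Standard Bloom filter sizing:
#   m = -n * ln(p) / (ln 2)^2   optimal number of bits
#   k = (m / n) * ln(2)         optimal number of hash functions
n = 100_000_000   # expected items (~100M video IDs)
p = 0.01          # acceptable false positive rate (1 in 100)

m = -(n * Math.log(p)) / (Math.log(2)**2)  # bits
k = (m / n) * Math.log(2)                  # hashes

puts "bits:   #{m.round}"                  # ~958 million bits
puts "memory: #{(m / 8 / 2**20).round} MB" # 114 MB
puts "hashes: #{k.ceil}"                   # 7
```

Note how k only depends on p once m is chosen optimally: k = -log2(p), so a 1% rate lands on 7 hashes.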
This is not the real memory we see in Redis, but theoretically all it takes is 114 MB of RAM to store this data structure, hashing 7 times — that's the optimal number of hashes for that false positive rate. So we're seeing about three orders of magnitude difference in RAM usage, which is really, really good. And since we only go to MySQL once every 100 requests, the filter is protecting our MySQL database very well too.

One thing to note is that Bloom filters don't have deletes. If I delete an item from the database, which bits do I reset? Other elements may share them, so that's a really hard problem. Well — not impossible: there are counting Bloom filters, which sort of let you do that, but it's not that simple.

This chart is straight from Wikipedia, sorry, no attribution. It's the false positive probability as a function of m, the size: depending on the size, you figure out where you want to lie in that probability range.

So who uses Bloom filters? Have you heard of people using them? Sorry? Oh yeah — lucky us. Bitly is supposed to use them, to check whether a short URL actually exists rather than being random stuff. Google Chrome uses them as well. A whole lot of people use them, usually in memory. I'll tell you how we hashed the videos; I'll probably even show the code. I didn't understand the question about probability — sorry, what was the second part? Yes, correct: the assumption is that the hash output is completely random. In the example I'll use CRC, which is a really fast hash.
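As a sketch of how a counting filter makes deletes possible — each bit becomes a small counter, so removing one element doesn't clear a slot other elements share. Single hash and a tiny table here, purely illustrative:

```ruby
# Counting Bloom filter sketch: each slot is a counter instead of a bit,
# so a delete can decrement rather than clearing a shared bit.
class CountingFilter
  SIZE = 8  # illustrative; real filters are far larger

  def initialize
    @counts = Array.new(SIZE, 0)
  end

  def add(n)
    @counts[n % SIZE] += 1
  end

  def delete(n)
    i = n % SIZE
    @counts[i] -= 1 if @counts[i] > 0
  end

  def include?(n)
    @counts[n % SIZE] > 0
  end
end

cf = CountingFilter.new
cf.add(10)
cf.add(2)          # same slot as 10 (both hash to 2): counter is now 2
cf.delete(10)      # counter drops to 1, so 2 still looks present
cf.include?(2)     # => true
```

The cost is memory: several bits of counter per slot instead of one bit, which is part of why it's "not that simple."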
A lot of people actually say don't use CRC, because it doesn't have the right statistical properties — a good hash gives you an equal probability of ones and zeros in the output, and you don't quite get that with CRC. This hashing is for the videos: in our case the IDs are what we hash — the video ID, each one has its own — and we hash that in.

Squid uses Bloom filters for cache digests. Cassandra uses them to protect the SSTables: it has an in-memory store and it has the SSTables, which live on disk and are orders of magnitude slower to access, so there's a Bloom filter in front of them. Bitcoin uses them for payment verification: a client doesn't have to go through the whole payment history over the network to figure out whether something was actually paid, or double-paid — the Bitcoin implementation uses a Bloom filter for that.

The hard part is the tuning bits, right? Basically: how often can you afford to go back to your backend, your SQL storage? I think that's better done at the application level — what probabilities, what ranges are you talking about? It becomes obvious when I give you an example. You could automate it, I suppose, but it makes more sense to look at your range, how big your data is, and then make those decisions.

So: we have a hundred million videos, with a whole bunch of metadata in MySQL. We add 200,000 videos every day, 40% of which are duplicates, which means we were making a whole lot of queries to the database just to discover "hey, that's already there."
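One cheap way to get several hash values out of CRC32 alone is the double-hashing trick, position_i = h1 + i·h2 mod m. This is my sketch of the idea, not necessarily the speaker's exact code; the size and hash-count constants are taken loosely from the talk's numbers:

```ruby
require 'zlib'

M_BITS   = 1_600_000_000  # filter size in bits (~200 MB), per the talk
K_HASHES = 13             # number of hashes, per the talk

# Derive k bit positions for one key from two CRC32 values using
# double hashing: position_i = h1 + i * h2 (mod m).
def bit_positions(video_id)
  h1 = Zlib.crc32(video_id)
  h2 = Zlib.crc32(video_id, h1)  # re-seed with h1 to get a second value
  (0...K_HASHES).map { |i| (h1 + i * h2) % M_BITS }
end

bit_positions("video:12345")  # => 13 bit indexes in [0, M_BITS)
```

The appeal is speed: one pass of CRC32 is far cheaper than 13 independent hash functions, and as the talk notes, hash computation — not Redis — is the slow part.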
And MySQL was really loaded — it was doing really high QPS — so it made sense to pull some of that out and take load off it. Based on the false positive rate we set up, in reality we went from 200,000 requests a day to about 10 requests a day, which is a significant improvement from the MySQL perspective: we don't have to scale MySQL as much.

So basically you ask the Bloom filter: does this video exist? If it says no, you go write it to the database and you're all good. If it says yes, we go to the database and check whether it's really there — but that's a really small number, in our case something like 1 in 1,000 or 1 in 10,000. Yes, that's the number we tuned for. There's a really good calculator site I used; it's linked here.

We call it the fast video index cache. The key bits are right here: that's our size, the m we decided on — the number of bits, close to 200 megs — we hash 13 times, and that's the key we store to. What you're doing is basically creating a bit mask — figuring out which bits to set to 1 — and then doing an OR at the end. With the Redis operations it's not really slow; it's the computation of the hashes that's slow. Yes — more hashes improve the statistics, fewer collisions coming in; it's really about what your false positive rate should be. Oh, and if we get a positive, we always go to the database. We don't trust the positives — you can't trust a positive.
So we want to lower the false positive rate — but it's a log relationship. As you keep reducing the false positive rate, the memory grows, and if you push it all the way to 0, you're back to storing the full 100 gigs of data. So, this slide: that's our 100 million items — I don't know how many zeros that is — and a false positive rate of 0.01 percent. The calculator figures out the optimal number of hashes to hit that rate. And note that if I push the false positive rate down further, the optimal number of hashes goes up with it. No, the data isn't changing — what changes is how often I get a wrong "yes"; it's the complement. Cool, sorry, hope I'm making sense.

One more constraint: a Redis string can only address 2^32 bits, so a single bitmap tops out at 512 MB. That's a hard limit you have with Redis. No, I think it's the same with 64-bit builds — and the number of keys is not what I'm talking about; the size of one string value is what I'm talking about. So this is just one key with a huge bit string, nowhere near the limit in our case — about 200 megs.

So this is the real implementation. If you're familiar with Ruby, I think this will be really easy. I use the CRC thing to figure out the bit string — basically which bits to set to 1. Writing is just inserting, and the DB pipeline — somebody was talking earlier about pipelining all the operations — makes the response much faster because Redis isn't stuck in a request-response loop per bit. And all I do is SETBIT.
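The pipelined SETBIT/GETBIT part might look roughly like this; `add_bits` and `all_set?` are illustrative names of mine, and `client` can be a `Redis` instance from the redis-rb gem (anything that responds to `pipelined`, `setbit`, and `getbit` works):

```ruby
# Key name taken from the talk; positions are the k bit indexes
# computed by hashing one video ID k times.
KEY = "fast_video_index_cache"

# Insert: pipeline all the SETBITs so there is one round trip,
# not one request/response cycle per bit.
def add_bits(client, positions)
  client.pipelined do |pipe|
    positions.each { |pos| pipe.setbit(KEY, pos, 1) }
  end
end

# Lookup: pipeline the GETBITs; any 0 bit means "definitely not seen".
def all_set?(client, positions)
  bits = client.pipelined do |pipe|
    positions.each { |pos| pipe.getbit(KEY, pos) }
  end
  bits.all? { |b| b == 1 }  # a "yes" here may still be a false positive
end
```

With redis-rb that would be `add_bits(Redis.new, positions)` and then `all_set?(Redis.new, positions)` — and on a yes, per the talk, you still verify against MySQL.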
That's exactly the algorithm I showed you at the beginning. We also track some statistics — how many elements are in, how many false positives we've seen — which is where all the numbers I've been quoting come from, like 10 false positives against 500-odd negatives.

This was done on Redis 2.4, which didn't have the bit operation commands. In this code you can see I fetch the specific bits and combine them myself: if any of them is zero, the answer is false. You could delegate that to Redis itself with bit operations, but it's kind of complicated. One reason we didn't is that we were on 2.4 and the bit operations only arrived in 2.6 — and they're O(n) anyway, so we didn't worry too much about it. You could use bit ops to iterate over the index, check the bits, and answer exists-or-not. Why is this better than doing it locally? Okay — we want to scale this. We have at least 100 crawlers running to fetch all the data, and we don't want to maintain a Bloom filter inside each crawler. We want one place where any of them can check: is this video already there or not? Basically the question we ask the filter is "is it not there?" If it's not there, we store straight to the database. It's probably a 10-machine cluster running all this madness, so we can't have a Bloom filter in each one — we keep it serialized in Redis, which gives us the bit operations, and I'm pretty sure my own implementation would be buggier than Redis. So that's pretty much the code implementation. Does it make sense?
Basically, we went from 200,000 requests a day — two or three requests a second — down to about 10 requests a day. That saves MySQL some real scaling, and it takes just 200 megs in Redis. That's the beauty of this sort of thing. Oh yeah, I'm done. Thank you. Questions?

Oh — same answer as before: we want the Bloom filter out of process, shared across multiple instances of the crawlers. They could be running on the same machine or on some other machine, and we don't want them communicating with each other. Did I answer that question? Anything else?