of a New York City-based startup called B12 that's focused on the future of creative and analytical work. And today I'm gonna draw for you a few sketches of sketches, which has absolutely nothing to do with my day job. So in short, a sketch is a data structure that summarizes some data set, and it does this by being a little bit less accurate than perfect in exchange for taking up a lot less space. And you've already worked with sketches in the past that are perfectly accurate. So if you've ever counted something or summed a data set, then you've worked with a sketch. In the space of just a single integer, you can summarize how many items are in a data set, up to data sets of size, whatever, 4.2 billion. But there are other questions you might have about your data set, like how many items of a particular type are in the data set, or is a particular item in the data set, that you can't answer completely accurately without storing the entire data set. And if a data set is really large, that might be a prohibitive problem. So today we're gonna talk about two sketches, one called a Bloom filter and the other called a count-min sketch, that help answer these questions approximately. Let's start with Bloom filters. So a Bloom filter answers this question of whether a particular item X can be found in my data set D. And if you've ever played around with something like Google's Chrome, then you've interacted with a Bloom filter. You wanna go to some URL, you might go to badurl.com, and you want your browser to stop you from going there if it's in some list of malicious URLs. Luckily, Google, in its data centers, has a few gigabytes' worth of bad URLs in some data set. But that's way too large a data set to distribute to every Chrome installation. So Google has packaged those up into a Bloom filter that takes up only a few tens of megabytes, and that's something that you can actually distribute to every browser. Let's see how it works.
So at its core, a Bloom filter is a bit set. It's an array of bits that all start off with the value zero, as well as a collection of hash functions. So in this case, we have three hash functions, and they'll map the key that we're looking to insert or look up onto this bit set. So let's see it in action. I wanna insert badurl.com into the bit set for my Bloom filter. I'm going to hash badurl.com against the three hash functions. Those three hash functions are going to map me onto three locations in my bit set, and at those locations, I'm gonna flip the bits from zero to one. Now let's say I wanna add anotherbadurl.com into the bit set. I'm gonna hash anotherbadurl.com against the three hash functions, find myself at three new locations in the bit set, and flip those bits to one. And you'll notice here, actually, that one of the hash functions has mapped anotherbadurl.com onto the same location as badurl.com. So anything that was previously set to one just remains a one. Now let's look up a key in this bit set. Let's say that we're going to goodurl.com. We are going to hash goodurl.com against the three hash functions, find ourselves at three places in the bit set, and we know that if goodurl.com had previously been inserted into the malicious URL list, then all three locations would have been ones. But in fact, one of them is a zero. So we have 100% certainty that goodurl.com is not in the malicious list of URLs. But it doesn't always work out so well. So let's say we're going to niceurl.com, which I can assure you is just a nice URL. It's not in the malicious list of URLs. And unfortunately, when we hash it three ways, we end up at three locations that overlap with other URLs that are in the malicious set. So we see three ones, and we're led to believe that niceurl.com is actually a malicious URL. This is called a false positive. And we can summarize Bloom filters as guaranteeing to us that an item is not present in the data set if we see any zeros when we look it up.
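The insert and lookup mechanics described above can be sketched in a few lines of Python. This is a minimal illustration, not anything from the talk: the class name, the bit-set size, and the trick of simulating independent hash functions by salting SHA-256 are all my assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bit set plus k salted hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # the bit set, all zeros to start

    def _positions(self, key):
        # Simulate k independent hash functions by salting a single hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        # Flip the bit at each hashed location to one.
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # Any zero means the key was definitely never inserted;
        # all ones means "probably present" (false positives possible).
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add("badurl.com")
bf.add("anotherbadurl.com")
print(bf.might_contain("badurl.com"))   # True, guaranteed for inserted keys
print(bf.might_contain("goodurl.com"))  # almost certainly False at this fill level
```

Note the asymmetry: a negative answer is certain, while a positive answer is only probable.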
But Bloom filters also have this really small chance of telling us that an item is present in the data set when it's actually not. And the question is, how small a chance do we have of hitting a false positive? Luckily, there's some math that can help us with this. There's a formula that relates the number of bits in the bit set, so basically the size that we've allocated to this bit set, to two different variables. One is the number of items that we'd like to insert into the bit set. So the more items, the greater the likelihood of false positives. And the other variable is the false positive rate. But I'm not expecting you to look through this formula and parse it. I've actually made a handy-dandy chart for you that shows that you can achieve really low false positive rates, something like one in every 10,000 lookups resulting in a false positive, in exchange for just 20 bits per item that you're inserting. And to make that concrete, let's imagine that we want to achieve a false positive rate of one in every 10,000 lookups. We've allocated 20 bits per item that we're going to insert into our data set. And let's imagine that this is our malicious URL list. There are 10 million URLs. Each of them takes up, on average, 30 characters. Well, if you multiply that out, the strings alone in this malicious URL set come to around 300 megabytes, and with storage overhead the full data set is even larger. So it's prohibitive to send that to every browser. But the Bloom filter at 20 bits per item only takes up 24 megs, and so it's really easy to transfer that to everyone. Let's jump into the next sketch, called the count-min sketch. This sketch helps us approximate how many times a particular item appears in our data set. And to motivate this, let's imagine that we've crawled the web. We have the entire web corpus at our disposal, and we want to know approximately how many times each word appears on the web. So everyone in this room is cool.
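The sizing math behind that chart is the standard optimal-Bloom-filter formula: m = -n ln(p) / (ln 2)² bits for n items at false positive rate p, with k = (m/n) ln 2 hash functions. A quick check (my own worked example, using the talk's numbers) confirms the roughly 20-bits-per-item, 24-megabyte figure:

```python
import math

def bloom_sizing(n_items, fp_rate):
    """Optimal Bloom filter parameters for n items at a target false positive rate."""
    # Number of bits: m = -n * ln(p) / (ln 2)^2
    m_bits = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
    # Number of hash functions: k = (m / n) * ln 2
    k_hashes = round(m_bits / n_items * math.log(2))
    return m_bits, k_hashes

m, k = bloom_sizing(10_000_000, 1 / 10_000)
print(m / 10_000_000)  # ~19.2 bits per item
print(m / 8 / 1e6)     # ~24 MB for the whole filter
print(k)               # 13 hash functions
```

So a one-in-10,000 false positive rate really does cost just over 19 bits per item, which for 10 million URLs comes out to about 24 megabytes.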
We all know the most popular word on the web is going to be cats. But there are other words whose counts we might want to approximate. So there's cats and bats and flats and mats and drats, and we want to know approximately how many times each one appears. The problem is that there are a lot of unique terms, or words, on the web, and so we're going to have to approximate this. So let's see how that works. We again have a collection of hash functions, just like in our Bloom filter. But unlike in the Bloom filter, we're not mapping those hash functions onto a single array of bits. Instead, each hash function maps us onto its own row of counters. So in this case, we have three hash functions, we'll have three rows, and each of the hash functions maps onto its own row of counters. So let's insert the word cats into this count-min sketch. We encounter the word cats and want to increment its count by one. So we'll hash cats against the first hash function, the second hash function, and the third hash function, find the location that each hash function maps us onto in its individual row, and take the value that was previously there and add one. So in the first row, it was a 12; we'll increment it by one to be 13. Now let's use this count-min sketch to approximate how many times the word flats appears. So we'll hash flats with the three hash functions, be mapped onto three different locations, one per row, and we end up with these three numbers, these three counters: a five, a seven, and a 12. And we want to get the approximate count of the word flats. Well, what does each of these counters represent? It represents the number of times that we've incremented the counter for the word flats, as well as for any other word that has unfortunately overlapped with it under that hash function. And so each of these numbers is actually an overestimate of the number of times that the word flats appears in our data set.
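The increment-every-row, take-the-min-at-query-time structure can be sketched as follows. Again, this is an illustrative toy under my own assumptions (salted SHA-256 standing in for the hash functions, arbitrary width and depth), not the talk's implementation.

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: one row of counters per hash function."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth  # number of hash functions / rows
        self.rows = [[0] * width for _ in range(depth)]

    def _position(self, row, key):
        # Salted hash so each row behaves like an independent hash function.
        digest = hashlib.sha256(f"{row}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def increment(self, key, count=1):
        # Bump the key's counter in every row.
        for r in range(self.depth):
            self.rows[r][self._position(r, key)] += count

    def estimate(self, key):
        # Each row can only overestimate (collisions add, never subtract),
        # so the minimum across rows is the tightest estimate.
        return min(self.rows[r][self._position(r, key)] for r in range(self.depth))

cms = CountMinSketch(width=1000, depth=3)
cms.increment("cats", 1000)
cms.increment("flats", 5)
print(cms.estimate("flats"))  # at least 5; very likely exactly 5 at this size
```

The guarantee is one-sided: an estimate is never below the true count, and it is only inflated when other keys collide with this one in every single row.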
We'll take the min of them to get the most accurate approximation, in this case five, and that's where the name count-min sketch comes from. So in summary, a count-min sketch is going to provide us with an overestimate of the count of items in a data set, and that overestimate tends to be more meaningful for the frequent items, the heavy hitters in our data set. The intuition here is, if you're a really popular item, you're going to drown out all the other counts that land on the same locations as you, and if you're a really infrequent item, then you're going to get drowned out by the heavy hitters. So some final thoughts. We've reached the end of the talk, unfortunately. The first is that today we talked about two sketches, a Bloom filter and a count-min sketch, but there are lots of other fun data structures that help us approximate various properties of our data sets. The one thing they all have in common is that they have these wonderful names, like the HyperLogLog and the t-digest. You should use sketches whenever you have some really large data set that you want to summarize in a small amount of space, or if you have an unbounded stream of data. Let's say you have network packets coming in out to infinity and you'd like to summarize them in a bounded amount of memory. With that, I want to invite you to embrace randomness. So in computer science and in software engineering, we're trained to think about how a system will either work in 100% correct fashion or how it'll break down. But in practice, no data set is 100% accurate. And so there are entire fields, like randomized algorithms and probabilistic data structures, that help us embrace randomness to give us really accurate approximations of what it is we're looking for. So with that, I'll let you get on your way. Thanks.