All right. Hi. My name is Will, and I work on and with distributed systems and machine learning at Red Hat. I'm Eric, and I work on Will's team; I work with Will on the radanalytics.io community, where we explore use cases for intelligent applications in a collaborative way. So thanks for joining us for the workshop. Today we're going to talk about sketches, which are these really cool probabilistic data structures that let you get an approximate answer to interesting questions, and that you can run incrementally, in parallel, or on streams, while only using constant space. So who here deals with big data? I see a lot of uncertainty. What do you think of as big data? Something that you can't process on a single machine. That's a great definition, Marcel. I would agree with that, actually. It doesn't have to be enormous, right? You don't need a terabyte; you just need to get into territory where you have to do distributed computing. But the dirty secret about big data is that everyone sort of wants to have big data, or feels like they should have big data, but most problems don't require more than a single machine, right? You can scale up for many problems, but there are some problems where you have to scale out, or some problems where you have to use a streaming algorithm, and those are the kinds of problems that you're going to be able to solve with these techniques. So by way of background, I want to introduce the first time that I actually needed to do big data, and this is a very simple problem. I was always convinced that there's some value in being able to have distributed or parallel algorithms, but I never actually needed them to get work done. I was like, oh, that might make it faster, that might make it better in some way. But the first time that I actually needed them to get work done, I was doing compiler research. Anyone here have a background in compilers or computer architecture? Have you done simulation at all? A little bit. So when I was doing this work, and not to age myself, what you would do is compile a program with your research compiler for some imaginary instruction set that was related to, probably, the DEC Alpha or x86 or MIPS. And then you'd run it on a simulator, and the simulator would be cycle-accurate, so the simulator would know how many cycles of latency a memory access was, and so on. But the simulator was extraordinarily slow. If you had one of the SPEC benchmarks, which are very simple integer and floating-point programs, running 30 seconds of wall-clock time of a SPEC benchmark would take you about a week on the simulator. So what do you want to do? You want to characterize whether or not the optimization you've done improves your cache performance. How do you do that? Well, you run it on a cycle-accurate simulator, and you track the cache latencies, right? You track the mean and variance of the cache latencies. Well, if it takes a week to run your simulation, you're generating a ton of data. Every time you access memory, you have a cache latency. You're generating too much data to keep around, right?
You need to look at each sample once and say, this is going to contribute to my mean and variance, and I need to update my estimates. You can't replay the simulation. If you're thinking of the textbook method for mean and variance: you pass over the data set once and get a sum of the samples, then you divide the sum by the number of samples to get the arithmetic mean. And, this is very faint on the projector, but we have a box there, and in the second pass, you calculate the difference between each sample and the mean, and then you can calculate the variance by summing those squared differences. Now, if you're doing this on a stream of data that's too big to keep around and takes a week to generate, do you want to take two weeks just to get mean and variance? I didn't suggest this to my advisor at any point. So you don't want to do this; you need a way to do this in place. The textbook method works really well for data sets that are small. If you've never really thought about mean and variance at scale, this is what you think of, to the extent that you think of mean and variance at all. So I actually discovered at some point, sort of accidentally, that there's a way to do this for these kinds of data sets that you can't keep around, and it seemed like magic to me. My hope is that some of the other structures we'll talk about will seem like magic as well, and we'll see this one in action right now. And again, this is very faint, I apologize for that. But instead of thinking of our population as a set, as a collection that we can pass over as many times as we want, we're going to think about it as a stream. We'll examine one sample at a time, and after we examine a sample, we'll update an estimate. So here's our mean estimate after looking at one sample. We've seen one sample, so the mean is that sample, right? That's an easy case. Our variance is a little trickier: it's undefined. We need two samples for variance. But zero? I would take zero, too, I guess. When we look at the second sample, we're going to look at its difference from our current estimate of the mean. So we don't need to keep the whole population around; we just need an estimate of the mean. I'm representing that here as a blue line, and we'll update our estimate of the mean by splitting the difference, taking a weighted average of what we've already seen and what we're currently seeing. So we'll update it by half of the difference, and then we have an initial estimate of our variance. Now, variance has this interesting property that you can take the variance with respect to any number, right? The calculation is sort of invariant with respect to the mean, but the closer the number you pick is to the actual mean, the better your estimate will be. So we can actually just use our rolling mean estimate to update our variance estimate as well. And that was the thing that seemed like magic to me, because I didn't pay attention in that part of the first week of high school statistics. So as we go on, we keep updating our estimates; since we're doing a weighted average, we update by a third of the difference for the third sample.
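The update step being described here is essentially Welford's online algorithm. A minimal sketch in Python, with illustrative names rather than the notebook's actual code, might look like this:

```python
# A minimal sketch of the streaming mean/variance update described above
# (Welford's online algorithm). Names are illustrative, not the
# notebook's actual implementation.
class StreamSummary:
    def __init__(self):
        self.count = 0      # samples seen so far
        self.mean = 0.0     # running mean estimate
        self.m2 = 0.0       # running sum of squared differences from the mean

    def update(self, sample):
        self.count += 1
        delta = sample - self.mean
        # nudge the mean by 1/count of the difference (1/2 for the second
        # sample, 1/3 for the third, and so on)
        self.mean += delta / self.count
        # use the *updated* mean for the second difference; this is the
        # "variance with respect to a rolling estimate of the mean" trick
        self.m2 += delta * (sample - self.mean)

    def variance(self):
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")
```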
And you can see, again, if you squint, how the variance estimate changes over time, and how the mean estimate changes over time as well. At the end of the algorithm, we land on the actual arithmetic mean for the whole stream, so we get a precise and accurate estimate of that, and we land fairly close, but not exactly, at the variance by doing this in a streaming way. And the cool thing about the summary we get from this online algorithm is that we can combine it with other similar summaries. So if I have two streams that I want to combine, say I'm running two different programs and I want to get some average of metrics that I'm measuring on both of them, I can get a summary for one, a summary for the other, and combine them as weighted averages, the same way I would update my estimate incrementally in the case where I'm adding a single sample at a time. So mean and variance is cool. If you hadn't thought about it, it's neat to realize you can do this, but it's not why anyone's here, right? We don't really want to deep dive into mean and variance, do we? I'm just making sure that we're all on the same page. So this is cool, but the reason I'm showing it is partly that this is how I started thinking about this topic a long time ago, and partly to show some properties it has that all of the structures we'll talk about today share. The first is that it's incremental. You can run this algorithm on a stream; you only need to examine each sample once. If you're in one of those situations where you're running a simulation that takes a huge amount of real-world time and generates a ton of data, you don't want to keep those metrics around. You want to throw them away. So this is incremental: we can update our summary by looking at each new sample once, with a single pass over the data set. The second property is that this is a parallel algorithm, in the sense that if you have two summaries for subsets of the data, you can combine them to get a summary of their union. That's really valuable, because it means you can scale processing out. And finally, this technique is scalable, which means that whether you process one sample or one trillion samples, your summary is going to use the same amount of space. How many people here see the connection between being able to merge two of these structures and being able to compute them in parallel? If you can easily merge them, then it's possible to process the parts in parallel; if merging were difficult or impossible, you couldn't combine the partial results. Yeah, so we can compute independent summaries of the subsets and combine those summaries without losing a lot of information, right? It's the Spark, or before that the Hadoop, model: you compute one of these things for each piece of data residing on each physical node, then you combine the results, so it allows you to do the pieces in parallel and merge at the end to get a final result. So if you had a stream that you were processing in parallel, say with Kafka, where you have a topic of messages distributed across multiple machines, you could generate one of these summaries for each partition of your topic and then combine them together.
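That merge can be written down directly. Here's a hedged sketch using the StreamSummary class from the sketch above; the formula is the standard parallel mean/variance combination:

```python
def merge(a, b):
    """Combine two StreamSummary objects into a summary of the union
    of their streams (the standard parallel mean/variance combination)."""
    merged = StreamSummary()
    merged.count = a.count + b.count
    delta = b.mean - a.mean
    # weighted average of the two means
    merged.mean = a.mean + delta * b.count / merged.count
    # each m2, plus a correction for the distance between the two means
    merged.m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / merged.count
    return merged
```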
Also, if you're processing a stream, you could generate a summary for each day, right, and combine those together. That's another valuable option. But yeah, by being able to combine these, you can do it in parallel. The last property is that this is scalable: it doesn't matter whether we've processed a few samples or many samples. Again, this slide is very faint, but we see that we have a summary of fixed size. Before I started thinking about these kinds of problems, if you'd asked me what it means to be scalable: if I said I have something that grows quadratically with the number of things it processes, well, that doesn't scale. Cubic? No way. If you say linear, maybe. Logarithmic, now we're sounding pretty good, right? Logarithmic space, logarithmic time: if I say that's scalable, you're not going to argue with me, right? I'm going to argue. Okay, argue with me, then. Data keeps coming in, so at some point you will reach a limit. Exactly right. But in this case, if you can slice it, then I won't reach a limit besides the amount of money I have to buy new machines; the limit is basically the number of machines I'm able to buy. Yeah, and since we're talking about the amount of space the summary takes up, if you're saying, I want to process log records from a data center for an internet company forever, then logarithmic growth is unacceptable, because you're going to have to buy more machines just to store your summary. Marcel, you spend a lot of time thinking about processing metrics, so you have a leg up on this discussion. But in general, if you're not thinking about this kind of data processing, logarithmic time or space is fine, right? For these problems, though, we don't know how much data we're going to process. It might be so enormous that even something that grows logarithmically is unacceptable. We need something that's a fixed size. And the structures we're going to look at today let us choose trade-offs between how much constant space we're going to use and how much precision we want in answering questions. That's really, really powerful. And one of my hopes is that we can read the rest of the slides better than we can read this one; but my hope is that in looking at some of these techniques, you'll see some of the magic behind them and get inspired to try new things with data. So before we move on: if you're able to deal with a QR code, scan this and open it up on your computer, because that'll take you to an interactive notebook that we'll be using for the rest of the talk. If you're not able to deal with a QR code, just type in this URL. Yeah, if you have a Mac, you can scan it on your phone and AirDrop it to yourself. Especially if you're on the Wi-Fi. If only it were still possible to buy a Windows phone. All right, is anyone having trouble getting to this location here? All right, we'll be talking more about that in just a second. Here's what we're going to talk about today, though. We're going to look at a few different problems, starting with approximate set membership.
We'll cover the Bloom filter in some slides, and then we have a notebook where we can play with the Bloom filter a little bit and see how it works. The next thing we'll do is talk about counting events. Eric will introduce a data structure called the count-min sketch, and go through a notebook where we see how to use it to count distinct kinds of events in a stream. Then I'll cover a data structure called HyperLogLog, which enables us to get extremely precise cardinality estimates of very large sets with a very small amount of space. So imagine you're looking for unique search queries as a search engine, or you're thinking about unique visitors to a popular website: you need some way to say how many things are in a set without keeping the entire set around. We'll quickly look at a technique for scalable set similarity called MinHash, and then Eric will cover a really cool data structure called the t-digest, which produces accurate quantile estimates: not just mean and variance, but what's the median, what's the 90th percentile, what's the 99th percentile, and so on. So let's start with set membership, which is a really interesting problem. It's fundamental to a lot of other data processing problems, but it also has applications in systems, and we'll start with one of those. Let's say you're writing a caching proxy, and you default to caching any web page that someone requests. Makes sense, right? You store it in the cache, you have an eviction policy, and eventually things get expired out of the cache. As you continue to get requests, the size of your cache is going to grow, and you'll have more things in the cache. But web requests, like a lot of other things, have a really long tail: there are a lot of things that only get requested once. The page you get after you complete an online purchase, you're probably the only one who will ever land on that exact content, hopefully, and there's really no reason for it to be in the cache. So you're using up resources in the cache that you could more productively use for other things, until those entries eventually get evicted. Instead of caching everything that people ask for, we want a way to only cache things that someone has asked for more than once. So how do we know if someone has asked for something? Well, is it in the cache? But we only want to cache things that are going to be asked for more than once. We need a cache for our cache, a meta-cache. But we don't want to keep an explicit set. If we could keep a list of hashes of things, we could approximate it somehow, and that's actually kind of what this will look like. If we say we're only going to cache things that get requested more than once, we might have a way to use fewer resources in our cache and provide better performance. So what we'll do is maintain a set of things that we've been asked for, and we'll only put something in the cache if someone's asking for it and it's also in the set of things we've already been asked for. Now, we can't actually keep everything we've been asked for in that set, or else we're just using twice as much space as we would be by caching everything. So we need some structure that says: has this probably been asked for?
And that's what the structure we're going to look at does. The Bloom filter is a way to answer this question approximately: given a very large set of things, is something in the set? Can I add something to the set? And this isn't a made-up example. The Bloom filter is a very old data structure, almost 50 years old, but Akamai was able to use it around the turn of the century to really improve the performance of their content distribution network. So this is a real example. But before we look at how we'd solve this with a scalable data structure, let's see how we'd do it precisely. We want to maintain a set and see if something's in it. This may be going back several years for some of us, but let's think about how we'd represent a set. One way is an array: if we have a small number of things, we just keep an unsorted array and assume that the overhead of sorting the list is outweighed by just scanning through and looking for the thing. If we have a larger set of things that admit some kind of ordering, we could store it as a tree: search through the tree, and if we find the thing, it's in the set, and if we don't, it isn't. Another approach to representing a set precisely is to just use a hash table. Hash tables are super useful. We store keys, where the keys are the set elements, and the values are whatever we want them to be: one, true, it doesn't matter, we're just keeping track of the keys here. So recall how a hash table works. You have a put operation where you provide a key and a value. We have a hash function, which maps from an arbitrary value to a relatively small integer. That hash function is used to look up a bucket in an array, where we'll then put that value. And in this case, we're not putting just a value, but a list of values, because if we have two things that land on the same bucket, we'll need to put something else there. We'll see what that looks like in a second. Eventually we have a lot of other things in our hash table, and maybe we have an extremely bad hash function, never used outside a contrived example, where foo and bar hash to the same value. If we land in the same bucket when we want to put something in the hash table, we'll see that it's not already there, and we'll have to use some more space to handle the collision. In this case, we had a collision because we had a contrived example; but in general, the size of your hash table is not going to be as large as the space of keys you're putting into it, so you're going to have these collisions eventually if you put enough things in the hash table. This is review, I'm not surprising anyone so far, right? When we have one of these collisions, though, we pay a time penalty and a space penalty. We pay the time penalty because we have to check all the things in the list; we pay the space penalty because we have to add things to the end of the list; and it's bad for your cache, and it's just not great in a bunch of ways. If we want to compute scalably, we don't want the time or the space to increase. So the Bloom filter is a hash-based data structure that uses a fixed amount of space and constant time, and instead of those time and space penalties, it has a precision penalty.
So when your Bloom filter fills up, you'll get false positives. Let's see how it works. We want to handle large amounts of data in a fixed amount of space, with a constant amount of time per query. So we have a fixed number of buckets; we don't resize, and we don't have lists. The values we care about storing are either true or false, so we can use a vector of bits, which saves us some space over storing arbitrary keys. When we insert something into the Bloom filter, we use a hash function to look up what bucket we're going to put it in, and we set that bit to true. Notice that nowhere in the Bloom filter are we storing what we've actually put into it, right? We're just setting bits to true. So if we have a hash collision, we'll automatically get a false positive: if something else hashes to that bucket, we don't know if it's that thing or something else. The Bloom filter limits the likelihood of these false positives by using multiple hash functions. You may have a collision with one hash function, but you're less likely to have one with several hash functions. The way this works is: when you put something in, you set all of the buckets it hashes to to true, and when you check whether something is in the set, you return true if they're all true, and false if any of them is false. So if we update this again to insert bar, we have one collision out of three, but it doesn't really affect us, because the other two buckets have not collided. When we look up foo, and this is unfortunately too faint to see, we see that all of the buckets foo hashes to are true, and similarly with bar. If we look up something that's not in the filter, maybe it hashes to this bucket that's zero, and then we know definitively that we haven't put it in the filter. But we might have something that collides on all three of our hash functions, and in that case we'll get a false positive result: even though we haven't put blah into the filter, other things we've put in collided with all of the hash buckets we'd use for it. In practice, you'd use more than 16 bits for your Bloom filter, but the larger it is and the more hash functions you have, the less likely you are to get false positives, and we'll see how that looks in a minute. One cool aspect of this that's not going to be super obvious on this slide is that you can merge these together. If I have two filters, one of which I've inserted foo into and one of which I've inserted bar into, I can use bitwise OR to get the union of these two filters, combining them together. This is where we get the parallel aspect, and the combined filter is going to be equivalent to one we'd have constructed from all the things we're taking the union of. So you can estimate whether or not something's in the union of several sets, even if you can't keep those sets in memory. If you're using some scale-out compute platform, you can process multiple partitions of a data set, compute Bloom filters for each, and then combine them to get an estimate for an entire data set that's too big to process on a single machine. And if you're algebraically inclined, or if you just think about these kinds of things, you may be saying: well, I get something interesting with bitwise OR, could I get something interesting with bitwise AND? And the answer is you can.
And as you might imagine, the interesting thing you get with AND is an approximation of the intersection of two sets. In this case, if we have a Bloom filter that just contains foo and one that just contains bar, they have one bit in common, and if we look up either of these things, we'll see that it's not in the intersection. But in general, that intersection is going to have a higher false positive rate than if we had taken the intersection of the sets before constructing a Bloom filter. All right, so I'm going to... yes, yeah: eventually every bit will be set to true, and you'll get a false positive for everything. I'm going to pause there, because we're going to cover that in the notebook; I actually have a slide with a formula, but I'd rather cover it interactively so we can see a plot. Are there other questions about the Bloom filter so far? Yes. In general, there are people who have looked into ways to resize these, and in general, you can't. There are some approaches that work better than others, but because you don't know what's in each of these buckets, you'd need to maintain some additional information to try to resize, and you lose some precision, or you get false negatives as well as false positives. So picking the size is important, but we'll see how to choose good trade-offs for that in a bit. I'm going to skip ahead and talk about an application real quick. The first application, from the paper where Bloom introduced this structure, is actually a hyphenation dictionary. Hyphenation is complicated in general, right? I know that in many natural languages, you actually have to change the spelling of a word when you hyphenate it, and it's not always a simple rule to hyphenate something. So Bloom's example was that you have a dictionary for a natural language that has the rules for hyphenating words. Most of the words, say 90% of them, you can hyphenate with a heuristic and it'll be correct. But for that other 10%, you need to do something special: maybe you need to change the spelling of the word, or hyphenate in a way that's not obvious from the heuristic. Now, this is the part that's going to be hard for everyone in this room to imagine, but neither the dictionary nor the set of words that required special treatment would fit in main memory on the computer Bloom was using. So what do you do instead? You don't want to just keep the dictionary on disk, because then you'd have to hit the disk for every word just to see whether you needed to hit the disk for that word. The disk is so much slower than memory that you don't want to touch it if you don't have to. So the Bloom filter enables you to say: is this maybe something I'd have to hit the disk for, or definitely not? You can say maybe, or definitely not. And as long as your false positive rate isn't too high, you'll get much better performance by having this approximate structure to tell you whether something is something you care about. There are a lot of other really cool applications of Bloom filters involving distributed databases and a lot of systems topics. But I could talk about applications, or we could go to the notebook and play with it. What are we going to do?
It was a question. Thank you, Marcel. The check is in the mail. All right, so I'm going to pause this slideshow here and we'll go over to the notebook. I'm going to load that up, and we should get something that looks like this. It's making it unnecessarily difficult for me to move this onto the other screen. There we go. Okay, and if this doesn't load, I will just access the one that's running locally on my computer. Is anyone having trouble getting to that URL I sent out earlier? Just me. Okay, great. There we go. All right. So what we have here is a Jupyter notebook. Has everyone here used Jupyter notebooks before? Is anyone not used to Jupyter? We will explain what's going on in these, but basically this is a bunch of files that intersperse narration and Python code. I'm just going to click on this one that says bloom filters. If you want to see the mean and variance estimates, you can click on this top one, but we're not going to cover it today. There are two things you need to know about notebooks. The first is: press shift-enter on any cell to evaluate that cell and move to the next one. The second is that if you get really stuck, you can restart your notebook and clear the state of the Python interpreter behind it. This is running on shared infrastructure, and I'm currently tethered to my phone so that I'm not dependent on the Wi-Fi, so that's why it's taking a little longer to load for me than it may be for everyone else. So again, we have text, we have images, and we have code, and this is just an explanation of the Bloom filter. These notebooks are available; you can access this URL at any time, it's running on a public service, and you can go back and do whatever you want with these. We're going to build up a Bloom filter implementation, starting from a bit vector. So we'll go through, and this is just Python code. Is everyone comfortable reading Python code? Is anyone not comfortable reading Python code? Okay, so I'll explain what's going on, and if anything is not clear, if there's something unidiomatic I've done in this code, or if it's just new to you, please raise your hand and ask a question. That's what we're here for. So when we construct a new Bloom filter, we give it a size, which is the number of bits in the filter, and we give it a set of hashes. Hashes we can take either as a function that returns multiple hashes, or as a list of functions, each of which returns a single hash. The actual code for this winds up being pretty simple. In the insert method, you go over all of the hashes you have, and for every hash you set the hashed bucket to true when you insert something new into the Bloom filter, just like we did in the picture. The lookup method also winds up being pretty simple: we look up the bucket for each hash, and if any of them is false, we return false; if we get to the end, all of them are true, so we return true. And that's your basic Bloom filter right there. It's very cool. Yeah. What about the size? Yeah, so what I'm doing here is taking the hash value right there. You could have used size, but... I already created the buckets, so I'm just using... oh, hang on. There's a reason why I'm doing this. Yeah, maybe this is just that I did something dumb in the notebook.
Let's pick this up offline; I don't remember. So, the size of the bit vector is the size in bits, rather than the size in the integers you're actually representing the bit vector with. That's a great question, and I should have a better answer for it, since I wrote this code. All right. So we're in the Bloom filter notebook, and I'm not going to dive into hashing, but basically, if you need a bunch of hash functions, you can always use one hash function that returns a lot of bits and just take a few of the bits from a few different places, and that's just as good as having a bunch of different hash functions. Is everyone willing to believe that? I have a utility function that does just that. So we're creating a Bloom filter with 1024 buckets, and we're creating three 32-bit hash functions from a larger hash function. And then, just as you'd expect, if you insert something into the Bloom filter and you look it up, it's there, right? If you look up something that isn't there, well, in this case we don't have three false positives for this key, so it's not there. If we want to look at how the false positive rate changes over time, we can set up a simple experiment. This is one of the cool things about these notebook environments: you can track false positives over time and see what they wind up looking like. So I'm going to collect the false positive rate every 100 samples, and I'm just going to put a bunch of random bits into this thing. I'm not going to have too many collisions in the random bits, so if I generate 64 random bits, I'm probably not going to get the same 64 random bits twice in the amount of time I have to run this experiment; if I am, I need a new source of randomness. So if I find the bits that I'm generating already in the Bloom filter, I'm going to assume it's a false positive, because I'm going to assume that these 64 random bits are not something I'll generate twice in the course of this experiment. I'm going to do some stuff here in these cells to set up plotting, and then run the experiment, and we can see how the false positive rate increases as the number of unique values increases. I'm creating a Bloom filter with 4,096 buckets in this case, and I'm inserting 2 to the 18th values into it. This will take a minute to run on the shared infrastructure, and probably also a minute to download over my cell phone. But we can see that the false positive rate, as we talked about earlier, gradually climbs towards one, where eventually you're just seeing everything as already in your Bloom filter. So as the number of unique values increases, and we have a log scale here, we get an increasingly worse false positive rate. If we increase the size of the filter, we get better results. So let's run a different experiment with a much larger Bloom filter, four times as many buckets. While that's running, I'll just point out that we actually have a formula for estimating the expected false positive rate of a Bloom filter, based on the number of hash functions we have, the size of the filter, and the number of things we expect to put into it. You can see the false positive rate is much better, especially considering that before, the knee in the curve was around 10 to the third, and here it's around 10 to the fourth. So we can calculate our expected false positive rate with the formula I'm describing in this paragraph.
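For reference, the standard approximation being described, with k hash functions, m bits, and n inserted elements, is p ≈ (1 − e^(−kn/m))^k. A quick sketch of it as a function (illustrative, not the notebook's code):

```python
import math

def expected_fpr(k, m, n):
    """Standard approximation of a Bloom filter's false positive rate:
    k hash functions, m bits, n distinct elements inserted."""
    return (1.0 - math.exp(-k * n / m)) ** k

# e.g., three hash functions and 4096 buckets after 1000 insertions
print(expected_fpr(3, 4096, 1000))
```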
I don't want to dive deep into this; I just want to say that it's something you can do. If that formula doesn't make a lot of sense to you, and it doesn't intuitively tell me what to expect, you can plot it, right? So that's what I'll do: here I'll plot what this looks like for the same filter size I was looking at earlier. And as we can see, our expectation of the false positive rate winds up looking a lot like the actual one we got by running the experiment. So that's sort of cool. Here we just show the intersection and the union of Bloom filters. We have a slightly more involved implementation, and the only things we're adding are intersection, which just takes the intersection of the bit vectors, and union, which just takes the union of the bit vectors, so that we can run these in parallel. In the interest of getting to the other structures, I'm not going to spend a lot of time diving deep into those, but let's move on with the rest of the notebook. We can see the example from the slides: if we take the intersection of these two, we find that something that's in both of them is in the intersection, and similarly with the union. I talked about how the intersection of multiple Bloom filters can have a higher false positive rate than the Bloom filter of the intersection. In some of the applications you see for Bloom filters, especially if you're implementing one in hardware to support micro-architectural features, you really want a low false positive rate. So the partitioned Bloom filter is an interesting extension where you have one set of buckets for each hash function. So a collision isn't a collision between any pair of hash functions; it has to be a collision for the same hash function. Does that make sense? We can see what that looks like, but I don't want to dive into it too much, because it's the basis for the next structure that Eric's going to talk about. The partitioned Bloom filter has much better behavior under intersection. And I guess I didn't run this cell before running the one after it. If we look at the false positive rate under intersection in this example, the regular Bloom filter has a worse one than the partitioned Bloom filter. There's also this cool property that if any of the sets of buckets is empty, then you know the two sets don't intersect at all, which is something you can't get from the intersection of regular Bloom filters. So again, with the partitioned Bloom filter, you have one set of buckets for each hash function, and we see here that the false positive rate under intersection is, in fact, much better for the partitioned Bloom filter. We also see that the axis here is labeled in an unfortunate way. So there are some applications: obviously the hyphenation case study that we discussed, an application in databases, and some applications in exposing parallelism in hardware. This is a really cool data structure. It's incremental, scalable, and parallel. It's very simple: we have a reasonable implementation in about a screen of Python code, not counting the bit vector we're delegating to. And finally, it introduces this technique of hashing so that we can have a constant-space approximation of an arbitrarily large set of things.
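Putting the whole structure in one place: a minimal Bloom filter along the lines described here, with union and intersection, might look like this sketch (illustrative, not the notebook's actual code; a plain Python list stands in for a real bit vector, and `hashes` is assumed to be a list of functions mapping an object to an integer):

```python
# A minimal Bloom filter sketch along the lines described above.
class BloomFilter:
    def __init__(self, size, hashes):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size      # stand-in for a real bit vector

    def insert(self, obj):
        for h in self.hashes:
            self.bits[h(obj) % self.size] = True

    def lookup(self, obj):
        # true only if every hashed bucket is set: "maybe" or "definitely not"
        return all(self.bits[h(obj) % self.size] for h in self.hashes)

    def union(self, other):
        result = BloomFilter(self.size, self.hashes)
        result.bits = [a or b for a, b in zip(self.bits, other.bits)]
        return result

    def intersection(self, other):
        # note: a higher false-positive rate than building a filter
        # directly from the intersection of the original sets
        result = BloomFilter(self.size, self.hashes)
        result.bits = [a and b for a, b in zip(self.bits, other.bits)]
        return result
```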
So Eric is going to present a generalization of the Bloom filter for a different application now. We're going to get the mic working. Hopefully that's close enough. So, yeah, instead of just set membership, we're going to look at object frequencies next. Suppose you want to identify something like trending topics: what are the most popular hashtags currently coming over on Twitter or Instagram? Or say you're categorizing infrastructure log messages from a data center: which subsystems return the most errors? For these kinds of problems, calculating an exact answer could actually be prohibitive, for a lot of the same reasons that Will just described for the Bloom filter: the actual number of unique objects you're trying to keep track of is much too large. But again, if we're willing to tolerate approximate answers, it turns out we can get a useful answer in bounded space with a different kind of sketch. And as Will did before, we're going to first describe and then reject some precise data structures. If you just have a few elements, you can do something very similar to what Will did: store the objects with their counts directly in an array. Obviously that's not going to scale past a few things. With a tree structure that stores these ordered pairs, you can scale farther, but you're never going to scale past the available RAM on your machine. And you can see that these look very similar to the structures for set membership, except that we're storing counts of things instead of just a bit. Or we can invert that and say the Bloom filter is actually just like this, except you have one-bit numbers. So the next structure I'm going to talk about is called the count-min sketch, and it just generalizes the Bloom filter to hold counts instead of bits. So here's a diagram of a partitioned Bloom filter, except those are not single bits anymore. When we insert something into the sketch, we hash it with several functions, and for each one we increment the corresponding counter. So the first time you see this, it looks a lot like a Bloom filter. Now suppose your structure is populated with many counts, and you can actually see they're not just bits, they're integers. To look something up, we use the hash functions to find each of its buckets, and we take the minimum. And again, if you imagine those are one-bit numbers, the Bloom filter is taking a minimum as well; it's just that we usually call that bitwise AND. What's that? Oh, okay, I'm going to talk about this. The main thing, again, is that you can do this in a single slide of code; it's pretty simple to implement, and we'll see it in the notebook. Like the Bloom filter, you can define a merge operator, the analog of union, and get yourself a combined result; in this case, you combine them simply by summing the buckets instead of ORing them. Another trick you can do is take the inner product of two of these sketches: you sum the results of multiplying the individual buckets, so you can imagine it being very much like the vector inner product. Looks like it wants me to... I can step through this, if you want, to explain what this is useful for. Okay, so Will's going to be my meta-animator here.
And so what this gives you is an estimate of how many things were in both tables. What that's good for: suppose you're looking at log messages, and you're hashing on the individual words you saw in the log messages. If one of these sketches is only keeping track of subsystems, and the other is keeping track purely of severity strings, and I want to know how many things were actual system errors, that dot product gives me the estimate. In this case, it said I found seven log messages that were error severity in the LCD subsystem. Another thing you can do with these is keep track of the top K elements, like the top five or the top ten. Again, you can imagine doing trending topics on social media with this. So how do you get this? You combine a count-min sketch with a priority queue. You insert an element into the sketch, you get its frequency, in this case the estimate will be 20, and you take that value and insert it into the queue. Priority queues are basically good at keeping track of the top K values, so you can just keep filling up your queue. Possibly not... oh, there it goes. And now you can see that you have the top five keywords. Twitter alone gets something like half a billion tweets a day, and you can imagine you're never going to be able to keep track of all the unique hashtags; so if you combine the count-min sketch with a priority queue, you get, of course, the top K list. Now, these are going to be estimates: that 1.84 million is not an exact count, because we saw that the sketch is going to overestimate counts in some cases. But also, once you're into the millions, who cares? It just doesn't matter. So this is a great application for using a count-min sketch to get your top K: track trending topics so that you can drill down by geographic region. You can categorize by day of the week and combine them, to ask, say, what the trending topics are for the weekend, Saturday and Sunday. In a similar vein, suppose you were running a video streaming site, maybe you're Netflix, and you want to see views by geographic region. You can build a count-min sketch and a top-K structure for each of the 35 countries in a hemisphere, and take the union of all of them, or part of them, to build up a view of exactly the geographic region you want. And obviously, even within the Americas, tastes are not uniform: you could ask whether videos that are popular in Boston are also popular in Tulsa, or whether videos that are popular among Francophones in Quebec are popular among Lusophones in São Paulo. You can get approximate answers to all of these kinds of queries using the inner product trick we just described. You can even ask questions like: are videos that are popular in the northern hemisphere during the summer also popular in the southern hemisphere in the winter? So, we can go to the notebook. Will is hopefully doing the driving for me here, because it's his laptop. That was pretty fast. Again, I'm not going to go over this, because we've basically just set the context for the notebook. We're going to do some imports, the usual numeric packages. And here you've actually defined the count-min sketch itself.
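A count-min sketch along the lines Eric is describing, with lookup, merge, and inner product, might look roughly like this (an illustrative sketch, not the notebook's actual code; `hashes` is assumed to be a list of functions mapping an object to an integer):

```python
# An illustrative count-min sketch: one row of counters per hash function.
class CountMinSketch:
    def __init__(self, width, hashes):
        self.width = width
        self.hashes = hashes
        self.table = [[0] * width for _ in hashes]

    def insert(self, obj, count=1):
        # increment one counter in each row, like a partitioned Bloom
        # filter holding integers instead of bits
        for row, h in zip(self.table, self.hashes):
            row[h(obj) % self.width] += count

    def lookup(self, obj):
        # collisions only inflate counters, so the minimum over the rows
        # is the least-overestimated count
        return min(row[h(obj) % self.width]
                   for row, h in zip(self.table, self.hashes))

    def merge(self, other):
        # the analog of union: sum the buckets instead of ORing them
        merged = CountMinSketch(self.width, self.hashes)
        merged.table = [[a + b for a, b in zip(r1, r2)]
                        for r1, r2 in zip(self.table, other.table)]
        return merged

    def inner_product(self, other):
        # estimate of how many things were in both tables: take the
        # minimum of the per-row dot products
        return min(sum(a * b for a, b in zip(r1, r2))
                   for r1, r2 in zip(self.table, other.table))
```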
And again, you can see there's some bookkeeping for the object, but as we saw on that slide, the main lookup, merge, and insert fit on a single page, and a lot of that is actually comments. So we'll define that, and here we'll just declare a count-min sketch. If we try to look up something in an empty count-min sketch, we get a zero: that's an appealing property. And if you insert something and look it up, you get a frequency of one, also good. What do we have here? Hash collisions in count-min sketches can lead to overestimates of counts, not just over-enthusiasm about set membership. So we're going to define a function, very much like the one Will ran in the previous notebook, to see how this distortion can grow over time. Shift-enter to declare it, and we'll run it with a small parameterization and see how it does. And again, you can see there's kind of a knee in the curve. To remind you what this axis is: it's percent error. The actual error starts out not bad, but after a while, you start overestimating by quite a bit. Whoa, whoa, whoa. If we scroll up a few cells, it explains what's going on: this is the cumulative distribution of the factors by which we've overestimated counts. Oh, I see, I think. So of all the things in this sketch, all of them are overestimated by a factor of less than 40. So the factor is on the vertical axis? No, actually the error factor is the x-axis. Out here, you'd say I overestimated the count by a factor of 10, which is kind of a lot, actually, but that's because it wasn't a very large table and we gave it a lot of data. And again, the knee of the curve is maybe here; this is the cumulative distribution, and basically everything was less than a factor of 5 or so. Okay, so on the vertical axis, the probability of being overestimated? It's actually the fraction: the fraction of values we saw that were overestimated by less than or equal to a given factor. So we can see that 90% of them are overestimated by less than a factor of 10, and over half are overestimated by less than a factor of 3. Whether or not a factor of 3 is a real problem is, of course, domain dependent, but generally you might want to declare a larger table, because you don't actually want to overestimate by that much. So here we'll declare a larger table and see if we can fix that. You can see that the x-axis has changed substantially: now 90% of things are overestimated by less than a factor of 5, and over half by dramatically less; it's much better. And you can make the table larger, obviously, and make this as good as you want. Again, as we discussed earlier, you pick the trade-offs. There's a section here of exercises you can try on your own time, and a slightly larger table, again basically showing that you can keep making this better and better as your needs require. So, some ideas. As we've talked about, the count-min sketch is a biased estimator. You could experiment with techniques to adjust the estimates and try to correct the bias; there are papers written about this, and you can try to come up with your own.
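One of those exercises, pairing the sketch with a priority queue to track the top K as described on the slides, might look roughly like this (an illustrative sketch using Python's heapq with the CountMinSketch sketch above; names and structure are made up for illustration):

```python
import heapq

# Illustrative top-K tracker built on the CountMinSketch sketch above.
class TopK:
    def __init__(self, k, sketch):
        self.k = k
        self.sketch = sketch
        self.heap = []          # min-heap of (estimated count, item)
        self.items = set()      # items currently in the heap

    def update(self, item):
        self.sketch.insert(item)
        estimate = self.sketch.lookup(item)
        if item in self.items:
            # refresh the item's entry with its new estimate
            self.heap = [(c, i) for c, i in self.heap if i != item]
            heapq.heapify(self.heap)
            heapq.heappush(self.heap, (estimate, item))
        elif len(self.heap) < self.k:
            heapq.heappush(self.heap, (estimate, item))
            self.items.add(item)
        elif estimate > self.heap[0][0]:
            # evict the current smallest of the top K
            _, evicted = heapq.heapreplace(self.heap, (estimate, item))
            self.items.discard(evicted)
            self.items.add(item)

    def top(self):
        # highest estimated counts first
        return sorted(self.heap, reverse=True)
```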
When you pair it with a priority queue, the count-min sketch can be used to track the top K, as we described, so you could try building out a complete top-K notebook with a priority queue package, along the lines of the sketch above. And then there are some clever problem-solving puzzles: could you handle negative values when you're inserting? And how could you use the minimum, in something like the way the Bloom filter uses bitwise AND for set intersection? And with that, we will move on to the next sketch. All right, so this is the problem of counting distinct items. This is where you have a set that's so large that you can't keep the set around, and you want to know how many things are in it. If you just wanted to count a multiset, that's really easy, right? You keep an integer; it can get very large. But if you want to count the number of distinct things, that's a little more interesting. Again, the precise approaches maintain an explicit data structure and see how big it is, but those don't do us a lot of good when we have to deal with something that's too big for an explicit data structure. You can actually estimate cardinality with a Bloom filter. I'm just going to tell you that this is possible; you can search the internet for a paper by Swamidass and Baldi, but I don't think you should care, because we're about to learn a better technique. It is, though, one more way the Bloom filter is a data structure that punches above its weight: you can use it for almost anything. The technique I'm going to talk about is called HyperLogLog, and we're going to focus on the intuition behind it, because the intuitions are a little less obvious than they are for the first two structures: it's easy to think about hashing, and it's easy to think about the minimum. My hope is that after this part of the talk, and the notebook for HyperLogLog, we'll have a better idea of how it's working. It's a very cool technique. So let's say you flip a fair coin, and it lands with the reverse facing up. What do we call this side of the coin in Czech? The face? This is the face. Panna? Yes, panna, okay. This is ten crowns; it has a picture of Brno on it. Come on, guys. Okay. It's not surprising, right? We get the picture; we're not surprised. We get four of them in a row: are we surprised yet? Maybe surprised, but not shocked, right? It's a one-in-16 chance. We get 64 in a row? I'm reaching for my wallet at this point, right? That's not a fair coin. It's extremely unlikely: a one in 18 quintillion chance of getting that result. So what do coin flips have to do with cardinality? Well, we can think of a sequence of coin flips as a binary number. Let's say we have a source of uniformly distributed n-bit numbers. Each of the bits in these numbers is like a coin flip. The bits are all independent, because the numbers are uniformly distributed; each one is independent of every other, and equally likely to be one or zero. So the probability of seeing a uniformly distributed number that begins with n zeros is one in 2 to the n, just like the probability of getting n pannas (that's probably not the right plural) in a row. So if we see a number with n zeros at the beginning, we can estimate that we're likely to have seen 2 to the n numbers overall.
And the reason why this is the case... let's go to another cumulative distribution, shall we? Have we had enough cumulative distributions today? I don't think we have. We definitely have not had enough cumulative distributions. If you've had enough cumulative distributions, you're in for disappointment. So, if you think about the space of all possible numbers, every time you add another zero to the beginning of one of those binary numbers, you're cutting the number of candidates you have left in half, because each bit has two options, zero or one, and every time you require a zero instead of a zero-or-one, you cut the space in half. So if we look at these uniformly distributed numbers, and we look at the cumulative distribution of the ones that have a certain number of leading zeros, we see that every time we add another leading zero, we cut the number of things that are left in half. So we actually have a technique for estimating the number of distinct uniformly distributed random numbers we've seen, by keeping track of the largest number of leading zeros we've seen. Cool story, bro. Right? Yeah. What is it good for? Do you want to count numbers? Like, the next time I'm at a party and I want to impress a crowd, I'm going to say, well, how many numbers do you think you've seen? No. Counting numbers is not that interesting, right? We want to count IP addresses, we want to count unique search queries, we want to count arbitrary objects, we want to count anything. If only we had a way to convert arbitrary objects into uniformly distributed random numbers. If we had a function that mapped from arbitrary objects to uniformly distributed numbers, then we could do this, right? Can we think of such a function? A hash. Yeah, we have hashes, right? If a hash is good, changing any bit of the input is equally likely to change any bit of the output. If a hash is good, the bits are going to be independent. So we can hash arbitrary objects to turn them into n-bit numbers, and then estimate how many things we've seen by keeping track of the most leading zeros we've seen. It's true. The problem with doing this, just as we talked about, is that if you only count the zeros, you have extremely high variance, because every additional leading zero doubles your estimated count. You don't want a technique for estimating the number of distinct elements in a set that only gives you powers of two. That's not a good result, right? So we can use a technique to smooth out that high variance, and the technique we're going to use is to take a bunch of these estimates and average them out. Now, because each of these is effectively a rate, we're not going to use an arithmetic mean, but we'll get to that in a second. What I'm going to do is take the first few bits of the hash value and use those to select one of a set of counters; then I'm going to take the number of leading zeros in the remaining bits and update that counter, keeping track of the maximum number of leading zeros I've seen there. So now I've seen one thing, and I've updated one counter with two leading zeros.
As this goes on, I'm going to update more and more of these counters; here I have zero leading zeros, so I don't update anything at all, and eventually it looks more like this. What I'm going to do is take the harmonic mean of two to the power of each of these counts, instead of just taking two to the power of one count, which would only give me estimates that are powers of two, and I get an answer of roughly 16. The quick one-sentence reason why we use a harmonic mean is that we're essentially talking about rates here, and the harmonic mean is appropriate for rates. If you see someone advertising something to make your programs run faster, and they give you the arithmetic mean of the speed-ups they offer, you should check to make sure that your coin isn't landing with the picture of Brno on top every time. Reach for your wallet. So the cool thing about these sketches is that we can combine them, just as we did with the count-min sketch or the Bloom filter buckets, and these are also very, very straightforward to code. I think I actually just want to go back to the notebook; how do we feel about that? All right, so we'll go back to the home page and click 03-hyperloglog, and you can actually see what the code looks like. I'm just going to say run-all-cells so that I can talk about this. Basically, the hardest part of this code is doing bit manipulation in Python and counting the number of leading zeros, right? Beyond that, it's pretty simple. If I look at the number of leading zeros in a set of 32-bit random integers, I can see that nearly all of them have fewer than 12 leading zeros; I only generated 4,096 numbers there. I'm again going to use some tricks to get some hashes, but the actual HyperLogLog code itself is pretty simple. I have this collection of registers, which are just these counters of the most leading zeros I've seen, and I combine the counts from each register with the harmonic mean, for which I'm importing someone else's implementation. So I create one of these, I put 20,000 random 64-bit integers into it, hoping again that my source of randomness is not giving me duplicates, and my estimate is 24,000, which is not awesome, but it's not bad either, right? If I gave it way more than this, which we're not going to sit around and wait for, because it's running in Python on someone else's computer, we would get a relatively closer result as well. And we can see how we can add these together; the exercise to try at home is to do this yourself and convince yourself, by making an intuitive argument, that merging works the same way, which I would really recommend trying. And we can actually see this again: we get a better count the next time. We can try 200,000 and let that run... that's quite an overestimate, but anyway. You can read the paper "HyperLogLog in Practice," which details how Google took this algorithm and tuned it to be really useful for the cardinality problems they have at scale.
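A bare-bones HyperLogLog along these lines might look like this sketch (illustrative only, not the notebook's code; real implementations, like those described in "HyperLogLog in Practice," add bias corrections that this omits, and the 0.72 constant is only a rough stand-in for the paper's alpha):

```python
# A bare-bones HyperLogLog sketch, with none of the bias corrections a
# production implementation would have.
class HyperLogLog:
    def __init__(self, bucket_bits, hash_bits=64):
        self.bucket_bits = bucket_bits
        self.hash_bits = hash_bits
        self.registers = [0] * (1 << bucket_bits)

    def _leading_zeros(self, value, width):
        # count leading zeros in a width-bit value
        count = 0
        for i in range(width - 1, -1, -1):
            if value & (1 << i):
                break
            count += 1
        return count

    def insert(self, hashed):
        # the first few bits of the hash select a register...
        idx = hashed >> (self.hash_bits - self.bucket_bits)
        # ...and the rest contribute a leading-zero count
        rest_width = self.hash_bits - self.bucket_bits
        rest = hashed & ((1 << rest_width) - 1)
        zeros = self._leading_zeros(rest, rest_width)
        self.registers[idx] = max(self.registers[idx], zeros + 1)

    def estimate(self):
        m = len(self.registers)
        # harmonic mean of 2**register over all registers, scaled by m;
        # 0.72 roughly approximates the alpha constant from the paper
        harmonic = m / sum(2.0 ** -r for r in self.registers)
        return 0.72 * m * harmonic

    def merge(self, other):
        # combining two sketches: take the register-wise maximum
        merged = HyperLogLog(self.bucket_bits, self.hash_bits)
        merged.registers = [max(a, b) for a, b in
                            zip(self.registers, other.registers)]
        return merged
```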
All right, I have one more thing I want to touch on, and then I'm going to hand it over to Eric to talk about the t-digest, which is a really cool data structure for estimating approximate cumulative distributions. Yes, I'm sorry, more cumulative distributions; we can tell the cumulative distribution superfans are still here. So let's look briefly at a problem called set similarity, and I want to motivate it with a particular application area: finding similar documents. I taught undergraduate computer science quite a bit in a past life, and one of the real frustrations of teaching undergraduate programming is that you have to come up with a new project every semester to stop people from rampantly cheating on your assignments. Plagiarism is a big problem, and it's a big problem in the humanities as well. Sometimes students don't attribute properly; sometimes they just cut and paste and represent someone else's work as their own, on purpose. But let's say you're someone whose job it is to decide whether something is uncannily similar to something else. (An audience question about the hosted environment: the Binder should disappear after some period of inactivity, an hour I think; it's basically a lease. If you go to the original URL again you'll get a fresh one, and there's also a GitHub repo you can run locally if you want. Good question, and I should have called that out.) So let's say you're a literary agent and you get a new manuscript that starts with a catchy pair of sentences, but it's a little clunky, and you think you may have seen it somewhere before. Does this look familiar to anyone? It should, because it's a lightly edited copy of a famous opening pair of sentences; but if you didn't already know that, you'd have a hard time figuring it out. Plagiarism detection is super interesting, both in human languages and in programming languages, but there are a lot of other related problems, like saying which news articles are similar, or which news articles should be grouped together for a search query. If we wanted to solve this problem precisely, we could certainly represent documents as sets of words, and then take what's called the Jaccard index of two sets. The Jaccard index is just a measure of set similarity, and it's very simple: it's the size of the intersection divided by the size of the union. In this case, for example, these two sets are three tenths similar, so not particularly similar. We could use this with sets of words, or with any number of representations, to find similar documents. But the problem is that even though this technique is very easy, maintaining those explicit set representations is very expensive. I don't want to keep a document around and also an explicit set of everything in that document, and I certainly don't want to run this linear-time operation for every pair of documents I might ever care about. If I keep the set of all five-character substrings of a large document, that adds up, and if I have ten million documents, I'd have to calculate and sort 50 trillion Jaccard indices to find the most similar documents. I don't want to deal with that; no one wants to deal with that. So let's go to the notebook and see how this MinHash technique works, very quickly. We'll go home again and open the minhash notebook; this is just a quick flavor, because we want to move on to the t-digest pretty quickly. Basically, the idea is that you solve the problem of the Jaccard index taking linear time by calculating a short signature for each set, and you solve the problem of doing all the pairwise comparisons with a technique called locality-sensitive hashing, so that you have a way to filter down and only compare the subset of documents most likely to be relevant.
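Here is a minimal MinHash illustration (my own toy, with assumed helpers, not the notebook's code): each of K hash functions contributes the minimum value it takes over a set, and the fraction of signature positions where two sets agree estimates their Jaccard index.

```python
import random

random.seed(42)
K = 128
# crude stand-in for K independent hash functions: XOR with random masks
MASKS = [random.getrandbits(64) for _ in range(K)]

def signature(items):
    return [min(hash(x) ^ mask for x in items) for mask in MASKS]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set("it was the best of times it was the worst of times".split())
b = set("it was the best of times it was the blurst of times".split())
exact = len(a & b) / len(a | b)   # 6/8 = 0.75 for these two word sets
print(exact, estimated_jaccard(signature(a), signature(b)))
```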
So I really want to turn this over to Eric so we can talk about the t-digest, but if you're interested in this technique, this notebook is again very simple; it fits mostly on a single slide, if not exactly in 80 characters. It shows how we can test how well the implementation works and then use locality-sensitive hashing so we don't have to do pairwise comparisons, solving both sides of the scalability problem with set similarity. There's also a link to a free online book chapter where you can learn almost anything you'd want to know. And now we defer to more cumulative distributions.

Excellent, thank you so much. Now I need to get out of, someday I'll stop being confounded by full-screen mode, okay, this thing is playing. Beyond the individual applications and sketches today, the question you could ask is: why sketching at all? Data science dog here is also curious. As we've seen, you can create a representation of your data that is much, much smaller and much, much faster to manipulate, and at the same time it preserves the essential features of your data, or whatever is essential to the application you have at hand. I also hope that you come away with the intuition that we're all data sketchers. As Will talked about, if you've ever taken the mean and variance of some data, you've done data sketching. If you cluster your data, the cluster centers are a sketch of the data you have. And if you do machine learning, there's even a theorem that learning is a kind of data compression, so if you train a model you're essentially sketching the data in your feature vectors and training samples. Data sketching is really all around us. So, the last sketch we're going to talk about is the t-digest. It was introduced by Ted Dunning and Otmar Ertl, and the title of the paper is up there if you want to look it up. It has implementations in a whole bunch of popular languages, now including Scala, which I wrote, and there are also libraries that integrate these things with Spark and PySpark, so you can use them as user-defined aggregators in Spark and do these sketches there. So what exactly is this thing sketching? If you have a stream of numbers coming from somewhere, any kind of numbers, it gives you an estimate of the cumulative distribution function of the data it saw. So this little chapter is all about estimating cumulative distributions and the things you might do with that. What is a cumulative distribution, or CDF? If your data has some kind of density, which is of course the probability density function, the usual hump of mass, then the cumulative distribution at a point is basically the probability mass to the left of that point. In this picture, the black dot represents the area of that little tail back there. Because it's always the measure of everything to the left, it's monotonically increasing: as you move x to the right over the distribution of your data, the area just keeps going up and up, and eventually you get to one, and after that nothing changes, because you've encompassed everything.
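The empirical version of that definition is easy to state in code (plain numpy on a toy sample, not a sketch yet): F(x) is just the fraction of the data at or below x.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.sort(rng.gamma(shape=2.0, scale=3.0, size=100_000))

def ecdf(x):
    # fraction of samples <= x, via binary search on the sorted data
    return np.searchsorted(data, x, side="right") / len(data)

print(ecdf(7.5))   # the probability mass to the left of 7.5
```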
Now, those of you who are savvy might be asking: it's nice that I have an estimate of my cumulative distribution function, but what can I actually do with it? Well, it turns out, and this comes right off the fundamental theorem of calculus, which I know you all remember, that if you have the cumulative distribution function and take its gradient, you get back the density. So even if you just have a sketch of your CDF, you also have a way to get at the density function, and there are ways you can actually do that. So what are some properties of this thing? Like most of the sketches we've seen, it supports incremental updates: as new data comes in, you get an updated t-digest that represents having seen that piece of data. Why does that help? Again, most of these applications are designed to work with very, very large data, or streaming data coming at you, and this lets you maintain a compact running sketch of it in essentially constant space. So what's the payoff? Imagine you've got a REST service running, these days on Kubernetes or OpenShift, and you're measuring query latencies. Your users are hitting your system, and you're interested in knowing what kind of latencies they're experiencing. Of course, if you're somebody like Netflix, you're getting a massive, basically never-ending stream of queries and a corresponding stream of latencies, and you might be interested in their distribution. So suppose you take a sketch and you get your CDF; what kinds of questions can you actually answer about your data? You can ask service-level-agreement questions: hey, are 90% of my latencies under one second, or whatever you want to provide for your users. Interestingly enough, you might also want to simulate latencies, and I'll show you how you can do clever things with these sketches to not just describe your data but simulate data with the same distribution, randomly. What's another payoff? Back in the day, data was a monolithic thing that lived on one system; data was small, and it was very easy to take that data and generate a sketch. These days, of course, we have very large data sets that reside across many systems; here you can see a Kubernetes cluster running on Apple IIe's. It's easy enough to see how you can take a sketch of the distribution for each of your partitions, but now you've got a bunch of these things. What can you do? Can you merge them, like we merged all these other sketches? To answer that question, I have to tell a different story: how do these things actually represent a sketch of a distribution? Under the hood, all it is is a list of clusters: a location in the space of the numbers you're representing, and a mass that represents, roughly, how much data landed near that cluster. You'll see me use x and m as notation for those. Where the slope of the CDF is small, out in the tails, the clusters are farther apart; in the bulk near the mode of your distribution, where the slope is steep and things are denser, you've got more clusters and they carry more mass. That's really all the intuition you need. When you picture this thing, just think of an object that maintains a list of these clusters, which are nothing but number and mass (frequency) pairs. So let's go back to the question: you have data living in some partitions, you can sketch each partition, and what you've actually got is a bunch of these cluster lists.
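To fix the mental picture, here is a deliberately naive toy of such a cluster list (my own simplification, not Dunning and Ertl's actual merging rules; the real t-digest bounds each cluster's mass according to its quantile, which is what keeps the tails fine-grained):

```python
MAX_CLUSTERS = 50
clusters = []   # list of [center, mass] pairs

def update(x):
    if clusters and len(clusters) >= MAX_CLUSTERS:
        # at capacity: fold the point into the nearest cluster,
        # moving its center by a mass-weighted step
        c = min(clusters, key=lambda c: abs(c[0] - x))
        c[1] += 1.0
        c[0] += (x - c[0]) / c[1]
    else:
        clusters.append([x, 1.0])
```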
So the way you merge them is that you can simply take the clusters from one digest and run the update logic on each of them against the other one. I'm massively waving my hands at how that update logic works, because we deliberately decided ahead of time that we didn't have time to talk much about it, but if you're curious about the real details, there's code in the notebook you can study. I'll warn you ahead of time: I think there are still some bugs living in that code. I also gave a talk somewhere in Boston, and a similar one at Spark Summit a year or two ago, where I do nothing but talk about the t-digest, so if that's something you'd like to drill down on, I encourage you to check out those talks, or I can tell you more offline, or you can just Google my name and t-digest and you'll probably find them. The main point is that, like all these other sketches, you can merge them. And I'll talk about one last clever thing you can do. There's a trick you can do with any cumulative distribution function, called inverse transform sampling, which is a fancy-sounding name for something that's not that hard. The range of any CDF is of course from zero to one, because it's just summing up cumulative probabilities: nothing less than zero, nothing greater than one. So on the y-axis you can take a uniform sample between zero and one; that orange dot there landed somewhere, and on the curve you can find the value of x that corresponds to that point. This is why it's called inverse transform sampling: you're taking the function's inverse, and that value is equivalent to a value randomly drawn from the actual distribution. So as long as you can generate uniformly distributed numbers, which of course you can do easily on any system these days, you can simulate the distribution of your data using this trick. And again, the packages I mentioned give you this ability; you don't have to figure out how to do it yourself. The main point is that these models are not just descriptive, and that can be a very powerful thing; I gave another talk about what you can do with that. So we have ways to sketch data, and now we can turn the crank in the reverse direction and simulate our data from the sketch. And that, I believe, is all I wanted to talk about on the slides. (In answer to an audience question:) You can, with the caveat that, generally speaking, these can only generate the marginal distributions. You cannot use this to simulate the joint distribution of an n-dimensional vector, but you can easily simulate all the marginals: you generate a sketch for every single feature column and then simulate from each.
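Inverse transform sampling is short enough to show directly; here the sketch's CDF is stood in for by a sorted sample and linear interpolation (assumed stand-ins, not the talk's t-digest code):

```python
import numpy as np

rng = np.random.default_rng(1)
sample = np.sort(rng.gamma(2.0, 3.0, size=10_000))
levels = np.arange(1, len(sample) + 1) / len(sample)   # empirical CDF values

def simulate(n):
    u = rng.uniform(0.0, 1.0, size=n)
    # invert the CDF: for each u, find the x with F(x) approximately u
    return np.interp(u, levels, sample)

print(simulate(10))   # ten values drawn from (approximately) the data's distribution
```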
So we'll dive right into the notebook; we have ten minutes, I can do it. Because we're running out of time, I'll do the thing where we just run everything. As with all of these, we've got an implementation of the t-digest right inside the notebook. Like I said, I do believe there are some bugs in this, but it'll definitely give you the idea. It's less simple than the ones we showed before; actually getting all the bookkeeping details right was a little trickier, and of all of these, this is probably the toughest one to get working on your own. Judging by the amount of scrolling I have to do, it's not the most compact implementation, and yet it fits in a notebook, so it's not thousands of lines of code. I've defined functions for visualizing; one of them basically plots the CDF from a t-digest. So here we're doing the same thing we usually do: take a t-digest, and note that you can set the compression on these things, which is again how you trade size against fidelity in these sketches. Here's my distribution, 100,000 samples. I use a gamma because it has a shape like the kind you get when measuring query latencies and things like that. So first, visualizing: here's my CDF, and you can see that if you eyeball the median, about 50% of these latencies are around 7.5, which is really not spectacular performance from a query-latency perspective, but you get the idea, and basically all of them are less than or equal to 25. And you can get quantiles. Here I'm asking for the median, the 0.5 quantile: it's about 7.5. The 90th percentile is around 11, and if I ask where almost everything is, it comes out to: almost everything is less than or equal to 20. So here's inverse transform sampling: I take a uniform draw and then take the inverse of the CDF at that value. In this case I wrote it so it returns a batch, so I ask for ten, and here are ten samples simulated from the sketch. So if I do this, how well does it actually work? I'll use it to generate a large sample, and you can see it looks exactly like the real data, so it's actually a very good simulation. And that was all I wanted to talk about, with five minutes to spare.

I'd like to echo and emphasize something Eric just pointed out: if you have a phenomenon you can observe, and it generates a single metric every time you observe it, you can use this to faithfully simulate something that looks like that phenomenon. Not to put too fine a point on it: that's really cool.
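The quantile queries from that demo are the same inversion evaluated at fixed levels; with the sorted-sample stand-in from above (and made-up gamma parameters, so the numbers won't match the talk's):

```python
import numpy as np

rng = np.random.default_rng(2)
sample = np.sort(rng.gamma(2.0, 3.0, size=100_000))
levels = np.arange(1, len(sample) + 1) / len(sample)

for q in (0.5, 0.9, 0.99):
    # the q-th quantile is the x where the CDF crosses q
    print(q, np.interp(q, levels, sample))
```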
Yes, and you can do clever things with it. For example, if you've ever tried to do random forest clustering, one of the things you do is take your feature columns and randomize them. If your data fits in RAM, it's easy to randomize it, but once you start scaling out, how do you do that? It's all but impossible. But if you do what I told you, sketch each of your columns and then start drawing random numbers from them, it turns out you can scale out random forest clustering. So this is a technique for doing true scale-out random forest clustering, which isn't as popular as it used to be, but you can do cool stuff with this. I think most people care about quantiles, but you can also do anomaly detection. If you sketch the distribution and start testing incoming data against it, and you're getting a whole lot of values that show up at the 0.999 level many times in a row, it's like Will's fair coin flip: that's suspicious. You shouldn't be seeing tons and tons of data way out at the extremes, in either tail. So it's also a nice tool for quick-and-dirty anomaly detection. (Audience: how does it compare to an ordinary approach, say if our data is small enough that we could afford a traditional one?) I think it actually compares pretty favorably. It's still smaller, and because you can do scale-out parallelism even when you're not across multiple machines, you can parallelize the collection; even on one laptop I've got eight cores. (Audience: are there accuracy numbers in the paper?) There are, yeah, on exactly how much fidelity you get. In my testing it's pretty good, especially if you keep the compression relatively low. That plot is the sketched cumulative distribution; if I plotted the actual CDF of the gamma, it would look just like it. If you did a Kolmogorov-Smirnov test, well, the test would probably say they're the same, because you don't have a large enough sample to distinguish them, but if you just take the D value, the D value is small, something like 0.1; it doesn't differ much from the real one, and it's very small at any point on the curve. Any other questions? (Audience: when do we get to work on the notebooks?) The notebook is always going to be there; you can always go to the URL, experiment with this stuff, find bugs in the stuff that has bugs, and send a pull request, though you don't have to fix them. I also have an implementation of this that I have more confidence in, because it's been unit tested and I've spent a lot more time on it. (Audience: the first part sounded quite similar to a generalization of the Bloom filter; why is it called the count-min sketch?) I don't know if there's a reason; the naming of things is rarely rational. It's a good question. I think I may have heard the phrase 'generalized Bloom filter' before, but the folks who wrote the paper decided to call it the count-min sketch. All right, thank you very much for coming, and thank you very much for your help.