Okay, so yeah, I'm going to be talking about probabilistic structures for scalable computing, but I'm always interested in why people get interested in different techniques. This talk is about big data, but I want to tell you about the problem I ran into where I realized that big data was interesting, the problem where I actually needed big data in order to get work done, right? Like, we're talking about scalable computing today. How many people think of a scalable algorithm as being something that runs in linear time? Anyone? How many people think logarithmic time, that once you're less than linear, you're scalable? Yeah, that's pretty good, right? So I'm talking about techniques today where linear and logarithmic aren't good enough, right? And the problem where I ran into these first was in calculating mean and variance estimates. So you probably already know how to calculate mean and variance estimates for small data, right? There's a textbook technique that works really well. And basically the idea is you have a bunch of samples and you sum those samples and then you divide by the number of samples and you get a mean. So we're representing the mean here by this light gray box behind each of these samples. When we want to calculate the variance, we look at the difference between each sample and the mean, and whether that's positive or negative winds up not mattering because we square those differences, we sum them, and then we divide by one less than the number of samples. So we get a variance there and that's great. But in a lot of situations, you actually can't make that second pass over the data set. And the problem where I ran into this the first time was dealing with architectural simulation for computers. Has anybody here done any work with computer architecture before? So computer architects are always interested in evaluating the impact of decisions like when do we keep things in cache, or how do we predict whether or not a program is going to take a branch. I was designing compiler optimizations, but I wanted to run an architectural simulation and look at how a simulated processor was dealing with a program. Now when I was doing this kind of work, these simulations would take days of wall-clock time to simulate seconds of program execution. So I wasn't just going to replay the simulation over again so I could get a variance on my estimate, right? I also wasn't going to be able to keep the data around from the simulation. I needed to be able to process things online. And I had never really thought about mean and variance estimates except to think about the textbook method, right? So when I saw that there was a way to do these things in place, it really seemed like magic. What it looks like is this. Instead of examining all the samples at once, you examine one sample at a time. You look at that first sample and, well, you've seen one thing, right? So that's your mean. It's really easy to estimate your mean when you only have one thing, right? That one sample is the average. When we look at the second thing (and we don't have a variance yet, because it's undefined), we look at how different it is from our running estimate of the mean, and we sort of take a weighted average. We say, well, I've seen two things. The difference between what I thought the average was and the thing that I'm seeing now is that blue bar. So I'm going to divide that blue bar by two and add it to my estimate of the mean. So it looks like that.
And once we have two samples, our variance is defined, so we have a variance estimate as well. And this continues on. With this next one, we'll take a third of the difference between our estimate of the mean and the sample we're looking at, and so on. And if we look at it, we can see how it goes through the whole data set over time, right? We have these red and blue bars to show whether we're adjusting the mean estimate up or down and how much we're adjusting it by. And those dark gray bars on the bottom, which are not to the same scale as the bars on the top, are our estimate of the variance. The light gray boxes behind the mean estimate on top and the variance estimate on the bottom are the final result. So we can see that at the end of our run, we've actually gotten to the actual mean for the whole population when we move it down by that blue amount. And our variance is actually very close to the actual variance for the whole population as well. So there's a way to do this online, and that was really cool to me. It seemed like magic. So you may be saying, well, this is not a data structure, it's just an algorithm; it's not for scalable computing, it's just for analyzing data. Well, the interesting thing about this technique is that it has a few properties that all of the other techniques we're going to look at today have in common, a few really important properties. The first one is that it's incremental. You can take the samples one at a time and update the estimate. But it also has a really cool property that you can take estimates for subsets of your stream and add them together. So if I have half of my data with one estimate and half of my data with another, I can add those estimates together and get an estimate for the whole sample. So this is a really important property for processing big data, along with being able to process things one at a time. So this first property, that we have an incremental estimate, means we can update by looking at one sample at a time without replaying things. That's really important. The second property, that we have a parallel estimate, means that if we take two subsets of our data and combine them together, we can get an estimate for the combined data by combining the estimates we had for each subset. And then the third property that this technique has, and that all of our techniques are going to have today, is that it's scalable. Whether we're examining one sample or one trillion samples, our estimate is going to be the same size. We're going to have a compact summary. So linear isn't scalable enough, logarithmic isn't scalable enough, but constant is scalable enough. We need a way to say our summary is not going to grow at all, so that we can process really a lot of data at once. We'll have to choose between how much precision we want and how much space we want to use, but we'll have those opportunities. So I hope that in the rest of this talk, you'll learn about some structures that are new to you and learn about some new ways to use structures that are not new to you. And I hope that some of these will seem a little bit like magic to you if you haven't thought about them yet. So I don't want to spend a lot of time diving into code here, but I do want to emphasize that the basic code for this technique fits on a single slide. We're not talking about any total wizardry here. We're talking about things where the basic implementations are really simple, things you could write in Python.
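To give a flavor of what that single slide looks like, here's a minimal sketch of the incremental and parallel mean and variance estimate in Python. This is a reconstruction for illustration rather than the slide itself, and the class and method names are made up:

```python
# A minimal sketch of the incremental ("online") mean/variance estimate
# described above, with a merge operation for combining partitions.

class StreamingMeanVar:
    def __init__(self, count=0, mean=0.0, m2=0.0):
        self.count = count    # samples seen so far
        self.mean = mean      # running mean estimate
        self.m2 = m2          # running sum of squared differences from the mean

    def push(self, sample):
        """Incremental update: fold in one sample at a time."""
        self.count += 1
        delta = sample - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (sample - self.mean)

    @property
    def variance(self):
        """Sample variance; undefined until we've seen at least two samples."""
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")

    def merge(self, other):
        """Parallel update: combine estimates built on two subsets of the stream."""
        count = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / count
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / count
        return StreamingMeanVar(count, mean, m2)
```

The `push` method is the incremental property and `merge` is the parallel one: you can run one of these over each partition of a data set and combine the results, and the summary never grows.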
And I'll have a link to a notebook at the end of the talk where you can play with all of these. So the first problem I want to look at with a probabilistic structure is the problem of set membership. And set membership is really a fundamental primitive in a lot of data processing problems, but it's also a really interesting problem as a thing in itself. And let's look at an application to see why. Let's say you're writing a web caching service or a content distribution network. If you cache everything that you get a request for, you'll be able to satisfy subsequent requests for those things very quickly, because you'll have stored them in your fast machines that are close to your customers. But a lot of things on the web wind up only being requested once, and it doesn't make sense to cache them. So there's that long tail. And as you continue to get requests for things, your cache will continue to grow. Maybe eventually you'll have to evict some things that you've cached, because you don't have room for them anymore. It would be much better if we could say, well, we're pretty sure that things that are requested a second time are going to be requested a third, but we're not sure that something that's been requested just once will ever be requested again until it's requested a second time. We sort of want a way to know if something should be in the cache before we put it in the cache. We want, like, a cache for our cache. So if we had a way to keep track of the set of all the things that people had asked for, we could know whether or not we'd seen a request for something before. And if we had, we could say, yeah, we'll absolutely store that in the cache, because it's at least the second time someone's asked for it. But if we didn't have knowledge that someone had asked for that thing before, well, then in that case, we wouldn't store it yet. Now, if we're keeping a set of everything we've ever been asked for, that's basically a cache. So we actually want an efficient way to approximate set membership. And if we had a structure that was much smaller that could say, maybe you've seen this, or you've definitely not seen this, then we could solve this problem and only cache things that we'd seen the second time. Make sense? So this is not just an abstract example. This is something Akamai did, and they wound up saving a lot of disk space and a lot of compute and a lot of money doing this about 20 years ago. But the structure we can use to support these kinds of queries is called a Bloom filter. Before we talk about a Bloom filter, though, I want to talk about how we would solve this with a precise structure, with something that scales linearly in the number of objects, which we can't use for actual big data. So the first way we could solve this problem is with an array, right? If you don't have very many things in your set and you want to check if something's in the set, you can just use an array. You don't bother sorting it, because it takes too much effort to sort it if it's just a few things. If the elements that are going to be in your set have some kind of ordering, you could use a tree, right? You could store everything in a tree, and then you have roughly logarithmic time to look up and see if something's in the set. The space is a little worse than an array, but it's not so bad. If you want to be able to have a set of arbitrary things, you could consider using a hash table, right?
Where you hash from keys, which are the values you want to keep in your set, to any value at all, right? It could be one, it could be true, whatever. So recall how a hash table works. When you put something in a hash table, you compute a hash function on the thing you want to put in the hash table. That gives you the index of a bucket, which is where you're going to store something. So we look up the hash value for foo. That tells us which bucket we're going to store something in. And in this case, we'll create a linked list of key-value pairs. So we're going to store the key-value pair foo and one, indicating that foo is in our set. Now, after a while, our hash table will fill up and we'll have to do something to handle the case where we have two things that are hashing into the same bucket, right? There are a few different strategies you can use for this, but they all involve taking more time and taking more space. The thing we could do here is we could just say, well, if we have something else that hashes into the same bucket, we're going to wind up appending it to that list. Make sense? So we have a precise structure, but we have this trade-off between having a precise structure and using more space, and in this case, also taking more time, because if we have to go through that whole linked list to find the things we care about, we'll be processing a lot of elements one at a time. So the Bloom filter is an approximate structure, a probabilistic structure, that lets us solve this problem with a constant amount of space very quickly. And it's also based on hashing. So let's see how it works. The idea is that instead of storing key-value pairs, we just have a hash function that hashes into a bit vector of true and false values. So I hash foo, and I get that fourth bucket there, and I'm going to put a one in it to insert foo into my Bloom filter. Now, just like with a hash table, I can have collisions, and I want some way to mitigate the impact of those collisions, right? If I have two things hashing to the same bucket, I don't want to always assume that I have that thing in my set. So what the Bloom filter does is it uses multiple hash functions. So in this case, we have three hash functions, so we've set three bits to true. If we have some other thing that we're going to insert, well, it may set buckets to true in a way that overlaps with something else we've already inserted, but it probably won't overlap on all three hash functions, right? Ideally, your hash functions are independent, and one of them may collide, but all three of them probably won't for different values. So what this means is that when we want to look something up, we use all three hash functions in this case to see whether or not those bits are set to true. Now in this case, we've looked up foo and we see that all three of those bits are in fact set to true. If we look up bar, we see that all three of those bits are set to true. Let's look something up that's not in the Bloom filter. Well, we're going to look up "ugh" and we're going to say that it hashes to these three buckets. Well, in this case, two of the buckets are set to one. We had two hash collisions from our three hash functions, but one of them is set to zero, and that means we know for sure that "ugh" is not in this Bloom filter. But we could look something up that we haven't put in the Bloom filter and just by chance get collisions on all three hash functions.
And in this case, we would say yes: if we look up blah, this value that has collisions on all of the hash functions, we're going to say that it is in the filter, but we don't know for sure. And in this case, it's a false positive. So with the Bloom filter, your no is definitely no. If you ask, is this something I've seen, and it says no, you can be sure that you haven't seen it. If it says yes, though, that's really a maybe. It's not a definite yes. And as we can see, as with the mean and variance estimates, this also fits on a single slide of code. We have a bit vector, and we're just hashing into it with several hash functions, both for insertion and lookup. Very simple code. We can do something really cool with Bloom filters, though. Since we have bit vectors, there are a lot of operations you can do with bit vectors, right? And if we have a Bloom filter that we've just inserted foo into and one that we've just inserted bar into, we saw how the lookup worked. We know that if we look up foo in the foo filter and bar in the bar filter, we'll get three ones, right? But we have bits, and we can combine bits in interesting ways. If I take the bitwise or of these two filters, I actually get exactly the Bloom filter for the union of foo and bar. So in this case, I'm able to combine these filters together. This is really important, because it means that if I have a data set that's too large to process on a single machine, I can divide it into partitions, process each partition on a separate machine, and combine the Bloom filters for each partition together to get a Bloom filter for the entire data set, without having to hold more than one of these partitions in memory at a time, and those could be very small. Now, if you're algebraically inclined, you might be thinking, well, if I can do something interesting with bitwise or, I can probably also do something interesting with bitwise and. Yeah, in fact, you can. If or is union, and is going to be intersection, right? So you can get an approximation of the Bloom filter for the intersection of two sets with bitwise and. Now, this is not going to be as precise as if we took the intersection of the two sets and made a Bloom filter from that; it'll have a higher false positive rate. That's a difference from the union, which is going to be just as good as making a Bloom filter from the union of the two sets. But it's still really useful to have an approximation for intersection, and we'll see why when we look at one of the applications for a Bloom filter in a second. So as you can see, we identify correctly that neither foo nor bar is in the intersection of foo and bar. All right, so an extension to Bloom filters that sort of decreases your false positive rate under intersection is called a partitioned Bloom filter. And this is also going to form the basis for another structure we'll look at today. The basic idea is that instead of writing all of your hash functions into the same space of buckets, you have one set of buckets for each hash function. So in this case, we have a matrix with three rows, one for each hash function. And when we insert something, we're going to update the bucket given by the corresponding hash function in each row. And again, this is useful because it has a lower false positive rate under intersection.
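Roughly, that single slide of Bloom filter code looks something like this. It's a minimal sketch of the basic, non-partitioned filter: the class name, the use of salted SHA-256 to simulate independent hash functions, and the default sizes are illustrative assumptions, not the code from the talk:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bit vector plus k hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3, bits=None):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits if bits is not None else [False] * num_bits

    def _buckets(self, value):
        # Simulate k independent hash functions by salting a single hash.
        for salt in range(self.num_hashes):
            h = hashlib.sha256(f"{salt}:{value}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def insert(self, value):
        for b in self._buckets(value):
            self.bits[b] = True

    def __contains__(self, value):
        # "No" is definitely no; "yes" is only a maybe.
        return all(self.bits[b] for b in self._buckets(value))

    def union(self, other):
        # Bitwise or gives exactly the filter for the union of the two sets.
        merged = [a or b for a, b in zip(self.bits, other.bits)]
        return BloomFilter(self.num_bits, self.num_hashes, merged)

    def intersect(self, other):
        # Bitwise and approximates the filter for the intersection,
        # with a higher false positive rate than building it directly.
        merged = [a and b for a, b in zip(self.bits, other.bits)]
        return BloomFilter(self.num_bits, self.num_hashes, merged)
```

The `union` and `intersect` methods are just the bitwise-or and bitwise-and combinations described above, which is what makes these filters easy to build per partition and merge.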
So let's talk about a couple of applications of the Bloom filter. The first one, from Bloom's original paper in 1970, was actually a hyphenation program for a natural-language dictionary. And this is really a fascinating topic, because it reflects how our sense of what big data is changes over time. Like, how many dictionary apps do you have on your phone? How many dictionary apps do you have on your wristwatch? Today it's really hard to imagine not being able to fit a dictionary in main memory, but in Bloom's day, he wanted to write a hyphenation program and he could handle 90% of the cases with simple heuristics. There's a really easy way to hyphenate 90% of words, but 10% of words, and you know this especially if you speak multiple languages, are special cases where you can't just use a simple rule, and to get the hyphenation right, you need to consult some kind of dictionary that has the word and the way to hyphenate it. Now in Bloom's day, that dictionary with that 10% of words wouldn't fit in main memory on the machine that he wanted to run this program on. So he had to have some way to know whether or not he needed to hit the disk before actually hitting the disk. Sort of like having a cache for your cache. The Bloom filter solves this problem by storing only the words that you'll have to consult the disk for. If you check it and it says, yeah, that might be there, then you hit the disk. Worst case, you have a false positive, you consult the disk unnecessarily, and then you fall back to the heuristic. But in the general case, you won't have a false positive, and you'll know definitely whether you have to consult the disk or whether you can avoid consulting it. There are a lot of problems like this, where it's okay to have a false positive but you don't want to have false negatives. Another application for Bloom filters that's sort of related is the problem of Bloom join. If you imagine a distributed database where you have two tables that are living on separate machines, you could imagine that doing a join like this, where you wanted to match on an attribute across these two relations, could be really expensive. You'd have to do a lot of communication between these two machines to send those values back and forth. Well, with a Bloom join, if you construct a Bloom filter for the values of X in each of these two relations, instead of sending all of the actual values back and forth, you can send the Bloom filters across the network. That's a much more compact summary, and it tells you right away which rows are not going to be implicated in the join at all and which ones you can thus safely ignore. The last application of Bloom filters I want to talk about is motivated by the fact that these things are simple enough that you can actually implement them in hardware. And from a high level, really all of the advances in computer architecture in the last 60 years have involved extracting more parallelism out of programs, and instruction-level parallelism, where you can run multiple instructions that don't interfere with one another at the same time, is very important. But a really interesting research direction in the last 20 years has been thread-level parallelism, where you say, I want to execute some threads speculatively to try and get more parallelism out of my programs. And we can see why this is important if we look at this example C program. This is just a C program that updates an array in place. Now, if we had two calls to this function, no optimizing compiler would look at these two invocations and say, I can rearrange these, because you'd have no way to guarantee that they weren't pointing to the same memory and that they would interleave safely, right?
But if you can run these things in hardware, you can speculatively run them, determine whether or not they've touched the same memory, and roll them back if they did. Well, how would we do that efficiently? We'd execute both of these things in separate threads and we'd keep a Bloom filter for the data that each one read and wrote. And if the read set of one of these threads intersects with the write set of the other, we know that we shouldn't have run them out of order and we'd have to go back and start over. So that's exactly why we might want to use the bitwise-and intersection here, to see if there was any overlap between the filters at all. So we have this trade-off between a good false positive rate and the filter size. I promise this is the only equation in the talk, but you can see that if you fill in some basic numbers about your filter, like the number of hashes you have, the filter size in bits, and the actual set cardinality that you're dealing with, you can get an estimate of what your false positive rate is going to be (the standard approximation is a false positive rate of about (1 - e^(-kn/m))^k for k hash functions, m bits, and n inserted elements). So I look at this function and I say, great, I'm done, I understand it, right? Everyone else? Pretty clear? I actually don't get a lot out of this at all, but if I plot it, it looks like this: this is for a Bloom filter with 16,384 insertions into a 2,048-byte, so 16,384-bit, filter, and I'm going to keep inserting values into it. The y-axis here is the false positive rate. The x-axis, on a log scale, is how many things we've inserted. And the thing I want to call out is that this is pretty good throughout, but the thing I really want to call out is that when we get to 2,048 elements, that's one element per byte of the Bloom filter, our false positive rate is 3%. So a false positive rate of 3%: what interesting things can you store precisely in one byte per element? Not very many, right? So this is a really powerful technique that punches well above its weight. So a second technique I want to talk about builds on and extends the Bloom filter, and this is a technique for counting event frequencies. The most common application people think about today with event frequencies is trending topics, right? Think about which hashtags or posts or videos are popular on a social network. Obviously there are potentially millions of these things, and social networks get a lot of activity; Twitter alone sees something like half a billion messages per day. So there are a lot of problems with counting events scalably and precisely, and this is a problem where we need streaming algorithms that we can run in an incremental, scalable, and parallel manner. As before, let's look at some precise structures we could use to solve this problem. Well, we could have an array of event types and counts, right? We could have a tree of pairs of event types and counts. This may be looking familiar, right? We could have a hash table mapping event types to counts. The problem with all of these is that they all take linear space to solve this problem precisely. What we really want is a way to generalize the Bloom filter so that instead of keeping track of whether or not we've seen something, we keep track of how many times we've seen it, and the count-min sketch does just that. The count-min sketch is a partitioned Bloom filter, except instead of holding bits, it holds counters. So when we insert something into the count-min sketch, we look it up with the three hash functions and we increment the counter for each.
Ultimately, once we've populated our count-min sketch with a bunch of counts, we'll want to look up a value, and as with the Bloom filter, the values are going to overapproximate how many times we've seen something. When we look up a value in the count-min sketch, though, we're going to take the three values we get (in this case; however many hashes we have in general) and take the minimum of those, because there may be collisions, maybe even collisions on all three, but we will not have seen something more times than the lowest counter value we get for one of these. As with the Bloom filter, as with the mean and variance estimates, this fits on a single slide; I even commented this code, I have documentation here. It's a very simple technique, and again it's just using hashes to increment counters. This has another cool property, though: like the Bloom filter, you can take two count-min sketches and add them together. So if I'm processing a partitioned data set, I can generate a count-min sketch for each partition and add those together just by adding the values in each element. I can also estimate the size of a join between two count-min sketches by taking something like the inner product of the two. So in this case, let's say I take all the non-zero corresponding elements, multiply them by each other, and add them together; I get an estimate of the size of the join between these two sketches. So if, for example, I'm processing infrastructure logs, I could put log entries into both of these count-min sketches, one tracking a particular subsystem and one tracking a particular kind of log message. Then I could say, well, the number of log messages that came from this subsystem and had this severity is about, in this case, seven. So I can estimate the size of the join between two of these with this pretty simple operation. Another really valuable thing we can use the count-min sketch for, in conjunction with another simple structure, is supporting those top-K kinds of queries, like trending topics. In this case, when we put things into the count-min sketch, we're also going to maintain a priority queue. So when I put foo in, I'm going to look it up and I'm going to say, well, I've seen foo 20 times. That's the most I've seen anything that I've looked up so far, so I'm going to put that at the front of the priority queue. As things move on, I'm going to look up more values and put those into the priority queue in the appropriate places. Ultimately, if I keep the priority queue at a fixed size, I'm going to have a list of the top things that I've seen, based on the values from looking them up in the count-min sketch. And these are easy to combine if you have several of them as well, because you're always able to look up those values. So this is a really cool technique, it scales, and it's a useful way to count the top K elements that you've seen, whether that's trending topics or popular videos or popular search queries or whatever.
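For reference, a minimal count-min sketch along the lines of that single slide might look like this. Again, the names, the salted-hash trick, and the default dimensions are illustrative assumptions rather than the talk's actual code:

```python
import hashlib
import heapq

class CountMinSketch:
    """Minimal count-min sketch: a depth x width matrix of counters."""

    def __init__(self, depth=3, width=1024):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, value):
        # One salted hash per row simulates independent hash functions.
        for row in range(self.depth):
            h = hashlib.sha256(f"{row}:{value}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def insert(self, value, count=1):
        for row, col in self._buckets(value):
            self.table[row][col] += count

    def estimate(self, value):
        # Collisions can only inflate counters, so the minimum is the estimate.
        return min(self.table[row][col] for row, col in self._buckets(value))

    def merge(self, other):
        # Element-wise addition combines sketches built on separate partitions.
        merged = CountMinSketch(self.depth, self.width)
        for r in range(self.depth):
            for c in range(self.width):
                merged.table[r][c] = self.table[r][c] + other.table[r][c]
        return merged

def top_k(sketch, candidates, k=10):
    # One simple way to track heavy hitters: pair the sketch with a fixed-size heap.
    return heapq.nlargest(k, candidates, key=sketch.estimate)
```

Pairing `estimate` with a bounded heap or priority queue, as in the `top_k` helper, is one straightforward way to support the trending-topics style of query just described.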
We can see what this would look like, though, and one of the real advantages comes in how scalable the count-min sketch is. If we wanted to think about trending hashtags, maybe just thinking about trending hashtags globally is not an interesting problem, right? People in different parts of the world are talking about different things. We want a structure that's small enough to say, I want to look at each day of the week, or I want to look at a particular geographic region, and I want to be able to say, let's take the count-min sketches for Saturday and Sunday and add them together to see what people were talking about this weekend, or I want to see what people in the Americas are talking about, so let's add up the count-min sketch for each one of the 35 countries in the Americas and see what people are talking about there. The count-min sketch is a scalable way to answer these kinds of queries very easily. The next problem I want to talk about is counting distinct items. Counting items is not that hard, right? Your summary is incremental, parallel, and scalable, because you just have a number, and then you have another number, and you add them together. It's a fixed-size number, and maybe you use a floating-point number so you can trade precision for range, right? But counting distinct items is a more interesting problem: how many unique search queries have you seen? How many unique visitors have you seen to a popular website? If we wanted to do this precisely, we'd need to build a set and take the cardinality of the set, but we already know that doing it precisely is not going to be scalable, right? So we want another way to do this, not precisely, but with an acceptable error rate and a fixed amount of space. There's actually a cool trick we can do with the Bloom filter to estimate set cardinality, but we're not going to talk about it today, because there's a better technique called HyperLogLog, and to build an intuition for what's going on with HyperLogLog, I want to talk about coin tosses. So let's say you toss a coin and it lands with the face up. Are you surprised? Probably not, right? Say you toss a coin four times and it lands with the face up four times. Are you surprised now? You might be surprised, but you think about it and you're like, that's a one in 16 chance; I spend a lot of time tossing coins and it's bound to happen sooner or later, right? Now, if I toss a coin 64 times and I get the face every time, I'm starting to look at the other side of the coin and make sure that it doesn't have two faces on it, right? There's something that's probably not fair about this coin, because you have just a vanishingly small chance of actually getting 64 faces in a row with your coin tosses. Similarly, if we think about uniformly distributed random numbers, you can think of each one of those numbers as a sequence of independent coin tosses, right? Each one of the bits in a uniformly distributed random number is a coin toss. So here I have some numbers, and the first one has no zeros at the beginning. The second one has five zeros at the beginning, just like getting five of those faces in a row. The third one has one leading zero, and the fourth one has two. Now, in each of these, the probability that we've seen five zeros in a row is the same as the probability that we've flipped a coin five times and gotten five faces in a row. So in this case, it's one in 32. In general, it's one in two to the n. So the probability of seeing one leading zero is one in two, right? And the probability of seeing zero leading zeros is even higher, right? We can estimate how many things we've seen based on the probability of the kind of things we've seen.
And maybe talking about coin tosses isn't that convincing, but let's look at the cumulative distribution of the number of leading zeros in a bunch of uniformly distributed random numbers. And we see that 50% of uniformly distributed random numbers have no leading zeros, which is what you'd expect if these are actually independent, right? 75% have at most one leading zero, and so on. We're taking away half of the space of random numbers every time, right? Our probabilities are getting smaller and smaller. So you may be thinking, well, okay, so I have a way to count uniformly distributed random numbers. Great, that's like a magic trick, right? And nothing against magic tricks, I've seen a lot of great magic tricks today even, but being able to count how many random numbers I've seen doesn't solve my problem of counting the things I actually want to count, like unique visitors or search queries or just arbitrary objects. Unless I have a way to turn these arbitrary objects into random numbers. I do actually have a way to turn arbitrary objects into random numbers: I have hash functions, right? So I can actually use hash functions to turn arbitrary objects into random numbers, and then I can count the number of zeros at the beginning of those numbers and figure out how many of them I've seen. And this is how HyperLogLog works, so let's dive into how it works. I'm going to take the hash value for an arbitrary object and split that hash value into two parts. The first part I'm going to use to index into one of several registers. So in this case, I'm going to pick the third register from the right there. And I'm going to count the number of leading zeros in the rest of it. So I have two leading zeros there, and I'm going to update that register to the maximum number of leading zeros I've seen for it so far. The idea is that we're going to sort of spread this out, right? So for the next one, I'm going to land on the second register from the left here, and I have no leading zeros, so I'm not going to update that register at all. After some time, we'll have some counts of leading zeros in these registers and we'll want to estimate how many things we've seen. Well, for each of these registers, one in two to the value is the probability of seeing that many leading zeros. So these are all like probabilities, right? Two to the one, two to the three, two to the two. We're going to want to average these to get the estimate, and because they behave like rates, we're going to want to use the harmonic mean, and that tells us we've seen roughly 16 elements in this case. Being able to count 16 elements is actually not that impressive, right? You could count that with a precise technique, but this is the example that fits on a slide. If you're using HyperLogLog in practice to actually count a lot of things, you can have on the order of thousands of these registers with a 32-bit or 64-bit hash and count really quite a lot of things, into the trillions, with really great accuracy. So I would encourage people to check this out if you need to solve this count-distinct problem.
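A simplified sketch of that register-update-and-estimate logic might look like this. This is an illustrative approximation only: it assumes a 32-bit hash and SciPy's harmonic mean, uses the usual correction constant, and omits the small-range and large-range corrections of the full HyperLogLog algorithm:

```python
import hashlib
from scipy.stats import hmean

class HyperLogLog:
    """Simplified HyperLogLog sketch (bias corrections omitted)."""

    def __init__(self, register_bits=6):
        self.register_bits = register_bits           # leading bits pick a register
        self.num_registers = 1 << register_bits      # m registers
        self.registers = [0] * self.num_registers

    def _hash(self, value):
        digest = hashlib.sha256(str(value).encode()).digest()
        return int.from_bytes(digest[:4], "big")     # treat as a 32-bit hash

    def insert(self, value):
        h = self._hash(value)
        index = h >> (32 - self.register_bits)               # which register
        rest = h & ((1 << (32 - self.register_bits)) - 1)    # remaining bits
        # rank = position of the leftmost 1-bit, i.e. leading zeros plus one
        rank = (32 - self.register_bits) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        # Harmonic mean of 2^register, scaled by the number of registers
        # and a correction constant from the HyperLogLog paper.
        alpha = 0.7213 / (1 + 1.079 / self.num_registers)
        return alpha * self.num_registers * hmean([2.0 ** r for r in self.registers])

    def merge(self, other):
        # Combining sketches is just the element-wise maximum of the registers.
        merged = HyperLogLog(self.register_bits)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged
```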
HyperLogLogs also have the properties we talked about with the other structures: you can combine them simply by taking the maximum corresponding element in each register, so you can compute these on individual partitions and combine them together, and obviously they're scalable, because you have a fixed-size structure that doesn't grow. And as before, I'm importing the harmonic mean from SciPy, sorry, but this still fits on a slide. This is still very straightforward code for a basic version of this. If you're interested in learning more about how to use the HyperLogLog in practice, I'm pointing to this paper, which dives a lot deeper than the example code does, because it comes from putting this into production at Google to estimate cardinality for a lot of big problems. I want to talk about one last problem that we're not going to have time to go into in detail, and that's the problem of scalable set similarity. You could imagine that you're a literary agent and someone submits you a manuscript. You get this manuscript in your inbox, you start reading it, and you think, well, maybe I've seen that somewhere before; it sounds familiar, but I'm not quite sure. Depending on whether or not you've spent a lot of time with 19th century English literature, you might know that this is actually a gently plagiarized version of the beginning of Jane Austen's Pride and Prejudice, but you'd want a way to answer that kind of question automatically. And document similarity, set similarity, is an interesting problem not just for natural language but also for programs, and it's a basic building block for a lot of other interesting problems as well. How would we solve this problem? Well, one thing we can do is look at a set of all the words in the document, or a set of all substrings in the document, and ask how similar these two sets are. There's a really easy way to calculate set similarity called the Jaccard index. If we have two sets here, which I'm representing as bit vectors, we can calculate their similarity by taking the size of their intersection over the size of their union. Now that's fast, it's linear; it's not scalable, but it's linear. The problem is that to actually find the most similar documents, we'd have to compute this Jaccard index for every single pair of documents. So there's a very clever technique called MinHash that solves both the problem of doing scalable set similarity and the problem of doing the pairwise comparisons to find the most similar documents over your whole set of documents in a scalable and clever way, and I have some code for this in a notebook; I'll encourage you to see that notebook to learn more about MinHash.
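For a taste of what's in that notebook, here's an illustrative sketch of the exact Jaccard index plus a tiny MinHash signature. This is not the notebook's code; the salted-hash construction and the number of hash functions are assumptions made for the example:

```python
import hashlib

def jaccard(a, b):
    """Exact Jaccard index of two Python sets: |A & B| / |A | B|. Linear, not scalable."""
    return len(a & b) / len(a | b)

class MinHash:
    """Minimal MinHash signature: one running minimum per (salted) hash function."""

    def __init__(self, num_hashes=64):
        self.num_hashes = num_hashes
        self.signature = [float("inf")] * num_hashes

    def insert(self, value):
        for salt in range(self.num_hashes):
            h = hashlib.sha256(f"{salt}:{value}".encode()).hexdigest()
            self.signature[salt] = min(self.signature[salt], int(h, 16))

    def similarity(self, other):
        # The fraction of signature positions that agree estimates the
        # Jaccard index of the underlying sets.
        matches = sum(a == b for a, b in zip(self.signature, other.signature))
        return matches / self.num_hashes
```

The signatures are small and fixed-size, so comparing them is cheap, which is what makes pairwise similarity search over a big collection of documents tractable.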
So to put it all together: as I mentioned at the beginning of the talk, I work on distributed systems and machine learning at Red Hat. These techniques are really valuable in data processing and data engineering, and as building blocks both for understanding data and for training predictive models. But I think these things are also machine learning in a deeper sense, right? Think about what a machine learning model is: it's a compact, actionable summary of data that supports interesting queries and generalizes from examples. What we have here is machine learning by that definition, and it also supports a lot of machine learning techniques. But let's look at what we've done today. First, we introduced three key properties that scalable algorithms need to have: they need to be incremental, parallel, and scalable, and we introduced those in the context of online mean and variance estimates. We introduced the technique of hashing, which is fundamental to everything else we talked about in the talk, because it enables us to take an arbitrary number of things and map them onto a fixed amount of space. We talked about the Bloom filter, which is a really clever data structure to support scalable set membership queries. We generalized the Bloom filter with the count-min sketch, which enables us to track event frequencies and also supports top-K queries, like trending topics queries. Finally, we saw the HyperLogLog structure, which is a really cool way to use some properties of the distribution of random numbers to tell us how many unique things we've seen when combined with a hash function. Thank you so much for your time today. Thanks for your patience with a little bit of technical difficulty there. My name is Will Benton. If you scan this QR code on your phone, it will take you to that notebook I've been referring to, where you can learn more about all of these techniques. If you want to reach me on Twitter, I'm @willb, and I answer email at willb@redhat.com. We have a community project called radanalytics.io, which is for scale-out machine learning and intelligent applications in containers. These sorts of data structures are very important when you're scaling out, because they work just as well scaling out as scaling up. And then finally, if you go to chapo.freevariable.com, I have a web log there where I write about distributed systems, machine learning, and other related topics. Thank you so much for your time. It's wonderful to be here.