We're going to spend our time talking about approximation algorithms for counting things. Like I say, counting is as easy as one, two, three, five — counting things is hard.

A little bit about me. I lead the analytics and machine learning team at Socrata — I like to call it AniML. Socrata focuses on making it easy for governments — big cities like New York, Seattle, and Chicago, national agencies like CMS and the White House, international organizations like the World Bank and the EU — to get their data onto our platform, and it's up to us to present that data with the best possible context for the citizens looking at it. It's a pretty exciting company; I could talk about it all day, and I'll be in the back of the room if you want to.

So counting is hard. Counting is really hard. It's hard because you have to store a lot of information: if you want to know the number of distinct items in a list, you have to track the distinct items. If you want to handle concurrency, you have to deal with races. It can get really tricky, especially once you start trying to run these things at scale. But counting is important — I know it sounds silly, but people have to count things all the time.

Here are some applications of what I call counting, things we've actually done. Service monitoring: we have a service and we want a plug-and-play way to drop something in, so we can ask, over a rolling window of, say, three minutes, what's the ratio of 200s to 500s on health checks? Recommendations: we have lots of data and we want to recommend filters to people, so they see the view of the data that's most appropriate for them. Big data analysis: often the data is simply too big to count everything — there's no one computer that can hold all the results in one place. And finally, the thing that really motivated this talk is this UX. This is our new UI — it's a little washed out because of the lights, I apologize. What you're seeing, if you could see it, is a bunch of group-bys, category cards. The designers came to me and said: hey, Mark, we don't have the data imported yet, but during ingest I want you to do two things. First, tell me which columns are low cardinality, so they should become category cards. Second, if possible, show a preview of the page before we've finished ingesting the data set, so people understand what their data looks like. So we have a stream of data we're not storing anywhere; we get to run across it once and extract as much information as we can.

In general I'd say there are three approaches to counting at scale, and we're going to cover all three in this talk. One is to treat it as a hard engineering problem: you understand your code, you understand what you're doing. Another is to use math: hash functions, statistics, things like that. And the last one I call using ignorance: you just do something and hope it works — and a lot of the time it does. We'll talk about all three.

So here are the specific problems we're going to look at. There are hundreds — well, maybe tens — of blog posts on the internet comparing the details of the different algorithms. Instead, we're going to cover maybe four or five classes of problems and the archetypal techniques people use, so when you see these problems you can pattern match, recognize the techniques, and understand the papers you're reading.
To be clear, we're not going to get deep into the details of efficiency and correctness; we're just going to talk about the big picture. But we will look at a lot of code.

The first problem is estimating the cardinality of a stream — we'll define it more carefully in a minute, but you have a stream of data and you want to know how many unique elements are in it. These problems progress from the least information returned to the most information returned, and this one returns the least: you see a big stream, and all you get to say is, I think there are about this many unique elements. The example here is the category card: a column is low cardinality if it has a small number of distinct values, and then we can group by it really efficiently in our back end.

The next problem is the class called heavy hitters, or most frequent elements. You look at a stream of data and say: these elements appear more frequently than the others — or maybe just, this element appears more than 50% of the time.

Next, more information: we want to estimate the frequency of elements, the count of items. You see a large stream of data, and then someone queries you: how many times did the word "apple" appear? You don't know a priori which items you'll be queried about, but you still want to give a good approximation of how many times you saw a particular item.

And finally we'll end with what's called the top-k problem: return the elements that appear most frequently, along with their approximate counts. You can see how that could power that new front end.

If you want to follow along — and I recommend you do; the slides will be available for people in the back — this is all in a library called Tally Ho, because we're counting things, and you'll find it on my GitHub. In it you'll find the code we're talking about and actual implementations — one of our engineers wrote an implementation of count-min sketch in Haskell yesterday, just because he was interested — plus links to all the papers we'll talk about today.

We'll be using two libraries: Algebird and stream-lib. Algebird came out of Twitter. It's built for exactly these kinds of counting problems on big data, it's really meant for the MapReduce world, and it's purely functional — nothing mutates. It's a great library, especially in the big data world, and I think Spark now uses Algebird for its distributed counting. The other came out of Clearspring and it's called stream-lib. It comes from the Java world, so it's a Java library, highly side-effecting — and also really fast.

So what is this talk? It's a discussion. I don't want anyone to get lost. There will be some technical parts and some non-technical parts, and I deliberately made it a little short, because I want this to be a conversation: we can ask questions, dive into things together, look at code. Please just interrupt me — I won't be offended. This is more of a grad-student discussion than a lecture. And, as I like to say, you're all brilliant — so I hope you like me a little better now that I've complimented you.

Let's talk about our rules for streaming data, because when I say streaming data, this is what I mean. Data is presented as a sequence of elements. It comes in; maybe the types are all the same, maybe they're not.
But for now, we'll pretend all the data types are ints, or longs, or strings. Each element can be processed once and only once. And you can only use data structures that take a constant amount of memory — so if there are two billion unique elements, or twenty-billion-character strings, you don't get to grow with them; you only get a small, constant memory overhead.

Before we get going, there's one data structure I like to use to stitch streaming computations together. I find it very useful and I think you might too. It's called a pipeline, and it's a trait. A pipeline takes an element of type I, processes it, emits elements of type M, and can eventually return results. For results we use shapeless's HLists — I'll explain why in a minute. Let's go back to my code. The pipeline has two parts: a process function at the bottom, and two methods for composing pipelines together. You can say: take the output of this pipeline and pipe it to another pipeline. That's really useful when, say, your input is sentences — strings — and you want to count words: you split the sentences on spaces, pipe that to another pipeline, and that one handles words. You also have alongWith, which is really useful when you want to run two computations over the same input — say, counting the words and also the total length in bytes of those words.

In general, the way I like to program these things is with a functional outer layer and a tight, closed, imperative, mostly side-effecting — don't shoot me — but fast inner core. When I'm using these libraries I can think functionally, but internally I'll throw a lot of that away to make things go fast; a lot of that has to do with the JVM. Yes, sir? [Audience: do your streams have any way to signal the end of the stream, or do they just keep going?] They run and run and run. You can call results at any time you want, and it gives you the current state of the world.

So these just stitch together. If we go to the source, we can look at the pipeline spec — and of course the file doesn't want to save. Some examples: there's a StringToLength, which processes strings and returns an int — it takes a string and tells you how long it is. There's a WordCounter, which takes strings and returns a map from every word you've seen to the number of times you've seen it. And here's how you stitch things together: in the code down here we make a new WordCounter, run it alongWith a StringToLength, and we get both results back as this nice HList we can pattern match on when we're done. It makes the code much cleaner — the original version used tuples of tuples, and that was a pain to work with. We can also do things like pipeTo: take the StringToLength, pipe it to a summer, and figure out the total length of everything we've seen. Yes, sir? [Audience: sorry, what's an HList?] An HList comes from shapeless — it's a heterogeneous list, basically a way of having a typed tuple without dealing with nested tuples.
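Roughly, the shape of that trait is something like this — a sketch for the slides, not the exact Tally Ho code, and the names are illustrative:

```scala
import shapeless._
import shapeless.ops.hlist.Prepend

// Sketch of the Pipeline idea: consume I's, emit M's downstream, expose results as an HList.
trait Pipeline[I, M, R <: HList] { self =>
  def process(elem: I): Seq[M]   // consume one element, emit zero or more downstream
  def results: R                 // current state of the world, as an HList

  // Feed everything this pipeline emits into another pipeline.
  def pipeTo[M2, R2 <: HList](next: Pipeline[M, M2, R2]): Pipeline[I, M2, R2] =
    new Pipeline[I, M2, R2] {
      def process(elem: I): Seq[M2] = self.process(elem).flatMap(next.process)
      def results: R2 = next.results
    }

  // Run another pipeline over the same input and concatenate both result HLists.
  def alongWith[M2, R2 <: HList](other: Pipeline[I, M2, R2])(
      implicit prepend: Prepend[R, R2]): Pipeline[I, M, prepend.Out] =
    new Pipeline[I, M, prepend.Out] {
      def process(elem: I): Seq[M] = { other.process(elem); self.process(elem) }
      def results: prepend.Out = prepend(self.results, other.results)
    }
}

// A word counter: mutable inside, but you only ever see the immutable results.
class WordCounter extends Pipeline[String, String, Map[String, Long] :: HNil] {
  private var counts = Map.empty[String, Long]
  def process(word: String): Seq[String] = {
    counts = counts.updated(word, counts.getOrElse(word, 0L) + 1L)
    Seq(word)
  }
  def results: Map[String, Long] :: HNil = counts :: HNil
}
```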
So an HList is like a list where each slot has its own type — maybe a String, then an Int — all stitched together and heterogeneously typed. shapeless is awesome; we should all be using shapeless. It really makes things nice. [Audience: does process have to side-effect?] It's up to the implementation — it can side-effect, it doesn't have to. In general I tend to side-effect in there, just because I really care about megabytes per second when I'm doing these kinds of computations. Sometimes the emitted type is Unit; other times it's something like sentences (strings) to lists of words (arrays of strings) — take the big sentence, split on spaces, and get an array of strings back. It really just depends. Does that answer your question? Excellent.

To simplify the vocabulary a little: we have steps, we have processors that break stuff up, we have transformers. A transformer has no results — something like sentence-to-words is just a transformer; you never get results out of it. And there's a SimplePipeline alias that just cuts down on the type signatures when you're writing these things.

So, finally, the meat of the talk: computing the cardinality of a stream. Given a stream of data, return the number of distinct elements. The first thing to say is that there's a lower bound — you have to store at least some minimum amount of data if you want to do this exactly — and the proof sketch is fairly straightforward. Say your elements are integers in the range one to M, and you're some way into the stream, say crossing the ten-thousandth element. How many different sets of distinct elements could you have seen by now? Any subset of those M possible values — the distinct elements could be any combination, and you don't know which one. To answer exactly, you have to be able to distinguish all of those subsets, so you need on the order of M bits of information. That doesn't work when M is huge: we could compute the answer exactly, but we can't do it well with a constant memory overhead. So we have to approximate the cardinality of a stream.

Now let's stop for a second and take a little aside — back to maybe first-year computer science — and talk about some tricks we're going to use. Hashing. In this talk I've tried to unify the notation across the standard textbooks and papers, because if you then go look at a paper with different notation it can get confusing; if there's any problem, ask me.

So what does hashing do, and why do we care? Again, pretend we have integers. You have some big space of symbols — call its size big M; there are a lot of them. You want to map that big set of symbols down to a much smaller set of size little m, with m much, much smaller than M. And we need one more definition, because all the literature that uses hashing tricks relies on it: hash families. You don't have one hash function, you have a whole bunch of them. The useful property is called two-universal.
A family is two-universal if, for any x and y in your symbol set with x not equal to y, when you pick a hash function from the family at random, the probability of a collision — h(x) = h(y) — is about one over m. What does that mean? It means collisions look like what you'd get from pulling values at random. You can basically think of hashing as a random number generator for your stream, or at least treat it as one. The picture is: big M, a function h_i from the family, mapping down to little m.

We're going to talk about two hash families in this talk, because that's what the literature uses. For integers: choose a and b at random, choose some giant prime p, and define the family as h_{a,b}(x) = ((a·x + b) mod p) mod m. You mod by the prime to get something you can actually prove statements about, and you mod by m to map down into the small set. This is a two-universal family. And it turns out, if you do a bunch of bit math, that when m is also a power of two you can compute this with shifts — it's really fast.

I'm a card-carrying mathematician, so I do have one proof — please don't run away; it's the only proof in the talk — but I think it's worth seeing that this family really is two-universal. When do two distinct elements x and y collide? It means ((a·x + b) mod p) and ((a·y + b) mod p) are equal mod m, and two things are equal mod m exactly when they differ by a multiple of m: zero m, one m, two m — if m is 12, then values like 1, 13, 25 all collapse to the same thing once you mod by 12. Move things around, do the arithmetic back and forth, and you end up with an equation that says: for a given offset (that multiple of m, call it l), there is exactly one value of a that makes x and y collide. Now count. How many a's are there in total? This is the denominator of the probability — like a six-sided die has six total possibilities. There are p minus one of them, because past p you start double-counting a mod p. And how many of those choices cause a collision? The offset l isn't fixed: for a fixed pair it can range over roughly p over m values. Put it together: out of p minus one total possibilities, about p over m collide. That's basically one over m once things get big — the p's essentially cancel out.

Now, this is great, but when I show it to people they say: that's great, Mark, you have integers; I don't have integers, I have strings. So what do you do? There's a great trick here, and it comes from Mitzenmacher's paper on building a better Bloom filter. You could just take every string and compute d independent murmur hashes — treating the hash family as a bunch of separately seeded hashes — but that's slow, even though murmur hash itself is perfectly good (you don't need cryptographic hashes; murmur is fast). You can do much better. The trick is to murmur hash your element just twice and linearly interpolate between the two: the i-th hash is g_i(x) = h_1(x) + i · h_2(x). So the first hash is h_1(x), the second is h_1(x) + h_2(x), the third is h_1(x) + 2·h_2(x), and so on. You get the properties you need from a universal hash family while only computing two murmur hash functions, which is much faster. Some libraries use this, some don't; if yours doesn't, it should, because it's just much, much faster.
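In code, the two tricks look roughly like this — the seeds, constants, and helper names here are mine, just for illustration:

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

object HashTricks {
  // One member of the 2-universal integer family: h(x) = ((a*x + b) mod p) mod m.
  val p: Long = 2147483647L // a big prime (2^31 - 1)

  def makeIntHash(m: Int, rng: Random = new Random()): Int => Int = {
    val a = 1L + rng.nextInt(Int.MaxValue - 1) // a in [1, p-1]
    val b = rng.nextInt(Int.MaxValue).toLong   // b in [0, p-1]
    (x: Int) => {
      val r = ((a * x + b) % p + p) % p        // keep it non-negative
      (r % m).toInt
    }
  }

  // The double-hashing trick for strings: the i-th hash is h1(x) + i * h2(x),
  // so you only ever compute two murmur hashes per element.
  def stringHashes(numHashes: Int, m: Int)(s: String): Array[Int] = {
    val h1 = MurmurHash3.stringHash(s, 0x5bd1e995)
    val h2 = MurmurHash3.stringHash(s, 0x1b873593)
    Array.tabulate(numHashes) { i =>
      val g = (h1.toLong + i.toLong * h2.toLong) % m
      (if (g < 0) g + m else g).toInt
    }
  }
}
```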
And, like I said, why do we care about hashing at all? Let's take one more aside. This picture comes from Foundations of Data Science — Ravi Kannan and his co-authors. We have a big sequence of elements drawn from the range one to M. Imagine some oracle handed you the set of distinct elements in it, and suppose, further, that those distinct values had been chosen uniformly at random from the range. Now ask: what's the minimum value in that set? If you picked d distinct values uniformly, the smallest one lands, in expectation, somewhere around M divided by (d plus one) — the range chopped into d plus one roughly equal pieces. Flip that around and the minimum gives you an estimate of the number of distinct elements. It's honestly a horrible estimator on its own — there's a lot of variance — but that's the idea. Now, we can't randomize our sequence; that makes no sense, since we only get to process elements once. But we can hash our sequence, and hashing gives you exactly the randomness you need — enough that you can actually prove theorems about how well you can approximate things.

The next thing people say is: that's great, Mark, hashing gives you randomness, but you can also sample. And sampling is a really important technique — I use it all the time: just randomly sample the stream. (This all ties back together, don't worry.) What does random sampling mean? You have your sequence of elements again, and you want to choose little m of them uniformly from the whole sequence. If you could do that cleanly, you'd have precisely the setup from the previous slide, and that's a pretty simple way to estimate counts. To do it, you'd pick each element with probability little m over big M — the sample size over the total number of elements. That would work great — except we have a streaming algorithm, and I don't know a priori how many elements are in the stream, so I can't flip a coin with that probability, because I don't know the denominator.

But you can sample from a stream. This came out of Vitter in 1985 and it's called reservoir sampling. It's a very powerful technique, and it's actually pretty straightforward, so let's build the intuition. Say the universe has one and only one symbol, A. How do you sample uniformly from that? You return A with probability one — that's all there is. Now say there are two. You store A — assume it comes first — and keep processing, and eventually you see a B. Flip a coin: there are two unique elements, so pick a random number, and if it's greater than a half, keep B; otherwise keep A. Just flip a coin.
And that gives you the nice uniform behavior you want. You extend this by fixing a K — that's called your reservoir, and it's your sample; it's where you store everything you've sampled. You process the stream one element at a time and store the first K elements straight into the reservoir. Now the reservoir is full and a new element comes in. What do you do? Pick a random number, and — this is the interesting part — if it's less than K divided by the current index, you keep the new element, replacing a random slot in the reservoir. So if your reservoir has size 10 and you're processing the 11th element, you keep that 11th element with pretty high probability: 10 over 11. You store it in the reservoir and just keep doing this over time.

Here's the picture — I actually grabbed it from a task-scheduling write-up, where it's used to assign work to nodes. We process the stream, going along; we happened to keep element 18 because the coin flip came up that way; we get to the most recent element, number 22, pick a random number against K over the index, and either keep it or we don't. This is simple, but it's exceptionally powerful. I'm not going to prove it, but there's a hint on the slide if you want to prove it yourself: after you've processed the entire stream, the reservoir you're holding at the end is a uniform sample of it.

And why is this useful? I use it all the time. Say I have a hundred million elements I'm trying to understand, and I just want a feel for the shape of the data — which values are common, which aren't. I run this very simple algorithm, get a nice little subset, and put it on my laptop. The code is trivial, so let's look at it. Again, I side-effect like crazy here to be fast, and people can lynch me afterward. An element comes in; we have a couple of counters — I tried to make this slightly thread-safe, but it's really not, don't worry about it. We process the element and get the current index: if we're still below the reservoir size, we just store it; otherwise we choose a random double and, with the right probability, overwrite a random slot with the new element. Very simple, but very powerful. You can extend this in various ways — read the literature; it's not important here.
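For reference, a minimal version of that sampler is just this — a sketch, single-threaded, with names that are mine rather than the repo's:

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Reservoir sampling (Vitter, 1985): after processing the whole stream,
// `sample` is a uniform sample of size `size` of everything seen.
class ReservoirSampler[T](size: Int, rng: Random = new Random()) {
  private val reservoir = new ArrayBuffer[T](size)
  private var index = 0L

  def process(elem: T): Unit = {
    index += 1
    if (reservoir.length < size) {
      reservoir += elem                        // fill the reservoir first
    } else {
      // Keep the new element with probability size / index,
      // overwriting a uniformly chosen slot.
      val j = (rng.nextDouble() * index).toLong
      if (j < size) reservoir(j.toInt) = elem
    }
  }

  def sample: Vector[T] = reservoir.toVector
}

// e.g.: val r = new ReservoirSampler[Int](10); (1 to 1000000).foreach(r.process); r.sample
```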
Now, back to the problem at hand: we want to count things, so why did I spend all that time on hashing and sampling? Because it falls together nicely in the streaming setting. We're going to approximate the cardinality of a stream, and we're going to talk about HyperLogLog. HyperLogLog is a very popular algorithm; literally all it returns is the approximate number of distinct elements. It came out of Flajolet and his co-authors around 2007 or so — this is pretty recent work; these are hard problems that have only been solved well relatively recently.

Let's build some intuition for why it works — the goal here is intuition, not implementation details. Imagine you have eight-bit bit strings: x7, x6, x5, and so on, each bit either zero or one. Define the rank of a bit string as the number of leading zeros plus one. Why is that interesting? Ask: what fraction of these strings have rank one — no leading zeros at all? There are 2^8 strings in total, and fixing a leading one leaves 2^7 possibilities — so half of them. (I don't do arithmetic well on stage; thank you.) Same idea for rank two: you've fixed a zero then a one, so a quarter of them. All the way down to the deepest rank, where there's exactly one string: seven leading zeros and then a one. And here's the punchline: if you look at all the values streaming past you — hashed, so effectively random bit strings — and track the maximum rank you've seen, that gives you a handle on how many unique elements there are. A rank-R pattern only shows up about once in every 2^R random strings, so if the biggest rank you've seen is R, you've probably seen on the order of 2^R distinct values.

Now we have to make this actually work, because that single estimate is exceptionally noisy, and making it behave in practice is the hard part — which is also why we're not proving things, even though the algorithm itself is easy to understand. Here's the trick, still with our eight-bit strings: instead of taking the rank of the entire string, peel off, say, the first two bits. That gives you four prefixes — 00, 01, 10, 11 — and those index four counters. Into each counter you store the maximum rank of the remaining six bits of every string that landed there: the prefix picks the bucket, the rest of the string updates it. This turns out to smooth things out.

Let's do a more precise example. We have four strings. For the counter M_00: no string starts with 00, so nothing is stored there — call it zero. For M_01: two of the strings start with 01; after the prefix, one has two leading zeros (rank three) and the other has none (rank one), so M_01 is the max of three and one, which is three. Same story for M_10 and M_11. That's the whole idea, and it works.

And you just generalize this. All HyperLogLog does — and it does a lot with it — is: choose p bits from the start of every hashed value, let the number of counters be m = 2^p (in our example p was two, so four counters), divide the input stream into those substreams just like the two-bit case, and hash things in, with every counter holding the maximum rank it has seen. That's the first half. The second half is getting the answer back out of all those counters, and this is where the HyperLogLog magic — the "use math" part — happens. The slide is a bit washed out again, but the estimator is E = α_m · m² / Σ_j 2^(−M[j]): for each counter M[j] you take two to the minus that value, sum them up, divide m squared by the sum, and scale by a constant. It's just math over an array; the magic is in how you choose that α constant.
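Before we get to where α comes from, here's a toy end-to-end version just to show how few moving parts there are — it murmur-hashes, hard-codes the constant, and skips the small- and large-range corrections that a real implementation needs:

```scala
import scala.util.hashing.MurmurHash3

// Toy HyperLogLog: p prefix bits pick a register, the rank of the remaining
// bits updates it, and the harmonic-mean estimator turns registers into a count.
class ToyHyperLogLog(p: Int = 12) {
  private val m = 1 << p                        // number of registers
  private val registers = new Array[Int](m)
  private val alpha = 0.7213 / (1 + 1.079 / m)  // the tabulated constant (for larger m)

  def offer(item: String): Unit = {
    val hash = MurmurHash3.stringHash(item, 0x5bd1e995)
    val bucket = hash >>> (32 - p)              // first p bits choose the register
    val rest = hash << p                        // the remaining 32 - p bits
    val rank = math.min(Integer.numberOfLeadingZeros(rest) + 1, 32 - p + 1)
    registers(bucket) = math.max(registers(bucket), rank)
  }

  // E = alpha_m * m^2 / sum_j 2^(-M[j])
  def cardinality: Double = {
    val harmonic = registers.map(r => math.pow(2.0, -r)).sum
    alpha * m * m / harmonic
  }
  // Note: without the small-range correction this badly overestimates tiny cardinalities.
}
```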
Where does α come from? There's an integral behind it, but in practice everyone just looks at a table: with 16 counters you use this constant, with 32 counters that one, and so on. So this is actually fairly easy to implement. As for where the formula comes from: the technique is called stochastic averaging — you have all these noisy per-counter estimates, and you combine them with a harmonic mean, and that turns out to be your answer. If you look at the literature you'll find plenty of write-ups on how well HyperLogLog counts. You'll find, for instance, that at very small cardinalities you should switch to a different estimator, because plain HyperLogLog overcounts significantly there. And Google has a paper, HyperLogLog++, out of Google research, asking what you do when the stream is way too big: with a very large number of unique elements you get many more hash collisions, so they just use a bigger hash function. None of it is hard, but it's worth understanding.

So let's dive into the code and see how our two libraries implement this — it'll give you a feel for the different flavors of the two code bases. I have a little HyperLogLog interface, just so I can show how things work, with a processElement and an estimate; the details don't really matter. And there are two implementations behind it: Algebird and stream-lib.

Algebird first. Again, it came out of Twitter, and it's an exceptionally functional library — nothing side-effects; every time you process an element you get a new container back, which you then have to join with what you had. The main thing you use is called the HyperLogLogMonoid. That 12 is the number of bits taken off the front of every hashed value; 12 is a pretty standard choice and generally puts the error rate where you want it. To process an element, you use the monoid to turn the item into an HLL value and add it back in — and because of the way I'm side-effecting inside my processElement, I keep an accumulator: I initialize it to zero, then say, take this new HLL and add it to the one I have, and I get a new one back. When you actually want the approximation, you call a special method on the monoid called sizeOf, hand it the HLL you care about, and get a number back. This is great, and it's exceptionally useful for MapReduce-shaped problems.

stream-lib is a lot more straightforward. You say: here's my HLL — HyperLogLog, 12 — you do the side-effecting for me, so I don't have to worry about any of these vars and locking issues. I offer an item, it takes it, and I call cardinality on it. So that's how you actually use these things — those are the classes and the calls you'll reach for; the rest is just me adapting them to the pipeline interface for testing, so you can inject whichever HyperLogLog implementation you want. And that's problem one: we can now pick a library and say roughly how many distinct elements are in the stream.
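Side by side, driving the two libraries looks something like this — I'm writing it from memory, so treat the exact method names (create, sizeOf, offer, cardinality) as approximate and check them against the versions you're using:

```scala
import com.twitter.algebird.{HLL, HyperLogLogMonoid}
import com.clearspring.analytics.stream.cardinality.HyperLogLog

object DistinctCount {
  // Algebird: purely functional -- build HLL values and combine them with the monoid.
  def algebird(items: Iterator[String]): Long = {
    val monoid = new HyperLogLogMonoid(12)                  // 12 prefix bits
    var acc: HLL = monoid.zero
    items.foreach { s =>
      acc = monoid.plus(acc, monoid.create(s.getBytes("UTF-8")))
    }
    monoid.sizeOf(acc).estimate                             // Approximate[Long] -> Long
  }

  // stream-lib: one mutable object -- offer items, then ask for the cardinality.
  def streamLib(items: Iterator[String]): Long = {
    val hll = new HyperLogLog(12)
    items.foreach(hll.offer)
    hll.cardinality()
  }
}
```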
Now let's get a little more interesting. The next problem: find the elements above a certain frequency in a stream. You give me a frequency and say, hey Mark, give me every element that appears more than 50% of the time, and I have to give you those elements back. This is another class of problems, and the approach here is from Karp, Shenker and Papadimitriou. It's more of what I'd call the engineering solution: straightforward, basically no math except in the analysis, and a very direct implementation.

We'll use their setup. They define an alphabet — the symbols you might see, some number of unique ones. You get a sequence drawn from that alphabet, you process each element of the stream, and finally you're given a threshold θ: say 50%, or 10%, or 75%, whatever you want. The assumption that really matters is that the number of elements in the stream is much, much bigger than the size of the alphabet, which in turn is much, much bigger than one over the frequency. So if θ is 50%, one over θ is two: if there are, say, ten symbols in the alphabet and the stream is a billion elements long, we really only want to store about two elements' worth of state — that's the memory overhead. In their notation, f_a is the number of times element a appears in the sequence — its count — and the answer is the set of elements whose count exceeds θ times the length of the stream.

An example is really the best way to see it. Here's our stream; there are fifteen elements — you can trust me, I think I can count. How many elements appear more than 50% of the time? One does: the value one, which appears, I think, eight times. How many appear more than 10% of the time? One and two show up, because two appears twice. That gives you a good sense of the problem: these are the elements that appear the most frequently. Doing this exactly takes a lot of memory — you'd basically have to store every unique element and its count — so once again we approximate. The guarantee they give is: we won't return exactly the most frequent elements; we'll return the most frequent elements plus some other stuff that comes along for the ride — a superset. And the space is one over θ: if you only want the things above 50%, you store about two elements.

Here's the trick when θ is 0.5 — say you want the majority element. How would you do that? Pick two distinct symbols and remove them from the list. Let's work through the example: pick one and two, remove them; you're left with one, one, and three. Pick two more distinct symbols — this time one and three — remove them; you're left with a single one. By construction, what's left is the majority element: every removal takes away two different symbols, so whatever survives appeared more than anything else. The general algorithm is the exact same idea — it's very simple stuff, just hidden a little in the notation, and the implementation makes it even clearer.

So here's the algorithm, thinking about it more generally. K is the set of symbols you're currently keeping track of, and count is an array of integers alongside it — think of count as a map from the symbols you're tracking to their counts. For every element you process: if it's in K, increase its count by one — you've seen it one more time. If it's not in K, insert it and set its count to one. And when K gets too big — in the majority-element case, when it reaches size three — you subtract one from every single count and throw out anything that hits zero. Why does this give a superset? Because things you process near the end of the stream might just stick around for the ride — you never saw enough other elements afterward to eject them.
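Stripped of the concurrency, the whole idea fits in a few lines — a single-threaded sketch of the same thing (the class in the repo is the concurrent version we'll look at next):

```scala
import scala.collection.mutable

// Karp–Shenker–Papadimitriou frequent elements: keep at most ceil(1/theta)
// candidates; on overflow, decrement every count and evict whatever hits zero.
// Whatever survives the stream is a superset of the elements with frequency > theta.
class FrequentElements[T](theta: Double) {
  private val capacity = math.ceil(1.0 / theta).toInt
  private val counts = mutable.Map.empty[T, Long]

  def process(elem: T): Unit = {
    if (counts.contains(elem)) {
      counts(elem) += 1                         // already a candidate: count it
    } else {
      counts(elem) = 1                          // new candidate
      if (counts.size > capacity) {
        // Too many candidates: this is the "remove one of each" step.
        counts.keys.toList.foreach { k =>
          counts(k) -= 1
          if (counts(k) <= 0) counts.remove(k)
        }
      }
    }
  }

  def candidates: Set[T] = counts.keySet.toSet  // superset of the heavy hitters
}

// e.g.: val f = new FrequentElements[Int](0.5); Seq(1,1,2,1,3,1).foreach(f.process); f.candidates
```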
We can look at this in Scala — what did I call the class? Karp, I think. This one is, I believe, thread-safe, although I'm always open to being told it's not. We have a θ, and we have a non-blocking hash set of the elements we're tracking so far — I'm a big fan of Cliff Click's high-scale-lib for non-blocking concurrent data structures; it's fast — plus a hash map from the tracked elements to their counts. To process an element: does K contain it? If it does, get the counter and increment it. If it doesn't, add it to the set, make a new counter, and putIfAbsent because of concurrency. Then we call cleanup, and cleanup just runs across all the tracked elements subtracting one and ejecting anything that reaches zero.

So now we've seen two of the three ideas: one that's really engineering and algorithms, and one that's math. Now for the other thing, which I called using ignorance. This is a case we actually hit in a lot of our services: you want a quick way to serve a health check. Every service has one, and it tells you the ratio of 500s to 200s over the past three minutes, or five minutes, whatever it is — a rolling window. And here we just use ignorance: we write something simple that scales, and we trust that the caller doesn't put too many unique elements into it.

So, the observations: we're assuming, but not enforcing, that there are very few distinct elements — HTTP codes, so 200s, 500s, 400s, and we don't count the 400s. We have to handle concurrency, because, at least in our world, many threads are serving requests and they all update this thing at once. Like I said, nothing fancy — we just have to be careful.

Here's the picture of how it works. You have a bunch of buckets; each bucket corresponds to a slice of the time window — though actually the initial implementation doesn't know about time at all; I'll show you how time gets added in a second. You have a pointer to the current bucket — say bucket two. You see a new item; the bucket holds a map of every item you've seen in that slice, and you add one to that item's count. Eventually you advance to the next bucket and start adding there. And when you want to know how many of something you've seen over the past X buckets, you just sum that item's count across all the buckets. We can look at the code; the main thing to think about is how to handle the concurrency safely.
Again, this is multi-threaded: lots of threads updating what is effectively a global counter at the same time. So we initialize a bunch of buckets — again, non-blocking hash maps — where each map goes from an item to the count of that item seen so far in that bucket, and we keep a counter for which bucket is current. When you increment, you say: I found a new occurrence of this thing; get the current bucket and add one to that item's count in it. When you want the results — and again I use the pattern I like, side-effecting internally but always presenting a functional, immutable outer layer — we take the data, get the key set, build up a new map, and hand back an immutable copy of it.

But this doesn't have time in it yet. I said this was a rolling window, and I've only talked about buckets; nothing ever advances them. How do you integrate time? There's a nice trick here — I don't know where it originally came from; it was introduced to me at my last job by people much, much smarter than me — and the idea is that you cheat. You have a little wrapper class: it knows the window (say, I want to measure things over the past five minutes) and the granularity (I want each bucket to be one minute long), so that's five buckets total, and it owns a plain rolling counter like the one we just wrote. Then you use a scheduled thread pool, and its only job is to call a heartbeat, which calls the advance-bucket method once every granularity interval. Nothing crazy — but the nice part is that the rolling counter itself doesn't care about time at all. It's much easier to implement because there's no time in it; it's up to the scheduled thread pool to call advanceBucket every so often. Every granularity tick — one minute in this example — it says: advance. The rolling counter moves over one notch, resets that bucket, and on we go; counts just pass straight through. In the counter itself there's nothing at all that needs time — only the little runnable down here does. I find this a very nice pattern whenever something fundamentally involves time: separate the actual time component from the rolling component. I use it quite often. And look, it's not precise — scheduled thread pools don't always run exactly when they're supposed to — but it's good enough for our work here, because all it drives is an alert.
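Here's roughly what that pair of classes looks like — a simplified sketch with names of my own choosing; the real thing is more careful about races around the bucket pointer:

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
import java.util.concurrent.atomic.{AtomicInteger, AtomicLong}
import scala.collection.JavaConverters._

// Knows nothing about time: just buckets, a current-bucket pointer, and counts.
class BucketedCounter[K](numBuckets: Int) {
  private val buckets = Array.fill(numBuckets)(new ConcurrentHashMap[K, AtomicLong]())
  private val current = new AtomicInteger(0)

  def increment(key: K): Unit = {
    val bucket = buckets(current.get())
    bucket.computeIfAbsent(key, _ => new AtomicLong(0L)).incrementAndGet()
  }

  def advanceBucket(): Unit = {
    val next = (current.get() + 1) % numBuckets
    buckets(next).clear()              // the oldest slice falls out of the window
    current.set(next)
  }

  // Immutable view of the counts, summed over the whole window.
  def counts: Map[K, Long] =
    buckets.toSeq.flatMap(_.asScala).groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2.get()).sum }
}

// The wrapper that owns time: a scheduled pool heartbeats advanceBucket().
class RollingCounter[K](windowMinutes: Int, granularityMinutes: Int) {
  val counter = new BucketedCounter[K](windowMinutes / granularityMinutes)
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(
    new Runnable { def run(): Unit = counter.advanceBucket() },
    granularityMinutes.toLong, granularityMinutes.toLong, TimeUnit.MINUTES)
}
```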
Okay, moving on. The next problem: we want to count occurrences of an unknown set of things, in a stream. This is the count-min sketch. The paper is from 2005 — again, recent as these things go. The problem it solves is this: a large stream of data goes by, and afterwards you get to ask one kind of question, and one kind only: tell me how many times element a appeared. Say we see a whole stream of fruits — apples, oranges, persimmons; there's an effectively unbounded number of fruits in the world — and at any point someone asks, hey Mark, roughly how many apples have gone by? We're going to use our universal hash family for this; that's what everyone uses, at least in the integer case.

Here's the setup. Your hash family has d hash functions — that's how many times you hash each item. What's the range of each hash function? Each one maps into zero to w minus one. For the integer case earlier we mapped big M down to little m; here it's called w because that's what the paper calls it. The way to picture it: every row is a different hash function and every column is a counter, and you store the whole thing as a d-by-w matrix. I stole the picture from the Highly Scalable blog — a great post that actually walks through the details of all these structures. You have d hash functions, h1, h2, all the way down to hd. An item comes in and you hash it d times: the first hash says which column in the first row, and you add one to it; the second hash picks a column in the second row, add one there; and so on all the way down, one increment per row.

So you have all these counters — now what's the count of an item? Think about it: we're hashing, and the problem with hashing is collisions, so which of the d cells would you trust most? [Audience: the smallest one.] Right — the smallest one; that's the one with the fewest collisions. And that's the count-min sketch. That's it: to query an item, you hash it with all d hash functions and take the minimum of the cells you land in. Not bad, right? In fact, when one of our engineers saw this — he was giving a practice talk the other day — he liked it enough to write a beautiful purely functional implementation in Haskell; it's in the Tally Ho repository under source.

How is count-min sketch different from a counting Bloom filter? We're not really going to talk about counting Bloom filters, but they keep a single array; count-min sketch, the way I like to tell the story, keeps a matrix of values. As for choosing the hash functions: if you know you have integers, you use that standard family with the multiply, add, and mods. For strings and general objects, you use murmur hash, just like I said earlier. Some implementations just murmur hash everything d times, which is slower than it needs to be; others use the hashing trick — Clearspring uses the trick — and some Python libraries use cryptographic hashing, which is way too slow and totally overkill, but when you have a hammer, you use it. Most implementations are one of those two flavors, and with strings you really want the trick: choose two murmur hash functions and interpolate across them.
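Here's a bare-bones sketch of the structure itself, just to make the matrix picture concrete — a toy, not any library's code:

```scala
import scala.util.hashing.MurmurHash3

// Count-min sketch: d rows of w counters; an update touches one cell per row,
// and the estimate for an item is the minimum over its d cells.
// (Real implementations derive depth and width from a target error and confidence.)
class ToyCountMin(depth: Int, width: Int) {
  private val table = Array.ofDim[Long](depth, width)

  // Double-hashing again: the row-th hash is h1 + row * h2, reduced mod width.
  private def cell(item: String, row: Int): Int = {
    val h1 = MurmurHash3.stringHash(item, 0x5bd1e995)
    val h2 = MurmurHash3.stringHash(item, 0x1b873593)
    val h = (h1.toLong + row.toLong * h2.toLong) % width
    (if (h < 0) h + width else h).toInt
  }

  def add(item: String, count: Long = 1L): Unit =
    (0 until depth).foreach { row => table(row)(cell(item, row)) += count }

  def estimateCount(item: String): Long =
    (0 until depth).map(row => table(row)(cell(item, row))).min
}
```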
Can you read off what the error bounds are? Yes, you can — there are two knobs, corresponding to how small the over-count is and how likely you are to stay under it: an epsilon and a confidence. I don't remember the exact equations off the top of my head — I may have put them at the top of this file; apparently I wasn't that smart — but that's how the real math works: you say, I want this precision and this confidence, you do the arithmetic, and it gives you the depth and width you need. They link to the paper, and you can find results in the repo as well.

For this implementation I only wired up the count-min sketch from stream-lib. Algebird has a count-min sketch too, but it only works with longs, while stream-lib works with arbitrary objects. And it's more of the same story about how the two libraries work: Algebird's count-min sketch is, as you'd expect, a monoid — you build these things and add them together. Again, that's overkill for in-process use like I'm doing here, but it's exactly what you want when you're on Spark, which is part of why Spark uses Algebird.

So here's the stream-lib API. You construct a CountMinSketch with a depth, a width, and a seed — or with an epsilon and a confidence, and it computes the depth and width for you. (Bear with me while I find my mouse and get out of presentation mode... there we go.) The internals are pure Java, and I like them. They do the hash trick for integers — as I said earlier, when the modulus is chosen nicely you can do the mods with bit operations, and it's really fast. And if you add a String rather than a long, they have this nice getHashBuckets helper that does the hashing trick for you — it runs through all the murmur hashes very quickly. Then the count: the count just takes the minimum over the table cells, exactly as we said.

So the API has two pieces. You add an item — by default everything I described counts by one, but you can add any count you want, even negative numbers; it just changes the analysis a little, so if you want to increment and decrement, they support it. And then you ask for the estimated count of an element. My wrapper around it is very thin — it's really just there to unify the interfaces.
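Using the stream-lib one directly looks about like this — from memory, so double-check the constructor and method signatures against the version you're on:

```scala
import com.clearspring.analytics.stream.frequency.CountMinSketch

object CountMinExample extends App {
  // depth, width, and a seed for the hash functions; there's also a constructor
  // that takes a target epsilon and confidence and derives depth/width for you.
  val cms = new CountMinSketch(10, 2048, 12345)

  Seq("apple", "apple", "pear", "apple", "persimmon").foreach(cms.add(_, 1L))

  println(cms.estimateCount("apple"))      // ~3, possibly a little high from collisions
  println(cms.estimateCount("banana"))     // ~0
}
```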
And this brings us to the final thing we're going to cover in depth — like I said, I deliberately left a chunk of time at the end to just talk about the ways people count things, because I'd really like this to be a discussion, and there's a lot more we could get into. This one is the Space-Saving algorithm with its Stream-Summary data structure; the paper, by Metwally, Agrawal and El Abbadi — the people down here on the slide — is about efficiently computing frequent and top-k elements in data streams. And again, it's just damn good engineering: it's a pure algorithms paper, and all the math lives in the analysis. In fact, if you locked me in a room and said, Mark, come up with a way to do top-k, then given maybe six months I'd give myself some probability of coming up with this algorithm — unlike most of the other stuff today, which is just way smarter than me. It's recent, too — 2005, I think.

So this is another counter-based technique. The idea is that you update counters in a way that accurately reflects the frequency of the significant elements — significant just meaning possible top-k elements — and you also keep an estimate of those frequencies, how many times each one appeared. And you store everything in an always-sorted structure, so that when someone asks for the top k, you just stream across the front of the sorted structure and there they are.

Let's talk about how this works — I have pictures we can walk through; this is honestly one of my favorites, and I use it a lot. You have an alphabet A, which is where all your symbols come from, and a stream S of size n, whose elements we process one by one. The frequency of an element is the number of times it appears — if you see the word apple five times, its frequency is five — and the goal is to find the k elements with the highest frequency. Top 20, top 100, BuzzFeed's top 10, whatever you want to do.

Think about the perfect world first. We have m counters, with m greater than k — we'll get to the data structure in a few minutes. If you somehow knew the true order of the elements, you'd just make sure that the element they call e_i, the one with rank i, is stored in the i-th counter; then the top k would be e_1, e_2, up through e_k, and you'd just output those elements. The real world is that the order elements arrive in doesn't reflect their final ranks: things come out of order, you have to move stuff around, and you don't know the future of your stream. An element might not even be in the top ten of the first hundred — or first million — elements, and then become the most popular thing over the last billion. So the aim is: if position i of the data structure really does hold the element of rank i, then its counter should tell you that element's frequency, and you can trust it as the i-th element.

Now the algorithm — the idea first, then the pictures, because I think this is honestly one of the more important ones we'll talk about today. We monitor a total of m elements with m counters, and we pick m a priori: choose 10, choose 20, choose 50, just make sure m is greater than any k you'll care about. If one of your monitored elements is observed — say I'm monitoring apples, oranges, and pears, and I see an apple — just increment that element's counter. If a non-monitored element is observed — I'm monitoring apples, oranges, and pears, and now I see a banana — that's the interesting case.
Let's assume all your slots are already filled — you only have three here, and you can never store more than that, so you have to get rid of something. You find the element with the minimum count and throw it out. Now you add the new element you just saw — but what do you initialize its count to? This is where the algorithm is really nice and subtle. If you initialize it with one, that's not right, because you may have seen bananas twenty times already: seen it, ejected it, seen it, ejected it. So one isn't the honest number to put there. Ask instead: what's the maximum number of times it could have appeared without ever being in the list? It can't be more than the minimum count you just evicted — so you add one for the occurrence you just saw and initialize the new element to that minimum plus one.

An example should make this clearer. Say we're doing top-k with m = 2 counters, and this slightly odd data structure where counter i holds an element and its count. The true stream is 1, 1, 3, 2, 3. We see 1: counter zero holds (1, 1) — element one has appeared once. We see 1 again: (1, 2). We see 3: the other counter holds (3, 1). Now we see a 2 and there's no space left, so we have to kick something out: we evict the 3, which had the minimum count of one, but we don't give the 2 a count of one — we give it two, the evicted minimum plus one. Think about it: for all we know — we only have a finite amount of memory and no history — the stream so far could have been one, two, one, two and the two could already have been ejected; we just don't know, so we over-guarantee. Now a 3 arrives again: we kick out the 2 (the minimum, count two) and insert 3 with count three. At this point the counts are off — 3 really only appeared twice — but the top k is preserved: one and three are the true top two.

There are more pictures coming, but one more interesting thing about this structure — sorry, wrong slide for a moment, I apologize — is some additional information the paper stores alongside the counters. Algebird keeps it; Clearspring does not. What's nice is that you can emit a binary flag with the output that says: I guarantee this answer is correct, or I can't guarantee it either way — you just don't know. When it is guaranteed, though, you can really trust your answers. The way you do it is that when you insert an element you don't just store the minimum plus one; you also store an extra value, the count of the element you evicted. That error estimate rides along as things move up, so you always know how much you might be over-counting by. It makes the algorithm quite a bit more complicated to implement, but it's a lot more useful. Clearspring doesn't do it, Algebird does — and that fits the theme: Clearspring is about speed and correctness, but really about speed; Algebird is about being correct and functional and just right.
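Before the data structure, here's that bookkeeping in its simplest possible form — a toy that just scans for the minimum (the real algorithm keeps the Stream-Summary structure we're about to see so the min lookup and the moves are O(1)):

```scala
import scala.collection.mutable

// Space-Saving with a plain map: item -> (count, error), where error is the
// count it inherited from whatever it evicted.
class ToySpaceSaver[T](capacity: Int) {
  private val counters = mutable.Map.empty[T, (Long, Long)]

  def offer(item: T): Unit = counters.get(item) match {
    case Some((count, error)) =>
      counters(item) = (count + 1, error)             // monitored: just count it
    case None if counters.size < capacity =>
      counters(item) = (1L, 0L)                       // free slot: start at one
    case None =>
      val (victim, (minCount, _)) = counters.minBy(_._2._1)
      counters.remove(victim)                         // evict the minimum...
      counters(item) = (minCount + 1, minCount)       // ...and charge it as error
  }

  def topK(k: Int): Seq[(T, Long, Long)] =            // (item, count, possible over-count)
    counters.toSeq.map { case (t, (c, e)) => (t, c, e) }.sortBy(-_._2).take(k)
}

// e.g.: val ss = new ToySpaceSaver[Int](2); Seq(1, 1, 3, 2, 3).foreach(ss.offer); ss.topK(2)
```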
What I haven't told you yet is that there's still a data structure involved — I've told you what to do, not how to do it efficiently. So, first in words and then in pictures. All elements with the same count live in a linked list together, hanging off a parent bucket that carries that count. If my stream were apple, apple, pear, pear, banana, you'd have one little bucket whose children are apple and pear — the things that have appeared twice — and the parent bucket holds that count. The parent buckets are also a linked list, one bucket per distinct count — one for count two, one for count three, one for count five — each with the children that share that count, and the parent buckets are always kept sorted.

Here's what that means in pictures. This is the initial state of the world, with, say, three buckets: apple has appeared n times, pear and orange have appeared n minus one times, and grape and berry have appeared once each. Now we process a pear. We find the parent bucket that has pear in it — the n-minus-one bucket — and add one, so pear's count is now n. We shift it over: the n bucket's children are now apple and pear, orange stands alone at n minus one, and grape and berry are still by themselves. What's the other case? Take the same structure and now we see the word grape, which is sitting in the count-one bucket. We look at the next bigger bucket — that's n minus one, which is way too big; we're looking for a bucket of count two. So we add one — these are doubly linked lists, so we can insert efficiently — and (I mangled this picture a bit, I apologize) in between we create a new bucket, count two, containing only the element grape. The data structure is preserved: still sorted, still linked lists, so everything stays efficient.

Now the algorithm as the paper presents it, with the notation only slightly changed — it's exactly what we just did. Let bucket_i be the bucket holding element i's current count, and say the next biggest bucket is the one for count plus one. We see element i: detach it from its bucket and add one to its count. Then there are two options. If a bucket with the new count already exists, attach the element to that bucket's child list — shift it right over. If it doesn't exist — either the element now has the largest count of all, or its new count falls between two existing buckets — create a new bucket, give it the incremented count, attach the element to its child list, and splice it in.

To actually compute the top k: in stream-lib, you just stream across and output the k biggest elements; Algebird is a little more complicated. And one comment about the implementations we're about to look at: the Algebird one, the SpaceSaver, doesn't have a zero element, so you can't create the object until you've processed the first element — and that makes it a little fun to program against, especially once concurrency is involved. So let's go look at it.
So again, we have this nice little interface just so I can plug implementations together: a process method that takes an element and returns unit, and a topK method. (Sorry, wrong window; let me find the StreamSummary code... there it is.) Here's the stream-lib implementation. You define the top-k structure, which is called StreamSummary. You tell it the type you're counting, so here we're counting strings, and you give it a width: the width is how many counters it keeps around. You just have to make sure the width is comfortably bigger than the k you actually want; in practice I'll choose something like 20 or 50. To process, you just offer the element. What's nice about this offer method is that, although my interface says it returns unit, stream-lib actually returns whether the element was accepted or not. So if you're calling it from outside this code you can be a little smarter: it tells you either "I accepted this element, it's new, I want to keep it," or "I didn't keep this element, it wasn't useful to my computation." For topK, you just grab the top k. This is a Java library, so we have to play some games with JavaConverters. Here's the trick I use to remember whether JavaConverters or JavaConversions is the one I want: JavaConversions are perversions; JavaConverters never cause... something. I always forget the second half anyway. So we use JavaConverters: we take the top k, which comes back as a Java collection, convert it to Scala, pull out the items and their counts, and convert that to a vector, because I said this returns a sorted sequence of elements. Algebird is a little more interesting, and we have to play some games. What the literature calls a stream summary, Algebird calls SpaceSaver. It's a semigroup, which means it doesn't have a zero. I don't understand that decision, but it's what they do, so it's what we live with. To make this work, and to be concurrent and all those things, you can't initialize it until you've processed the first element. So we keep a nice little AtomicReference to a SpaceSaver. In Java, an AtomicReference is a pointer you can read and update safely under concurrency: once it's set, you can trust it's set. So when you process a string, you first get the atomic reference and check whether it's null; that's standard concurrency programming. Then, because some other thread may have set it between that null check and your write, you use compareAndSet, and you construct the SpaceSaver with the actual element, because you need an element to create one. compareAndSet means: only set this if it is still null. If that succeeds, we've started the SpaceSaver data structure. If it fails, a SpaceSaver already exists, so we can add the two together with the semigroup's plus operator. Remember, this isn't a monoid, it's only a semigroup.
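Quick aside before we finish the Algebird piece: here is roughly what that stream-lib wrapper could look like. This is a sketch, not the actual Tally Ho code; the TopK trait stands in for the interface from the talk, and it assumes stream-lib's StreamSummary constructor, offer, and topK(k) signatures.

```scala
import com.clearspring.analytics.stream.StreamSummary
import scala.collection.JavaConverters._

// Sketch of a stream-lib-backed top-k wrapper; TopK is a stand-in for the
// interface described in the talk.
trait TopK[T] {
  def process(item: T): Unit
  def topK(k: Int): Seq[(T, Long)]
}

class StreamLibTopK[T](width: Int) extends TopK[T] {

  // width = how many counters StreamSummary keeps around; make it bigger
  // than the k you actually plan to ask for.
  private val summary = new StreamSummary[T](width)

  // offer() really returns whether the element was admitted, but our
  // interface only needs Unit, so the flag is discarded here.
  def process(item: T): Unit = summary.offer(item)

  def topK(k: Int): Seq[(T, Long)] =
    summary.topK(k).asScala.toVector.map(c => (c.getItem, c.getCount))
}
```

A caller that cares about the boolean from offer can of course keep it instead of discarding it, which is the "be a little smarter from outside" point above.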
Back on the Algebird side, we call plus: take the current summary we have stored, which we know is not null, and add the new single-element SpaceSaver to it. That's how processing works. To get the top k, we use summary.get, which could still be null, so we wrap it in an Option. What's really nice is that if it's null, the Option is a None: there's nothing to return because you haven't processed anything yet. So we can use a fold to clean the code up: we fold, initializing with the empty sequence, and otherwise we grab the top k and map over the values. Each value is a tuple of, I think, the item, the item's approximate count, and that guaranteed flag with the little error estimate, which we don't use here. I'll put a rough sketch of this whole wrapper at the end. And that's how you do it. People ask me a lot whether they should use Algebird or ClearSpring's stream-lib, and the answer is that it really depends on your use case. This comparison shows exactly what each one is optimized for. When you're doing MapReduce-style things, divide and conquer, where you fan work out to all your nodes and then bring the results back together, Algebird is precisely the kind of code you want: map over the pieces, then reduce, where reduce is just adding two summaries together associatively. stream-lib is just easy to use; that's really what it comes down to. And there's the property I promised to come back to: if you sum all the counters in the stream summary, you get the total number of elements in your stream. It's a nice little feature, and I can show you what it looks like in practice. There are a lot of areas I didn't cover today. Bloom filters are a huge body of work; they're for set membership, where you just want to ask, have I seen this element before or not? One way to think about a Bloom filter is as a count-min sketch where each counter is a single bit: if the bits an element hashes to are zero, you've definitely never seen it; if they're set, you've probably seen it. I also didn't talk about the great recent work on streaming percentiles and medians. Say you have a stream reporting how long every HTTP request took in milliseconds, and you want an always-up-to-date answer to "what's the 99.9th percentile? what's the median?" without re-running a massive computation, so you can drive a nice little dashboard. And we didn't talk about distributed counting at all: things like Cassandra counters or Riak's vector clocks, where instead of multiple threads it's multiple nodes all counting at the same time, all writing to some centralized store, and you have to keep everything in sync and consistent. And there's probably a whole list of things I don't even know exist; this area is changing all the time. And so, that's it. We're hiring. And I purposely...
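Since the Algebird version was only described in words, here is the rough sketch promised above, pulling together the AtomicReference initialization, the semigroup plus, and the Option/fold around topK. The class name AlgebirdTopK is made up, and it assumes SpaceSaver(capacity, item) for construction, ++ for the semigroup combine, and a topK(k) that yields (item, approximate count, guaranteed flag) tuples, as described in the walkthrough.

```scala
import java.util.concurrent.atomic.AtomicReference
import com.twitter.algebird.SpaceSaver

// Rough sketch of an Algebird-backed top-k wrapper. SpaceSaver is a semigroup
// with no zero, so the reference starts out null and is only created once the
// first element arrives.
class AlgebirdTopK(capacity: Int) {

  private val summary = new AtomicReference[SpaceSaver[String]](null)

  def process(item: String): Unit = {
    val one = SpaceSaver(capacity, item) // a single-element SpaceSaver
    // Only the first writer installs the initial summary; everyone else merges.
    if (!summary.compareAndSet(null, one)) {
      var merged = false
      while (!merged) {
        val current = summary.get()                               // non-null by now
        merged = summary.compareAndSet(current, current ++ one)   // semigroup plus
      }
    }
  }

  def topK(k: Int): Seq[(String, Long)] =
    // None until the first element has been processed, hence the fold.
    Option(summary.get()).fold(Seq.empty[(String, Long)]) { ss =>
      ss.topK(k).map { case (item, approxCount, _) => (item, approxCount.estimate) }
    }
}
```

The fold means an empty stream yields an empty sequence rather than a null, which is exactly the cleanup described in the talk.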