Good afternoon and welcome to this talk on approximate algorithms for stream data summarization. My name is Himadri Sarkar and I work for SumoLogic, which manages and analyzes machine-generated data. Let me start with a demo which will set the context for the three approximate algorithms that we will be talking about later.

This is the SumoLogic web application, which can be used to debug and analyze already ingested log messages. The search feature allows you to identify important messages and extract meaningful values and metrics out of them. When you're dealing with millions of log messages, it is at times difficult to write very specific and targeted queries. For instance, this is a query which I ran a couple of minutes ago, and it produced around 1 million results spanning 71,000 pages. Now imagine yourself debugging an ongoing fire in your production systems, and you get overwhelmed with 1 million log messages from the very query that was supposed to pinpoint the root cause for you. That is not an ideal state to be in. So what you do is start with your initial query and refine it further to return fewer results in subsequent runs. For that we needed a feature which can assist our users to drill down further into such a massive result set.

First let me explain briefly what it is that I am trying to search. This query does a keyword search over log messages; these are the keywords. Then it tries to identify adjacent strings within the log messages and extract some interesting fields from them. Generally speaking, it is trying to get metrics corresponding to queries executed by our customers. Along with the query we also supply a time range over which we want to search the data.

Now let me run the same query in a separate tab. When I press the start button it kicks off the search. We don't wait for the search to finish: as the search starts producing results, they are streamed to the UI, starting from the most recent message to the least recent ones. Remember these are server logs; each of them has a timestamp from which recency is derived. Along with that we also display statistics for each field present in the output result set. We call this the field browser. For instance, if I zoom into the bottom-left corner, you can see that while the query is returning results, these values are getting updated in real time.

Now what do these values represent? These are the various fields present in the output data; you can think of them as columns in a big table. Beside each field name we display the number of distinct values seen so far. So if you look at this field, customer ID, we have seen 566 distinct customers so far. Going further, this number 8,000 means that we have seen 8,000 distinct queries so far, and remote IP address shows that users have been firing queries from 200 distinct IP addresses.

Now let us look at each of these fields in detail. If I click on customer ID, it shows the top 10 customer IDs and their frequency of appearance in the logs. It also shows a histogram which, as you can see, is updating in real time and gives a relative idea of this frequency with respect to the other top 9. Similarly for query, this window displays the top 10 queries with their frequency of execution, and if I hover over a query it shows the complete query text.
You can see that we have seen around half a million distinct session IDs so far, and this window is displaying the top 10 session IDs by their frequency of appearance in the logs.

Previously we were analyzing logs corresponding to all the customers. Now let us analyze the logs of the one customer who is running the most searches. I go to the customer ID field name, I click on it, and I get the list of the top 10 customers. To drill down, I just need to click on this, and it will open the same search in a new tab but with an additional filter pertaining to that particular customer. The search has started and the field browser has started populating again. You can see that the customer field now has only one distinct value because we applied that filter; what you are seeing is data pertaining only to this particular customer. You find that there is just one exit code so far and its value is zero, which is very good: it means all of this customer's queries have executed successfully. We also see that this customer is executing queries from three different IP addresses, that among the results returned so far there were 17,000 distinct session IDs, and, going further down, that we have utilized 35 host machines to execute searches for this customer. Now you have lots of data points to guide your search nearer to the desired result, and hopefully we will be able to fix the production incident.

Alright, so that was the demo, and we were trying to do two simple things in it. First, we were counting distinct values in a stream of data, and second, within each field we were counting the top ten values. That seems to be a very easy problem, right? We do counting daily in our lives, and you have probably done it many times in code you have written. So what are the traditional approaches to these two problems? To count distinct values, you use a hash set: for each incoming element you insert it into the hash set, and when a query for the distinct count comes, you just return the size of the hash set. Similarly, to count the frequency of each distinct value, you can use a hash map: the map contains the values and their counters, and you keep incrementing them.

Now let's apply scale to this simple problem. What happens when you want to count one million distinct elements where the size of each element is one kilobyte? That translates to a storage requirement of one gigabyte. Say you want to do this for 200 fields; multiply by 200 and you need 200 gigabytes. I'm not saying this has to be main memory, it can be any kind of storage, but it is still 200 gigabytes. Additionally, if you keep counters for each of the distinct values and use long values for them, you need an additional 1.6 gigabytes on top of the 200 gigabytes I already showed you. And there is one more thing: ten such queries need to run on a single machine, so we should multiply by 10 again, which translates to roughly 2 terabytes of storage per node. I'm sure no one will allow you to build a feature with this kind of requirement, and it's just a single feature, not a complete product.

So there is something we want: a data structure which has a constant memory footprint. Additionally, we want each of the operations on that data structure to be inexpensive.
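Before we trade anything away, here is the traditional baseline just described, in code form. This is a minimal sketch with made-up field values; its memory grows linearly with the number of distinct values, which is exactly what breaks at the scale above.

```python
from collections import Counter

# Exact counting for a single field: one set entry per distinct value
# (~1 GB for a million 1 KB values) plus one long counter per distinct
# value (~8 MB), repeated for every field and every concurrent query.
distinct_values = set()
frequencies = Counter()

for value in ["cust-1", "cust-2", "cust-1", "cust-3", "cust-1"]:  # stand-in log stream
    distinct_values.add(value)
    frequencies[value] += 1

print(len(distinct_values))            # distinct count shown in the field browser
print(frequencies.most_common(10))     # top-10 values by frequency
```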
Now, we cannot have our way with everything, right? We have to trade something in to achieve this, so let's suppose we agree on trading away some amount of correctness, which gives us these kinds of results. Before I go into the solutions to the two problems I described, let us try to gain some intuition from the set membership problem. I'm sure many of you have solved this problem in your production systems, or have thought about solving it. What we want is a memory-efficient data structure that can be used to check the membership of an element e in a set S of elements.

Since we agreed that we are ready to trade off some amount of correctness, I claim that there is a very simple data structure which has just two components: a bit array, all initialized to 0, and k hash functions. What is the property of these hash functions? They can take in any arbitrary object, whether integers, strings or anything else, and what they return is an index into the bit array. The stream will contain two things: the operation to be performed and the value of the element on which that operation should be performed.

Now let us see how we keep track of which elements are present in the set. For each incoming element we have the insert operation. The element goes through all k hash functions, which generate k indices into the bit array. We go to those indices and turn on the bits. That records that this particular element is present in the set. Notice one interesting thing: we are not storing the element e anywhere. We are just storing the information that e is present in the set.

Say we inserted n elements, so lots of bits got turned on. Now what about a query? When an element e(n+1) arrives and asks whether it is present in the set, what do we do? Something similar: we run it through all the hash functions, get the indices, go to those indices, fetch the bit values, and take an AND over them. If the AND returns 1, we say this value is already present in the set; if it returns 0, we say it is not present. But we should be careful about something. It is quite possible that the elements e1 to en that we inserted prior to e(n+1) have, in combination, turned on all the bits that e(n+1) would have turned on had it been inserted. So this gives us false positives. But we will see that we can reduce the false positive probability to such a small value that this data structure is still useful in lots of cases.

Without going into the theoretical proof, let us jump directly to the results. To use this data structure you just need three things: n, m, and k. n is the total number of elements that you expect to arrive in the stream, and m is the size of the bit array, so it is proportional to the memory you can spend on this data structure. What you do is use the second formula first and compute k, the number of hash functions. Once you have k, you plug n, m, and k into the first formula and you get the error. If that error is acceptable to you, you continue with these values of n, m, and k; otherwise you go back, increase the memory, and get a new value for the error. You do this iteratively.
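As a concrete illustration, here is a minimal Python sketch of the bit-array structure just described. Salted SHA-256 digests stand in for the k hash functions, and the two formulas referred to above are the standard ones: k ≈ (m/n)·ln 2, and false positive rate ≈ (1 − e^(−kn/m))^k. The parameter values in the example are only illustrative.

```python
import hashlib
import math

class MembershipSketch:
    def __init__(self, n, m):
        # n = expected number of elements, m = number of bits.
        # "Second formula": the number of hash functions that minimizes error.
        self.m = m
        self.k = max(1, round((m / n) * math.log(2)))
        self.bits = bytearray(m)            # m bits, stored one per byte for simplicity

    def _indices(self, value):
        # Simulate k independent hash functions by salting one strong hash.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, value):
        for i in self._indices(value):
            self.bits[i] = 1                # we never store the element itself

    def might_contain(self, value):
        # AND over the k bits: False is always correct, True may be a false positive.
        return all(self.bits[i] for i in self._indices(value))

def expected_error(n, m, k):
    # "First formula": approximate false positive probability.
    return (1 - math.exp(-k * n / m)) ** k

sketch = MembershipSketch(n=1_000_000, m=8_000_000)    # 8 million bits for a million elements
sketch.insert("customer-566")
print(sketch.might_contain("customer-566"))            # True
print(expected_error(1_000_000, 8_000_000, sketch.k))  # roughly 0.02
```

With these illustrative numbers, about 8 bits per expected element, k comes out around 6 and the expected false positive rate is around 2%; if that were too high, you would go back, increase m, and recompute, exactly the iteration described above.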
So this lays the path for the first approximate algorithm I wanted to discuss, and it will help us gain an intuitive understanding of the actual solutions to the problems we started with. I'm sure many of you know about this and have used it: what we just described is a Bloom filter.

From our understanding of Bloom filters comes a very interesting family of data structures: the sketch data structures. Imagine these two images, one raw and one JPEG. The raw image is 100 times bigger than the JPEG image, but both convey the same meaning to you, so why would we need to store the raw image? This is similar to what we did with the Bloom filter: we did not store the exact elements, yet it was still able to tell us whether any element e was a member of the set. Sketches also have a very good property: they can be aggregated. That gives us a direct benefit: if your data is spread across multiple machines, you can compute sketches on those individual machines, transfer just the sketches to one machine, and aggregate them there. For instance, if you were to do this with Bloom filters, you would compute the Bloom filters on the individual machines, bring them to one machine, and do a bitwise OR. That gives you the picture of the entire data collected across N machines, and N can be very big.

All right, we wandered off to another problem; let's come back to the problem we were looking at and solve it. We will use our understanding of Bloom filters as a role model. Again, for counting frequencies we will not store the exact elements, only counters. In the previous case it was a membership operation, so each cell was zero or one; in this case we will have to store integer or long counters. And where the Bloom filter used a 1D sketch, let us improve on it and use a 2D sketch.

This is what the data structure we will use to keep track of frequency counts looks like. There are multiple counter arrays: d arrays, each of width w. And again our good old hash functions are there, one per array, so d hash functions in total. How does an insert operation for tracking the count work? Whenever we get an element e from the stream, we pass it through these hash functions. They give us d indices, one per array. We go to those indices in the respective arrays and increment the values. We are incrementing this time, rather than just setting a bit, because we want to keep track of a count. And how does a query work? When you want to know how many times we have seen x so far, you do something similar: you pass x through the hash functions, go to those buckets, fetch all the values, and your frequency estimate is the minimum of the values you get. We take the minimum because of collisions. Take this example: we inserted x1 and we inserted x2, and for h2, the second hash function, they both mapped to the same bucket, whose value is 21. That means whenever we insert x1 or x2, we go and increment the same bucket, so that bucket holds an overestimate: two different elements are incrementing it. That's why we take the minimum across all the arrays; that value carries the least amount of collision error.
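Here is a minimal Python sketch of the d-by-w counter structure just described. Salted SHA-256 digests again stand in for the d hash functions, and the width and depth values are only illustrative.

```python
import hashlib

class FrequencySketch:
    """d counter arrays of width w; insert increments one counter per row,
    query returns the minimum across rows (an overestimate, never an
    underestimate, because collisions only ever add)."""
    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, value):
        digest = hashlib.sha256(f"{row}:{value}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def insert(self, value):
        for row in range(self.depth):
            self.rows[row][self._index(row, value)] += 1

    def estimate(self, value):
        return min(self.rows[row][self._index(row, value)]
                   for row in range(self.depth))

sketch = FrequencySketch(width=100, depth=6)
for customer in ["c1", "c2", "c1", "c3", "c1"]:
    sketch.insert(customer)
print(sketch.estimate("c1"))   # >= 3; exact here unless a collision occurred
```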
And we will see that by choosing appropriate values for the depth and the width of this structure, we can put bounds on these errors. Again, very simple formulas; let's go to the results directly. We have epsilon, which is the accuracy we want: epsilon is the maximum error, as a fraction of the total number of elements seen, that we can tolerate in any frequency estimate. And delta is the probability with which we allow that bound to be exceeded. These very easily give you the width and the depth: the width is the ceiling of e divided by epsilon, where e is the base of the natural logarithm, and the depth is the ceiling of the natural log of 1 by delta. Very simple formulas, and they really work.

But you should remember one thing while using this data structure. Suppose you have counted all the frequencies, sorted them, and plotted that frequency distribution; it will look like this, right, because you sorted them. The estimates near the y-axis are more accurate than the estimates at the end, in the tail. That is because many of the high-frequency elements at the beginning will also increment the buckets in which the tail elements lie, and that leads to higher relative overestimation of the tail elements. That's why we generally use this data structure to keep track of the top k frequencies, and that's what we did in the demo I showed you: we kept track of the top 10. To keep track of the top 10, you use an additional, very simple structure. Every time you insert an element into the sketch, you get its current estimated count, and then you check whether it can displace any of the elements in the current top 10. If it can, you insert it and one element gets spilled out, so at all times you are tracking the top k elements.

What I just described is a very useful data structure known as the count-min sketch. It is not a new data structure; it is time-tested, but it has become very popular today because of the huge amount of data and the need to process it fast.
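To make the top-k bookkeeping concrete, here is a small illustrative sketch of the "can it displace anything in the current top 10" logic. For self-containment it uses an exact Counter as the frequency estimator; in the real feature the estimate would come from the count-min sketch instead, trading exactness for constant memory.

```python
from collections import Counter

def track_top_k(stream, k):
    # Stand-in estimator: an exact Counter. Swap in a count-min sketch's
    # insert/estimate here to get the constant-memory version.
    counts = Counter()
    top = {}                           # current top-k candidates -> estimated count
    for x in stream:
        counts[x] += 1                 # "insert into the sketch"
        est = counts[x]                # "query the sketch"
        if x in top or len(top) < k:
            top[x] = est               # refresh or add
        elif est > min(top.values()):
            weakest = min(top, key=top.get)
            del top[weakest]           # one element gets displaced
            top[x] = est
    return sorted(top.items(), key=lambda kv: -kv[1])

print(track_top_k(list("abracadabra"), 3))   # [('a', 5), ('b', 2), ('r', 2)]
```

With k fixed at 10, this extra structure holds at most 10 values regardless of how many distinct values the stream contains.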
All right, coming to the second problem: how do you keep track of cardinalities? If you had used a hash map, you could have tracked both frequencies and cardinalities, but we didn't store the values, so it will be very difficult to use the count-min sketch itself to track cardinalities. For that we'll look at another data structure, and before we go to the data structure itself, let's build an intuitive understanding.

This time our hash function is a little different. It does not output integer or long values; it outputs a real number between 0 and 1 for every incoming element x. The good thing about H(x) is that it distributes all the elements uniformly across the number line from 0 to 1. So if you insert 10 distinct values, they will lie, on average, at a distance of about 0.1 from each other. This property gives us a very good insight into how to compute cardinalities: we can simply keep track of the smallest hash value seen so far. Because H(x) distributes the values uniformly, that smallest value approximates the average gap between adjacent elements. To estimate the cardinality, you divide the total range, which is 1, by that average gap of 0.1, which gives you a cardinality of 10. In actual fact we had eight elements, so this is quite near the true value.

Of course, you will say that keeping track of just the smallest value is unreliable; it can carry a huge bias. So what we can do is go further and keep track of the k smallest values. Even this is not exact: in this case, say the kth smallest value seen is 0.3; you plug it into the formula (with k = 3, the estimate is (k − 1) divided by the kth smallest value, i.e. 2 / 0.3) and you get about 6.7, which is nearer to the true cardinality than the estimate in the previous example. That is the basic idea with which cardinalities are estimated by approximate algorithms.

Let's go ahead and change our hash function once more. This time, say our hash function outputs a bit string in which each bit has a 50% probability of being 1 or 0. Then there is a 50% chance that any given hash value has a 1 in the first position, with anything after that; a 25% chance that it has 01 as its first two bits, with anything after that; and so on. What this tells us is that if we hash four unique things, we expect at least one of them to begin with 01, and if we hash eight unique things, we expect at least one of them to begin with 001. Conversely, let us invert that reasoning: if, across all the hash values we have seen, the first 1 bit has appeared as deep as position two, we can say there are about 2 to the power 2 distinct elements; if as deep as position three, about 2 to the power 3 distinct elements. I am counting positions from the left-hand side, starting at one. But again, skewed data might give you wrong estimates, so what we do is maintain multiple estimates and then combine them into a single cardinality value. What is basically done is that the first k bits of each hash value are used to index into a particular bucket, and the remaining m minus k bits are used to do what we just saw and produce a per-bucket estimate. Then we combine the estimates from all 2 to the power k buckets to get the final value.

Let me give you a more intuitive example if this doesn't make sense. Say one of you comes here and tells me that he has been flipping a coin since morning, and the longest run of heads he got was just two. Then I can assume he hasn't flipped the coin very many times. But if someone else says the longest run of heads he got was 100, I can easily conclude that he has flipped the coin a very large number of times, because a run of 100 consecutive heads is a very rare event, and it can only happen when the underlying operation has been performed a huge number of times.

The data structure we just saw is known as HyperLogLog. It is widely used in industry today to compute cardinalities.
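As an illustration, here is a simplified HyperLogLog-style sketch in Python: some bits of the hash pick a bucket, each bucket remembers the deepest first-1-bit position it has seen, and the buckets are combined with the standard harmonic-mean estimator and published bias-correction constant. The small-range and large-range corrections of the full algorithm are omitted, so treat this as a sketch of the idea rather than a production implementation.

```python
import hashlib

class SimpleHLL:
    """Simplified HyperLogLog-style estimator: the low b bits of the hash
    pick a bucket, each bucket keeps the deepest first-1-bit position seen
    in the remaining bits, and buckets are combined with a harmonic mean.
    Range corrections are omitted."""
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b                    # number of buckets (2^b)
        self.registers = [0] * self.m

    def insert(self, value):
        h = int(hashlib.sha256(str(value).encode()).hexdigest(), 16)
        bucket = h & (self.m - 1)          # low b bits choose the bucket
        rest = h >> self.b                 # remaining bits feed the estimator
        rank = 1
        while rest & 1 == 0 and rank < 64: # position of the first 1 bit
            rank += 1
            rest >>= 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias-correction constant
        harmonic = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / harmonic

hll = SimpleHLL()
for i in range(50_000):
    hll.insert(f"session-{i}")
print(int(hll.estimate()))   # close to 50,000, typically within a few percent
```

With 2^10 buckets, a full implementation needs only about a thousand small registers, which is why it fits in a few kilobytes; the typical relative error scales roughly as 1.04 divided by the square root of the number of buckets.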
So coming back to my demo, let me set some context. There were multiple fields for which we were keeping track of cardinalities, and within each field we were keeping track of the top 10 values by frequency. What we did is keep a count-min sketch data structure and a HyperLogLog data structure corresponding to each of these fields. Let's see the final amount of resources that we used.

We allow at most 200 fields in the output data. For the count-min sketch we chose a width of 100 and a depth of 6, and that works very well for our data distribution and for 1 million results. For the HLL, the HyperLogLog data structure, we chose an error bound of 0.02%; you can build the HyperLogLog such that you give it the error percentage and it gives you back the values of k and m, and that translates to 3 kilobytes. So for 200 fields we need 200 megabytes of memory, and since, as I said in the beginning, we run 10 concurrent queries on a single machine, that translates to 2 gigabytes. So now we can actually do all these operations in memory. And there was just a 7% increase in CPU; we also said we wanted a data structure that is not CPU intensive, and this 7% increase comes from computing the hash values. All right, that's what I had. Thank you very much. I can take questions if there are any.

Hi, my name is Sanket. I have a question on two slides, the one where you compute the width and the depth. Can you get that slide, please? This one, right? Maybe I got lost at this point, but I want to know how you came up with this formula; I didn't get the depth part.

Yeah, that's what I said: we won't go into the math behind it here. I just presented the results; there is some rigorous math you need to go through to derive them. We can discuss it offline.

OK, fine. And another one: you said that for such a hash function, there is a 50% probability of having a 1 in the most significant place, and as you go down you get 01, and so on. Then in the next slide you said that if we hash four unique things, at least one of them will start with 01. But that is again only with some probability, right?

Yes, but we also have the property that the hash function distributes the incoming elements uniformly across the number line. So if 01 is the leading pattern of the smallest value you see, there is a good probability that there are around four elements.

OK, cool, thanks.

On the same hash function: you're saying it will distribute the elements uniformly. How will a hash function do that? A hash function doesn't have that much information, right?

Yes, so it's not just about a single hash function; you try to choose good hash functions, but you can definitely not rely on just one. That's why it's not a single estimate: we use multiple hash functions and multiple estimates.

It won't be a static manipulation of the data which gives you a unique hash, right? It would have to have feedback about how many numbers have already arrived, and based on that decide what the frequency of this particular element is. How will you find out how many times an element has occurred if you don't have all the previous elements stored?

That's why we have this data structure. It is a sketch, a probabilistic data structure. It will not give you exact values, but by choosing appropriate parameters it will give you a value that is very near the actual one, and you can of course tune the parameters. Hello.
Yeah, maybe I have understood it wrong, but what I understood from the slides is that you have a two-dimensional structure where you are storing all the hashes and finally merging it across all the nodes. Take, for example, that you pass everything through the hashes and the element is present in the first node.

You're talking about the Bloom filters?

Yeah. The length which you are deciding: say you have decided on 1,000, or you take 2,000. There is a high probability that the element you are going to count does not fall within that range, or falls right at the end, and in that case the result we get will probably not be correct.

So you are asking what happens if it spills out, if the hash function produces an index that is outside the array? That's why you choose a hash function which outputs values between 0 and m minus 1, where m is the size of your sketch.

Take, for example, a user coming to a website, and I just want to count users who come online. If it is a real-time kind of thing and you are checking the cardinality, you want the top-10 kind of result...

You are talking about the top 10 now; I thought you were talking about the Bloom filter.

If you're taking a top-10 kind of thing, what is going to happen in that case? That user might get skipped, because there are n users already present, and within a millisecond you are losing his data, isn't it?

We are not losing the data; for every incoming element we insert something.

But at which position in that array of 1,000 will you increment?

We have k different hash functions, and there is an array corresponding to each hash function. For each hash function you get a different index, you go to that index and increment the value, and then you take the minimum of the values you get from the different arrays.

Hey. You said multiple hash functions help, but won't a single bad hash function skew the entire result set, or at least increase the error margin?

A single bad hash function can potentially cause that, yes, and that is one of the reasons we use multiple hash functions.

But if you have, say, two hash functions which give quite similar results, the counts will be totally skewed.

You are talking about the case in which they are dependent on each other. That's why, in the case of Bloom filters, you have to make sure the hash functions are mutually independent, and in the case of the count-min sketch you have to make sure they are pairwise independent. If you do the derivation, those two properties are what give you the final error rates we derived.

Hey, Himadri. Why was there a need for a two-dimensional array? You started with one-dimensional and then...

Okay, right. So this comes from the fact that in the case of Bloom filters we need more rigorous hash functions; they are mutually independent. In the case of the count-min sketch, the hash functions are only pairwise independent.
If you chose just a single array, there would be a lot of pollution of one counter because of the other counters.

No, but you could have...

There are actually variants of Bloom filters which use 2D sketches as well.

No, no: just to avoid the collisions you used that two-dimensional array, but you could have had a bigger one-dimensional array that would have given the same thing.

Yeah, that's what I was telling you: in the case of Bloom filters, the hash functions have one more property that I didn't want to go into in verbose detail, which is why I didn't present it. The hash functions in a Bloom filter are mutually independent, and it is more rigorous to build such hash functions than the ones in a count-min sketch, which are just pairwise independent. Pairwise-independent hash functions are easier to build, but they do not work very well if the space they are updating is shared. That's why we have separate spaces, different counter arrays, corresponding to each hash function.

Hi. I was curious about the error bound for Bloom filters, the function that bounded the error. What parameters does it depend on?

There were three parameters: n, m, and k. n is the total number of elements you expect to see in the stream; you have to estimate that somehow, because you have already seen your data. And m is proportional to the amount of memory you can give to this problem.

OK. You also talked about combining results from various machines to find the counts. Does the error bound stay the same? How does it vary with the number of machines?

Right. When we were talking about sketch data structures, I said that you have multiple estimates and you combine them. When you combine them, you have to make sure that each individual estimate was built with parameters such that, after combining, you get the desired result. So you have to estimate the value of n upfront: after combination, what will n be? That is what you have to use for the individual sketches.

OK, thanks. And the total number...

Sorry, m is the size of the bit array; it depends on how much memory you can give to it. If you increase m, you use more memory and your error rate decreases.
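As a footnote to that last exchange, here is what a textbook pairwise-independent hash family looks like in code, the kind the count-min sketch analysis assumes: h(x) = ((a·x + b) mod p) mod w for a prime p and randomly drawn a and b. The prime, the width, and the use of Python's built-in hash to turn an object into an integer are illustrative choices, and taking the final mod w makes the distribution only approximately uniform, which is the usual practical compromise.

```python
import random

# Classic pairwise-independent family: one (a, b) pair is drawn per row
# of the sketch, giving one hash function per counter array.
P = 2_147_483_647          # a Mersenne prime, 2^31 - 1

def make_hash(width, rng=random):
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    # hash(x) maps an arbitrary object to an integer before the arithmetic.
    return lambda x: ((a * hash(x) + b) % P) % width

rows = [make_hash(width=100) for _ in range(6)]   # one hash per row, d = 6
print([h("customer-566") for h in rows])          # six independently chosen indices
```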