I heard the original version of this talk at ApacheCon, and the speaker who gave it was going to do this one but couldn't make it, so we'll see how well this goes. Anyway, this is me, if you want to get hold of me later. We're going to talk about DataSketches: we'll cover the conceptual framework, I'll go through a case study, we'll look at quantile processing in a little more depth than anything else, and then we'll look at some of the other algorithms, plus some additional information and references.

So what are data sketches? First off, it's currently an Apache incubator project, which means the code has been donated to Apache, and Apache is going through the process of making sure all the licenses meet the Apache standards, things like that. But it is in production; it's used by a number of companies in production today. The library itself is a collection of stochastic streaming algorithms called sketches, which is the general term in the industry. These are processes that deal with very large amounts of data in a very small space, and they give you estimated values with mathematically provable error ranges. The Apache sketches are available in Java, and there are C++ and Python versions as well.

They're fairly recent in development. This field is also called approximate query processing. The idea is that you extract what you need from the data in one pass, so you're not going to handle your data multiple times. The term "sketch" really comes from the idea of an artist's sketch, where you get a sense of what the landscape looks like without having every little detail. Data sketches do much the same thing: they give you an idea of what the data landscape looks like without giving you every little detail.

Some common properties of these sketches. They are one-touch, that is, you don't process the data more than once. They're mergeable, so you can run the processes on different streams and later merge those streams together. They are sublinear, meaning they start out at a size much smaller than the data set you're looking at, and as your data grows, the size of the sketch does not grow linearly with it; the sketch stays much smaller than the data stream. And the query results are approximate, but they come with well-known error bounds, and you can make decisions early on to trade size for error (I'll show what those bounds look like in code in a second). They're also designed for large-scale processing. And from Apache you get what you'd expect: it's Maven-deployable, the unit tests and testing tools are all there, and you get the documentation and developer resources that come with any Apache project.
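To make the error-bound idea concrete, here is a minimal example using the library's HLL count-distinct sketch, which we'll come back to later. This is my own illustration rather than slide code; the class and method names are from the org.apache.datasketches Java API, and the stream and the lgK parameter are made up for the demo:

    import org.apache.datasketches.hll.HllSketch;

    public class BoundsExample {
        public static void main(String[] args) {
            // lgK = 12 caps the sketch at a few kilobytes, no matter how large the stream gets
            HllSketch sketch = new HllSketch(12);

            // feed one million distinct "users"; the sketch size stays fixed
            for (int i = 0; i < 1_000_000; i++) {
                sketch.update("user-" + i);
            }

            // the estimate comes with provable bounds, here at 2 standard deviations (~95%)
            System.out.println("estimate: " + sketch.getEstimate());
            System.out.println("lower:    " + sketch.getLowerBound(2));
            System.out.println("upper:    " + sketch.getUpperBound(2));
        }
    }

Choosing a larger lgK buys a tighter error band at the cost of a bigger sketch; that is exactly the size-for-error trade-off.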
So let's look at a case study, a project called Flurry. Flurry was a system that let mobile app developers manage their products. There are more than 250,000 mobile app developers on the system, producing on the order of 40 to 60 terabytes of data a day. As a mobile developer, you want to know: how many visitors did I have? How much time did they spend on various parts of the app? What were they doing?

Before data sketches were implemented, running those kinds of queries took about 80 billion virtual-core-seconds, spread across all the processors they had. If you wanted to know at the end of the day how many unique visitors did such-and-such a thing, it took two to eight hours to run those queries for the app developers. That isn't sustainable: as an app developer you really want to know more frequently than that, and it would be really nice if you could do it in real time. If you asked the same questions over weekly data, it took even longer, days, to get an answer. Once they implemented data sketches, the cost dropped to about 20 billion virtual-core-seconds, so it's significantly cheaper to operate, and on top of that it's much faster: you can now answer these questions in about 15 seconds rather than hours or days. The trade-off is that the earlier system gave you exact counts and the new one gives you approximate counts. You're trading exactitude for speed and cost.

Right, so we'll look at the quantile sketch first, and at how it works in some depth. First of all, it's an estimate of a distribution you don't know in advance. If you have values you can compare, anything you can put in an order, then you can run a quantile sketch. This could be the amount of time people spend on a page, for example. What you get back is a fractional rank: for any value, the fraction of the stream that falls at or below it. And you can transform that rank view into a second form, a histogram, which is sort of what your management wants to look at. The people who want these reports like those nice bell-curve charts that show "users spent about this much time." That's the kind of thing you can get out of a quantile sketch.

The Java code to do this is fairly straightforward, and I'm actually going to do a couple of things here, so we're going to create two sketches. These could be two data streams where you're looking at what people are doing on two different systems over the same hour, and you want to merge them together; or one could be the data stream for day one and the other the data stream for day two, so that when I merge them I have the data for two days. You can do either of those things with this arrangement.
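Here is roughly what that example looks like. I'm reconstructing it rather than quoting the slide, so treat the details as a sketch of the org.apache.datasketches.quantiles API; the Gaussian test data and the bin boundaries are stand-ins:

    import java.util.Arrays;
    import java.util.Random;
    import org.apache.datasketches.memory.Memory;
    import org.apache.datasketches.quantiles.DoublesSketch;
    import org.apache.datasketches.quantiles.DoublesUnion;
    import org.apache.datasketches.quantiles.UpdateDoublesSketch;

    public class QuantilesExample {
        public static void main(String[] args) {
            // two independent streams: two systems over the same hour, or day one and day two
            UpdateDoublesSketch time1 = DoublesSketch.builder().build();
            UpdateDoublesSketch time2 = DoublesSketch.builder().build();
            Random rnd = new Random();
            for (int i = 0; i < 500_000; i++) {
                time1.update(rnd.nextGaussian());        // stick the double values into the sketch
                time2.update(rnd.nextGaussian() + 0.5);
            }

            // each sketch serializes to a compact byte array for a file system or a database...
            byte[] bytes1 = time1.toByteArray();
            byte[] bytes2 = time2.toByteArray();

            // ...then read them back and merge them with a union
            DoublesUnion union = DoublesUnion.builder().build();
            union.update(Memory.wrap(bytes1));
            union.update(Memory.wrap(bytes2));
            DoublesSketch merged = union.getResult();

            // minimum, median (rank 0.5), and maximum of the combined stream
            System.out.println("min:    " + merged.getMinValue());
            System.out.println("median: " + merged.getQuantile(0.5));
            System.out.println("max:    " + merged.getMaxValue());

            // histogram bins: (-inf, -2), [-2, 0), [0, 2), [2, +inf)
            double[] splits = {-2.0, 0.0, 2.0};
            double[] pmf = merged.getPMF(splits);   // fraction of values in each bin
            double[] cdf = merged.getCDF(splits);   // cumulative fraction up to each split
            System.out.println(Arrays.toString(pmf));
            System.out.println(Arrays.toString(cdf));
        }
    }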
So what we've got here: for the first data stream, we go through and build a quantile sketch, the time1 sketch, and likewise a time2 sketch for the second stream. We're simply taking the double values coming out of the stream and sticking them into the sketch. Then, presumably, we've stored them somewhere. In the second part of the example we load the time sketches back (there are simple mechanisms to read and write them from file systems or databases), create a union, and put the two sketches into it. Now we've got a sketch that is the result of the union of those two sketches. Like I said, that new result is either the two days together, or the one hour across the two systems.

Then we can print out the results. In the first case we're looking at the minimum, the median, and the maximum value: rank runs between zero and one, so we ask for the quantiles at rank zero, zero point five, and one, and that gives us the minimum, median, and maximum. In the second case we want to know how many values fall into each range: from negative infinity to minus two, from minus two to zero, from zero to two, and from two to infinity. Breaking the data down into those bins gives us a histogram over those ranges; we simply print it out and we'd get a chart showing that. And finally we can do the frequencies as probabilities, the cumulative view, where again we're breaking the data down and building a frequency histogram at the end.

Okay, so some of the other sketches. First there's count distinct: this is where you want to know how many unique visitors you had, or how many times a unique event happened. There are a couple of sketches available for that. You've got the theta sketch family, whose main feature is that you can run it either on or off the Java heap, and it provides three set operations: union, intersection, and difference (A-not-B). If those operations are important to you, you probably want the theta sketch. The HyperLogLog (HLL) sketch, the second type here, is much smaller, so if you're concerned about doing something in a very small space, perhaps on a mobile device, you might want HLL. But it can only do union operations, so if you need the other operations, it isn't going to work for you. And finally there's a new one that just came out, the CPC sketch (Compressed Probabilistic Counting). I don't know much about it, but I've been told it's smaller and more accurate; it provides the best accuracy for the amount of storage you're using, so it's better than the theta sketch in that respect.
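As an illustration of why you'd reach for theta when you need set operations, here's a small example, again mine rather than the slides', using the org.apache.datasketches.theta classes:

    import org.apache.datasketches.theta.Intersection;
    import org.apache.datasketches.theta.SetOperation;
    import org.apache.datasketches.theta.Union;
    import org.apache.datasketches.theta.UpdateSketch;

    public class ThetaExample {
        public static void main(String[] args) {
            // unique visitors seen by two servers, with a 50,000-user overlap
            UpdateSketch serverA = UpdateSketch.builder().build();
            UpdateSketch serverB = UpdateSketch.builder().build();
            for (int i = 0; i < 100_000; i++) serverA.update("user-" + i);
            for (int i = 50_000; i < 150_000; i++) serverB.update("user-" + i);

            // union: uniques across both servers (expect roughly 150,000)
            Union union = SetOperation.builder().buildUnion();
            union.update(serverA);
            union.update(serverB);
            System.out.println("union:        " + union.getResult().getEstimate());

            // intersection: uniques seen on both servers (expect roughly 50,000);
            // this is the operation HLL can't give you
            Intersection inter = SetOperation.builder().buildIntersection();
            inter.update(serverA);
            inter.update(serverB);
            System.out.println("intersection: " + inter.getResult().getEstimate());
        }
    }

The difference operation follows the same pattern via SetOperation.builder().buildANotB().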
Next we have the frequent items sketch; this is also called heavy hitters. If you're running a shopping site, you want to know what people are buying right now. Those are your heavy hitters, so you might put them at the top of the page and try to get more people to buy them, something like that. That's what the heavy-hitters sketch does, and again, this is a case where the bigger the sketch, the more heavy hitters it will be able to detect. With a very small sketch you might only get two or three, depending on what the distribution looks like in the actual data.

Then we have the tuple sketch, which takes the theta sketch and extends it so you can attach other data to it. So instead of just a unique count, you can say: unique users, plus how many times each of them clicked on this thing, that kind of associated data. That's what the tuple sketch will do for you.

And then we have sampling, for the reservoir sampling problem, where you're looking at a stream of data and you want a good random sample across the whole stream. The sampling sketch will do that, and it does it with objects, not just numbers, so you can look at objects coming across the stream and do some work with them.

Frequent directions does singular value decomposition, which is something I don't understand, actually; I've looked it up and I still don't understand it. But my understanding is that it's used in finding similar items, like a shopping comparison site saying "somebody who bought this also bought that; maybe you want to buy it too." That's the kind of processing this would be useful for. If you know the term singular value decomposition, you'll understand what this slide means.

Then there are a number of places at Apache where you can get information on the project. I'll end by thanking Lee Rhodes, who couldn't be here today but who spent an awful lot of time with me on the phone trying to get me to understand how this works so I could come talk to you about it. The other gentlemen listed here are the people who have helped develop DataSketches over the last couple of years. And I've got a list of references; these slides are available on the FOSDEM website, linked from this talk, so if you want to come back and get any of this information, feel free.

And that's it for me; I have 30 seconds left, so there's time for one question. No? The microphone won't come on, I'm afraid. Well, I'll be around afterwards if anybody wants to talk or ask questions.