So thanks, everybody, for coming to the data session generally, but also to mine in particular. I'm going to talk a little bit today about anomaly detection on streaming data, which is something I'm really interested in; I like doing streaming analytics, and it's fun for me. If you hear what I'm saying and you think it's cool, you can tweet it. If you hear what I'm saying and you think it's crazy and this guy doesn't know what he's talking about, keep that one off Twitter; nobody on Twitter will know what's going on, but we'll know. In here, it'll be for us. Credit to Josh Wills, by the way, for that one. I thought it was funny, so I put it at the top.

A little bit of background on me: I studied applied mathematics years ago, but I've been interested in technology for a long time. I was actually doing programming before I went to school to study math, and the reason I studied math was that it was harder than programming, which is kind of weird. I've worked at a lot of startups over the years, in addition to some bigger companies, so I've had the experience of building new things at growing companies and also of taking care of big existing things at established companies, and both come with their own challenges. And I've worked in all kinds of industries: financial services for 10 years, oil and gas for a while, health care, e-commerce, online travel. So I've had a very broad experience in terms of what I've seen over the years, almost 20 years of doing this, I guess. Can you hear me in the back? Okay? Good. Great.

The idea for the talk today is about time series specifically. There's this nice article, or whatever you want to call it, saying that time series is the new big data. I think it's from MapR; they put out a lot of stuff like this because, of course, they're trying to sell you things to analyze time series with, but they have a point. You also see this in the last couple of years in some of the infrastructure tooling that's come out. Has anyone here heard of Apache Kafka, for example? A couple of people. So Apache Kafka is basically a log system, if you want to think about it that way, where you stream logging events into it. It's append-only, so you just push stuff in, and then other consumers can hook up to it and read the data back out. Each of those events is effectively a time series event: something that's happening temporally, in some kind of order. We won't get into ordering in distributed systems or anything like that today.

The whole big data movement, though, and I'm not a huge fan of that term, just for the record; I'm one of those data people who is a little uneasy with the term big data, same as with IoT, the internet is already an internet of things and I don't know why we need another term for it, but hey, I'm not the marketing guy, I'm the tech guy. That's my side of it.
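This isn't part of the talk's demo, but to make the Kafka picture concrete, here's roughly what reading such an append-only stream back out looks like. This is a minimal sketch using the segmentio/kafka-go client; the broker address, topic, and consumer group name are placeholders for illustration:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	// Placeholder broker, topic, and group -- the demo later in this
	// talk uses plain UDP, not Kafka; this just illustrates the
	// "append-only log of time-ordered events" idea.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		Topic:   "metrics",
		GroupID: "anomaly-detector",
	})
	defer r.Close()

	for {
		// Each message is effectively a time series event: a payload
		// plus the timestamp and offset it was appended with.
		m, err := r.ReadMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s @ %v: %s\n", m.Topic, m.Time, m.Value)
	}
}
```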
What I'm seeing, especially over the last four to seven years, is a lot more emphasis on time series analytics and processing in the abstract, rather than the transaction-processing kind of work we saw in database systems in the 90s and early 2000s. People are thinking about the problem more generally. Instead of asking how to build OLAP cubes to do analytics on sales or something like that, you're seeing systems come out that are designed to process time series events, and I think this is only going to become more common. As we get more sensors producing more streams and sending data to more places, we're going to need more advanced ways of dealing with time series data; I don't think there's any way around it. And there's some quite cool stuff going on in the research literature: if you want to read research papers on this, there's amazing work on compressed sensing and all kinds of ways to analyze this data efficiently. We're going to talk about one particular approach today as an example, but there's a lot going on out there.

The problem we're trying to solve, generally... let me adjust my HDMI. I can't adjust my HDMI, but I'll tell you what I can do: I can get rid of my red-shifted video. Who wants blue light in their screen? I don't. Okay, better. Cool.

So the problem we're trying to address today is actually this one. If you have some stream of data, some time series, whatever, you want to be able to detect when things change. As humans, we can look at that plot and say, okay, that's where it changed; it's very clear, right? But if you have half a million or a million of these different time series, there's no way a human can look at them all. You need a machine to do it. Sure, you can build a nice wall of screens. Who here works at a company that has a lot of dashboards up on screens? Wow, a lot of hands. And how often does anyone actually look at them? Never, right? Maybe for the first couple of days. Do you have alerting, Nagios or something like that? Okay. And how many of those alerts have an inbox rule that sends them straight to the archive, to the garbage? Wow, you're brave. Okay.

So, as this gentleman is saying, the problem is that there are a ton of these metrics and no good way to deal with them. A lot of people do alerting; threshold alerting is one example I'll talk about in a little bit. But this is the problem, and it's a big one. It's not a new one, though: if you look at companies like Singtel, or at infrastructure providers, they've been dealing with this for a long time. So it's not a terribly new thing, but it's becoming a bigger thing as more devices produce more streams. That's the problem: we want to identify weird stuff that happens in data streams. Everybody cool with that?

Now, there are things out there that do this kind of monitoring, alerting, and anomaly detection. Raise your hand if you use or have used one of these at your company. Graphite?
Yeah, okay. So these things are all out there, and some of them are great for different purposes. Nagios, for example, is more of an alerting tool; Prometheus is a metrics and monitoring system. Then there are frameworks for anomaly detection: BreakoutDetection is one of them; Atlas; Skyline, which came out of Etsy, is written in Python and is quite interesting, though no longer maintained, and we'll talk about it more in a little bit; Bosun is one from Stack Exchange. So companies are building this stuff; it's a real problem they're trying to solve.

The issue is that a lot of these are really heavy systems, and it's a real pain to set them up. I don't know if any of you have actually had to stand these up, but even Graphite is not easy; there are a lot of moving parts. And some of these open source anomaly detection systems are built on top of things like OpenTSDB, which requires HBase, which requires Hadoop, which requires ZooKeeper, which requires HDFS, and before you know it you have an entire infrastructure team just so you can do your anomaly detection. You're laughing, but it's true. I've worked at companies where it snowballs: somebody, usually a well-meaning product manager, says they want some kind of metrics about their product, it spirals out of control, and sooner or later you're building this anomaly detection system with a whole big infrastructure team supporting it.

So it's a real thing, and the problem is that people are coupling the analytics on the time series with the anomaly detection. They're coupling the storage part, "we want to store every time series we've ever generated," with the anomaly detection, and you don't necessarily need to do that. If you're analyzing the stream as it's coming in, you can separate the storage and the historical analysis from the real-time anomaly detection. Decoupling those things is a good step, and we'll talk about that a little more.

This is an example of what I mean when I say heavy. You probably can't see the diagram, but that's okay; it's confusing anyway. It's a pretty basic overview of how you would set up Graphite, which is a system that collects metrics from your machines, your software, whatever, and gives you some graphs about performance and behavior. All these metrics come in, they go into a component called carbon-relay, which splits them out across different clusters, which then write them into a database called Whisper, and we'll talk in a minute about why that's problematic, and then there are some web applications on the back end so you can actually interact with it. Just setting this up, you can imagine, is kind of painful.

The other problem is that it doesn't really scale up very well at all. Some companies can get by with it, and if you can, awesome: don't worry about anything else, because you're in a good place. But for companies with a lot of streams or a lot of data, this system will really fall over, and when it falls over, it doesn't just chug along a little more slowly. It really falls.
It's not good, and a lot of the reason is that the Whisper storage format keeps one file per time series. When you have half a million different time series coming in at once, the disk activity just from touching all those files is not sustainable. So that's one example, and maybe the simplest one, of the kind of system you have to build if you want to solve this problem with the systems available now.

I mentioned Skyline, the anomaly detection system from Etsy. Notice that the nice clean box labeled "Graphite" in this diagram is that whole architecture from before, so don't be fooled: inside that box is everything we just looked at. Then you have Skyline itself, which actually does the anomaly detection. As a system, the design is pretty cool; I like it, it's pretty straightforward. They don't maintain it anymore, but it's still up on GitHub if you want to look at it. It has a few components: a Horizon agent that takes everything in from Graphite; an Analyzer agent that runs through the time series and does the analysis; Redis as an intermediate store for the data; and a nice web front end you can hook up if you want to look at some of the data.

So in theory we're getting simpler, right? The previous diagram was really complicated and this one looks a lot cleaner, but it's still fairly tough to set up. The other issue with Skyline is that it also doesn't scale up. When I was at Skyscanner, for example, we tried to use it for anomaly detection, and turning on just a fraction of the metrics from our data centers completely killed it. It just didn't work at all.

So I started experimenting with other ways to do this, and here's what I want: I want metrics that go into a box, and that's it. I don't want anything else. I want one executable. The Spark guy is smiling back there; it's not distributed, but trust me, it's cool. This is what I like. If you need distributed systems, they will save your world. If you don't need them and you're using them because you think it might be fun, it's not going to be fun; the fun is not going to outweigh the setup cost. So if you really need them, use them, and life will be good for you. If you don't, avoid them. This is where we want to go: from that first architecture, past the Skyline one, which seems simpler but isn't, down to this, which is about as simple as it gets.

The question, then, is how we can do this analysis on these streams, because we don't want to store a lot of the data; storing it is expensive, and if you have a million different metrics, it starts taking up space. So you really want to compute things as you're receiving them. One way you could do it, don't do this, but one way you could, is to just assume that your data is normally distributed, Gaussian, all that nice stuff. Did anyone here besides Alexander study math? Really? You majored in mathematics? Okay. Anyone else here major in mathematics, in all of Europe? Just us. Wow. Okay.
So the idea here is that you compute the standard deviation of some time series, but you do it incrementally as the values come in. You get a value, you update your mean and your standard deviation, and you toss the value away. The only things you keep are those summary statistics.

The problem with that is you're making some extremely strong assumptions about your time series. Really strong, like the assumption that it's normally distributed, which it probably isn't. It may not even be unimodal, in the sense that it may have not one peak but two, in which case this will be completely wrong. But as an example it's quite useful, and I'll show you why. So don't do it, but it's useful for our purposes now. It will detect those point changes, the step changes and the spikes. The problem is that the approach is really sensitive to outliers, which is kind of a silly property for something you're using to detect outliers, because an outlier will change your distribution and shift the parameters. If you have a streaming mean and one value spikes up really high, it's going to shift the mean in a way you don't want. But it will still work for a demo.

Do I have time? You're all so captivated you're not tracking time. We have about three to five minutes? Okay, cool. Wow, that's really small. Okay, that's also not working. A little better. I think for all purposes it'll do. Can't see the code too well? Okay, that's better.

So I did a little mock-up in Go, because I've been playing with Go a lot recently. There's a test client that generates Graphite-compatible metrics; that's really all it does. It takes a random string for the metric name, a value, and a timestamp, nothing crazy. And on the other side I have a receiver that just listens on UDP, for reasons we can talk about later if you want, and when it adds a metric, it updates the mean and the standard deviation and all that jazz as it goes. So we can do the anomaly detection in a completely streaming way: it stores no values. This is the important part when you have super-high-cardinality time series data, like a million distinct series: the less you store, the more you can analyze. And if you're doing this online, the analysis happens as the data is being received, so you don't need a separate thread doing the analysis. (There's a small sketch of the update right below.)

As a quick example, I'll start the test client producing normally distributed values with mean zero and standard deviation one, limited to five metrics so you can actually see them; that's why you're seeing only five things. You can see the mean and the standard deviation being updated: numerically close to zero and numerically close to one, and we're not keeping any values. That's the important part to keep in mind; it's computed as the data streams in. So we're pretty close to where we expect to be.
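That streaming update is the heart of the receiver. Here's a minimal sketch of the idea; the names and the three-sigma check are illustrative rather than the demo's actual code, and the one-pass variance update is the one usually attributed to Welford, which comes up again in the Q&A:

```go
package main

import (
	"fmt"
	"math"
)

// Stat holds running summary statistics for one metric -- no raw
// values are stored, which is what makes high cardinality feasible.
type Stat struct {
	n    int64
	mean float64
	m2   float64 // running sum of squared deviations from the mean
}

// Add folds one observation into the running statistics using
// Welford's one-pass update.
func (s *Stat) Add(x float64) {
	s.n++
	delta := x - s.mean
	s.mean += delta / float64(s.n)
	s.m2 += delta * (x - s.mean)
}

// StdDev returns the running (population) standard deviation.
func (s *Stat) StdDev() float64 {
	if s.n < 2 {
		return 0
	}
	return math.Sqrt(s.m2 / float64(s.n))
}

// Anomalous flags a value more than k standard deviations from the
// running mean -- the naive Gaussian assumption from the talk.
func (s *Stat) Anomalous(x, k float64) bool {
	sd := s.StdDev()
	return s.n > 1 && sd > 0 && math.Abs(x-s.mean) > k*sd
}

func main() {
	s := &Stat{}
	for _, x := range []float64{0.1, -0.3, 0.2, 0.0, -0.1, 8.5} {
		// Test each new value against history, then fold it in.
		if s.Anomalous(x, 3) {
			fmt.Printf("anomaly: %v (mean %.3f, sd %.3f)\n", x, s.mean, s.StdDev())
		}
		s.Add(x)
	}
}
```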
Now I'll stop the test client and make a change so the values are wildly different. I don't know if you can see it, but what I've done is basically make the variance of the data super wide, so you're going to get a lot of outliers now. And if I restart the test client, you'll see those arrays on the end start to fill up with anomalous data. So we've just detected exactly the kind of things we said we wanted to in the first plot; we've actually solved that problem this way.

Now, you don't want to do this in practice, because there are much better, more numerically stable and appealing ways to solve this time series problem. For example, if you do it this way, your mean will shift over time. It's better to have some kind of sliding window with decay inside it, exponential decay or something, so you keep a little bit of history, but not too much, and your statistics stay representative of the data currently being generated. (There's a small sketch of that idea right after this.) But for simple proof-of-concept work like this, it's really easy and really fast. You don't need all of the super-heavy frameworks that are out there. You don't need, where's our nice Graphite diagram, you don't need that to do this. It's about 150 lines of Go code; it's not a big deal. And I've tested it at over 80,000 events per second on my laptop, with a cardinality of 500,000 time series, so 500,000 distinct time series metrics at 80,000 events per second on a super old MacBook Air. It's not that hard to do; you get tons of performance out of single-machine analytics if you do it effectively, if you do it online. So if you don't need a cluster, don't go for the cluster. If you need the cluster, talk to this guy; he's the expert. And I'll be happy to take any questions you have.
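On that sliding-window-with-decay point: one common way to get the effect is to keep exponentially weighted estimates of the mean and variance, so each new observation carries a fixed weight and older ones fade out. This is a sketch of the general idea, not code from the demo, and the alpha value here is an arbitrary choice:

```go
package main

import (
	"fmt"
	"math"
)

// EWStat keeps exponentially weighted estimates of mean and variance,
// so old observations decay instead of weighing on the statistics
// forever, as they do in the plain one-pass version.
type EWStat struct {
	alpha       float64 // weight given to each new observation
	mean        float64
	variance    float64
	initialized bool
}

// Add updates the decayed mean and variance with one observation.
// This is a standard EWMA-style update; the exact form varies a bit
// between implementations.
func (s *EWStat) Add(x float64) {
	if !s.initialized {
		s.mean, s.initialized = x, true
		return
	}
	delta := x - s.mean
	s.mean += s.alpha * delta
	s.variance = (1 - s.alpha) * (s.variance + s.alpha*delta*delta)
}

func (s *EWStat) StdDev() float64 { return math.Sqrt(s.variance) }

func main() {
	s := &EWStat{alpha: 0.05}
	for i := 0; i < 200; i++ {
		s.Add(float64(i % 2)) // alternating 0/1 stream
	}
	fmt.Printf("mean %.3f, sd %.3f\n", s.mean, s.StdDev())
	// A slow drift in the input would pull this mean along with it,
	// unlike the unweighted running mean, which remembers everything.
}
```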
First question: if you were building a production system for this, you would actually want some kind of notification that the anomalous event happened. What you do in that case is up to you as the builder of the system. You can store it locally in memory if you want, but that takes up memory you can't use for your other metrics, so what you'd probably do is fork off a thread or something to take the information and send an email, post a notice in Slack, put it in a database, whatever fits your system. Yes, you had a question too? Same question, okay.

Next: you're saying you have a system that generates data coming from two separate distributions depending on the time, and you're asking what you do in that case to know it's actually not anomalous data. Well, it would depend, at least a bit; my first question would be how often it switches. If it's switching really fast, you could think of them as independent time series, generated from two different distributions, and with fast switching you may get some mixing that the system can handle. If it's something where every five days or whatever it jumps from one distribution to a completely different one, that's a different problem. I don't know that I've ever encountered that in practice. Have any of you ever seen it? No? I've seen systems that have mixing, like you mean; that's usually not so hard to deal with. But a system that goes from one distribution to some completely different distribution, I don't know that I've ever encountered. Let's talk more about it after the session.

Any... yes? Sorry, I can't hear very well. That is for what? That's the best part: I don't store them, and that's one of the things that makes it so fast. Oh, if I actually wanted to store them, okay. So the question is how you store these time series if you want to do analysis on them later. This system was basically demonstrating that you don't need to combine the anomaly detection with the historical analytics of the time series: you can decouple those and get a lot of good performance out of the anomaly detection part, leaving the analytics off to the side.

There are a lot of systems out there you can use for storage, and it depends on how much data you have, what the time series values look like, and how fast they are. A popular one, as I mentioned, is OpenTSDB. If you need it, it can be really helpful, but you're going to have to do some work to get it going. They'll tell you that you don't: when you go to the HBase website, they'll say you can download this standalone version, and, see, you're laughing because you know. It's "so easy," and there's a nice XML config file that doesn't look too horrible, but if you want to run it in production, it starts to get nasty. So that's one option. If you don't have a ton of data, you can just use any regular database. People frown on this now because there are so many database options, but you can get pretty far with Postgres; don't underestimate Postgres. It's probably a good option, though it depends on your data volumes. There are some other databases that have come out recently, like InfluxDB, which is supposed to be tuned for this. You can even write your own if you want, with some log-structured merge-tree implementation or whatever magic, if that's interesting for you. There are tons of options out there. Are you smiling because you're thinking "never build your own time series database"? Yeah, it's not trivial. So my recommendation for data storage, for anything really, is that you're probably fine with Postgres. I've known companies, and worked at companies, where that's not true, but they're rare. The cases where people need something besides Postgres are much fewer than the cases where they want to use something fun that's new.
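If you did want that historical side, kept separate from the detector, something this simple goes a long way. This is a minimal sketch using Go's standard database/sql package with the lib/pq driver; the connection string and the metrics table schema are made up for illustration:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	// Hypothetical DSN and schema -- the point is only that a plain
	// relational table is often enough for metric history.
	db, err := sql.Open("postgres", "postgres://localhost/metrics?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS metrics (
		name  text             NOT NULL,
		ts    timestamptz      NOT NULL,
		value double precision NOT NULL
	)`)
	if err != nil {
		log.Fatal(err)
	}

	// The anomaly detector stays decoupled: it can fire-and-forget
	// rows here (or batch them) without blocking the streaming path.
	_, err = db.Exec(`INSERT INTO metrics (name, ts, value) VALUES ($1, $2, $3)`,
		"servers.web01.cpu", time.Now(), 0.42)
	if err != nil {
		log.Fatal(err)
	}
}
```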
Sorry, can it run on another machine? Yeah, that's basically what Skyline does: it actually runs on a cluster of machines, because its performance is a little bit problematic. It just takes the metrics from Graphite: when they come into Graphite, they get split off and sent to Skyline. And the system I demoed takes Graphite-compatible metrics, so if you wanted to, you could redirect your Graphite metrics into the demo I built. You wouldn't normally run it on the same machine, because in a larger deployment Graphite has some performance challenges, so typically you wouldn't run anything besides one of the Graphite components on that server.

Next question: a time series with a roughly constant number of views of a site, which changes over the day. So that's seasonality: what happens when you have daily or weekly or yearly or whatever seasonality. There are time series analysis methods you can use that account for the seasonality. You can also decompose the time series into a trend component, a seasonal component, and a noise component, and analyze those independently. There are quite a few algorithms that take care of this for you; one a lot of people use is Holt-Winters, or other kinds of exponential smoothing. So it's certainly covered by the standard algorithms, and Graphite actually has these functions built in, so if you're using Graphite, that may be an option for you. We can talk afterwards about particular implementations if you like, but there are many, many options. (There's a toy Holt-Winters sketch right after this answer.)
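To give a flavor of what Holt-Winters-style smoothing does, here's a toy sketch of additive triple exponential smoothing. It's not Graphite's implementation; the smoothing factors and the four-step "day" are arbitrary, and the crude initialization from a single cycle is just for illustration. The residual against the one-step forecast is what you'd test for anomalies:

```go
package main

import "fmt"

// HoltWinters holds state for additive triple exponential smoothing:
// a level, a trend, and one seasonal offset per position in the cycle.
type HoltWinters struct {
	alpha, beta, gamma float64 // smoothing factors (assumed values)
	level, trend       float64
	season             []float64 // one entry per step of the cycle
	t                  int
}

// NewHoltWinters seeds the model from one full cycle of observations:
// level = cycle mean, trend = 0, seasonal offsets = deviations from it.
// Crude, but fine for a sketch; real libraries fit these properly.
func NewHoltWinters(alpha, beta, gamma float64, firstCycle []float64) *HoltWinters {
	mean := 0.0
	for _, x := range firstCycle {
		mean += x
	}
	mean /= float64(len(firstCycle))
	season := make([]float64, len(firstCycle))
	for i, x := range firstCycle {
		season[i] = x - mean
	}
	return &HoltWinters{alpha: alpha, beta: beta, gamma: gamma, level: mean, season: season}
}

// Forecast is the one-step-ahead prediction; comparing it with the
// next observation is how you'd flag seasonality-aware anomalies.
func (hw *HoltWinters) Forecast() float64 {
	return hw.level + hw.trend + hw.season[hw.t%len(hw.season)]
}

// Update folds in the next observation.
func (hw *HoltWinters) Update(x float64) {
	i := hw.t % len(hw.season)
	prevLevel := hw.level
	hw.level = hw.alpha*(x-hw.season[i]) + (1-hw.alpha)*(hw.level+hw.trend)
	hw.trend = hw.beta*(hw.level-prevLevel) + (1-hw.beta)*hw.trend
	hw.season[i] = hw.gamma*(x-hw.level) + (1-hw.gamma)*hw.season[i]
	hw.t++
}

func main() {
	// One toy "day" of four readings seeds the seasonal offsets.
	hw := NewHoltWinters(0.5, 0.1, 0.1, []float64{10, 14, 20, 14})
	for _, x := range []float64{11, 15, 21, 15, 10, 14, 50, 14} {
		f := hw.Forecast()
		fmt.Printf("x=%5.1f forecast=%6.2f residual=%6.2f\n", x, f, x-f)
		hw.Update(x)
	}
	// The 50 stands out through its large residual, even though the
	// series itself swings up and down every cycle.
}
```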
Any real questions? We have time for maybe one more. Okay. So yes, the demo system is fully online: it doesn't store any data, and every event that comes in is processed as it's received and then discarded.

Is it assuming a distribution? No; what it's actually doing, if you want to look it up, is Welford's algorithm, I think that's the implementation. It's designed as a one-pass calculation of variance, and you get the mean and the standard deviation as byproducts. If you have a one-pass algorithm, you can use it as a stream processing algorithm, because you can just consider one infinite pass, and then you have a stream. The problem is that if the distribution changes over time, you have a big issue, because the values from the past are not properly weighted compared to the recent history. So in reality you wouldn't want to use this; that's why I said don't use it. You'd probably have some kind of sliding window with exponential decay, or reservoir sampling, or something like that, so you can still do all of the data collection but weight the recent data more appropriately. Did I answer your question? Okay.

Any other questions? Why do they use these systems at all? Well, some companies really need this: there are companies out there that really need these distributed data processing systems, and for them the pain of setting them up and maintaining them, or even building them from scratch, pales in comparison to the benefit they get. Those companies are very, very few. So maybe the question you mean is why so many companies use these distributed systems when they don't need them, and the answer is: well, they're new and they're fun. Postgres has been around for twenty-something years, I don't even remember how long, and relational databases haven't been the new hotness since the 80s. So people want to do NoSQL stuff and key-value stores and all these kinds of things, and distributed systems are new; they're more intellectually exciting for a lot of people. And you're laughing, but it's true: I've literally been at companies where a product owner has gone to an engineering team and said, "You're going to build this, and you're going to use Hadoop, because I need to be able to tell people that our product uses Hadoop, because we do big data," or something like that. So the reasons are almost never technical. Almost never; I shouldn't say never, because, as I said, there are some companies that really do require this stuff. But almost never.

Okay, I think we're out of time, but thanks very much.