So thanks for coming. I actually appreciate the intimate setting here with a low number of attendees; it makes for a bit more interaction, so I might be probing you a little. Be prepared to either shout at me or raise your hand, no pressure. Anyway, I'd like to talk to you today about anomaly detection. As a developer advocate for OpenSearch, I have to go through the new features and learn how they work, and sometimes I'm a little slow on the uptake, so I have to find my own way to learn how a process works. What I'd like to do today is walk you through a really simple anomaly detection workflow and share the relevant pieces of the user interface with you. I'll warn you up front: I'm not a math professor, and I'm not here to teach you math or statistics or probability. I only hope to share the OpenSearch UI with you, so that you, who are probably much smarter than me, can apply it in a much broader context. That's what I'm going to do today, and I appreciate everyone coming.

Oh, yeah, that's me. My name's Nate Boot, and I'm a developer advocate for OpenSearch. I've been at Amazon Web Services for six and a half years, I sing baritone in the Rose City Timberliners Barbershop Choir, and I'm an amateur father and dad joke specialist. Again, thanks for letting me come out here and ramble at you. You're probably not here to learn about me, so let's get to the really good stuff.

So, I warned you I might probe: is anyone familiar with OpenSearch who didn't hear about it at a talk here today, or was anyone familiar before Open Source Summit? Two or three? Awesome. I'm just curious. Hopefully you'll get a taste for it today. OpenSearch is an analytics and observability suite, and one of its features is anomaly detection. So we'll get some time series data loaded and see what we can detect as that data moves.

I am going to take a few minutes to brag a bit about the OpenSearch project. We had a year in review at the end of 2022 where we counted our GitHub stars, our number of contributors, and our downloads, and at the end of 2022 we had a pretty good, thriving community: 9,000 stars on GitHub. And let's not forget this was the end of 2022, so this 100 million downloads mark is actually up to something like 200 million by now, with about 500 contributors from all over the place. I like to share this because the momentum is important to me, and I feel good about it because I feel like I helped. By now those stats are probably a lot higher, so we're taking off pretty quickly. I hope this is useful to everyone. Like I said, our project will never thrive without a thriving community, and we are community driven, so we always welcome those GitHub stars and contributions from everyone.

What I'm hoping to do today is build one of these with everyone, so I'll give you a quick run-through of what I've got here. One of the ways we send data into OpenSearch is by using a companion app called Fluent Bit, which sends documents into OpenSearch. It also has an input plugin for systemd. Is anyone familiar with systemd? I guess we're Linux people here. Just in case: systemd is a service manager for modern Linux distributions.
It knows what the services on the machine are, what their dependencies are, and what order they should be brought up in, because you don't want to bring up a network-dependent service when the network isn't up. It manages those dependencies for you. One of the things systemd also provides is a journal. I'm sure you've heard of journaling file systems and all that; the systemd journal is the place where the logs from systemd and from all of its various services get centralized. So systemd has this journal, and Fluent Bit has an output plugin that goes right into OpenSearch. We're going to take advantage of that today, and that will be our ingestion pipeline.

Those documents will go into OpenSearch, and then there are three components of OpenSearch we'll visit: the actual anomaly detector, the dashboards panel where we can review anomalies, and monitoring and alerting. Hopefully I'll take you through all three of those in enough detail. But let's start with the ingestion pipeline. We have to get our data into OpenSearch somehow, and Fluent Bit is how we're going to accomplish that.

Fluent Bit, enter stage left. Fluentbit.io is where you can grab it. It's made by our partners at Calyptia. I'm very fond of it myself, and it comes with a lot of features out of the box, which helped my neurodivergent mind figure things out a little more easily.

This is just the relevant portion of the Fluent Bit config. Let's see if I can do a fake laser pointer here. It's a little small, but in the input portion of the config we give it the name of the plugin, and we're using the systemd plugin, so it's a good thing I put that there. We're tagging all of the documents coming out of Fluent Bit with a tag that we match on in the output section. I also had to strip underscores: indices cannot begin with an underscore, and there are certain fields that also cannot begin with an underscore, so I'm stripping them. To me it's really a matter of readability; I find the distinction arbitrary, but sometimes there are limitations and we have to work around them. That was one of the weird things I ran into that didn't make sense, but I found my own way around it.

In the output portion we send the systemd logs up into OpenSearch. We turn on TLS and turn off TLS verification: OpenSearch is secure out of the box, but the only way we could do that was with self-signed certificates, and since those aren't issued by a major certificate vendor, we have to skip verifying them. We stopped using type names a few versions ago, so I also have to suppress the type name. OpenSearch is running on localhost in a Docker image, the ingest port is 9200, and I will beg your forgiveness for having hard-coded admin and admin in my configuration. Your mileage may vary.
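To give you something concrete, here's roughly the shape of a config like that, a minimal sketch rather than the exact file I used; the option names come from the Fluent Bit systemd and opensearch plugins as I understand them, and the tag, index name, and admin/admin credentials are just demo placeholders:

    [INPUT]
        Name                systemd
        Tag                 systemd
        # drop the leading underscores on journal fields like _HOSTNAME
        Strip_Underscores   On

    [OUTPUT]
        Name                opensearch
        Match               systemd
        Host                localhost
        Port                9200
        Index               systemd
        HTTP_User           admin
        HTTP_Passwd         admin
        # self-signed certs on the OpenSearch side, so encrypt but don't verify
        tls                 On
        tls.verify          Off
        # newer OpenSearch versions have no mapping types
        Suppress_Type_Name  On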
Once that's all up and going... I didn't have the luxury of doing it in real time for you, but usually what we can do is use the Discover panel in OpenSearch Dashboards to determine whether data is going into our index or not. If you have a look over here, this is just a small snippet of the Discover panel. The green bars represent the quantity of log lines, or messages, or documents, however you wish to refer to them, and here's the data inside of them. It's a little unreadable; I'll help you with that in just a second. But as a means of verifying that there's actual data going into OpenSearch, I prefer the Discover panel, because you can just click on an index name, select a time range or whatever you like, and it shows you documents.

The messages that come from Fluent Bit take the form of JSON, and one of the things I struggled with was visualizing them. So I took the liberty of pulling out some of the messages that systemd sends so that we can scroll through and have a good look. Here's an example of what might get sent from Fluent Bit. The most useful pieces for me here are the host name and probably the systemd unit or the syslog facility. What that gets me is the host name that the service is running on, as well as the syslog identifier. So this tells me that sshd sent a log saying a connection to a host on port 4000 failed. That's a typical log line to get, but it's going straight into OpenSearch in that form, and if you go through the Discover panel and select some of these rows, they'll have all of those fields. That helps me get a feel for what I'm actually sending. If you were to look at the logs by hand or use the journalctl command, you would see a lot of these fields as well, but I find the JSON version much more appealing. Anyway, that's what a typical message looks like going into OpenSearch.
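To make that a bit more tangible, here's a rough sketch of what one of those documents can look like once the underscores are stripped; the field names are typical journal fields, the @timestamp key is the time field Fluent Bit adds, and the values here are invented for illustration:

    {
      "@timestamp": "2023-05-12T14:03:22.481Z",
      "HOSTNAME": "ip-172-31-5-10",
      "SYSTEMD_UNIT": "sshd.service",
      "SYSLOG_IDENTIFIER": "sshd",
      "SYSLOG_FACILITY": "4",
      "PRIORITY": "6",
      "MESSAGE": "connect to host 203.0.113.7 port 4000 failed"
    }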
So we're part of the way there: we've got the ingestion through Fluent Bit, and we've got an idea of what our data is. I can't express enough how useful it is to be familiar with your own data beforehand. That way you can really think about what you're searching for, what it represents, what it all means, and what the context is for all of that information. Now that we're sure we're sending data in and it has the format we want, we can go ahead and create a detector.

A detector in anomaly detection is what we call the process that runs periodically to detect anomalies in an index. When we create a detector, we have to select a data source and a time frame. The data source, of course, is an OpenSearch index, and you saw that one of the fields that scrolled by from systemd was a timestamp. We're going to correlate against those two things to create a detector. So we select a data source and a timestamp field. Here's what the UI might look like for that. It asks you for a data source, and you choose one of your indices; the index I happened to send my logs to was called systemd, and you saw it in the Fluent Bit configuration stanza on the earlier slide. Then there's this option here, the data filter, which is really a matter of efficiency. systemd has a lot of services, and if you wanted to narrow your documents down to just, say, sshd, you could add a data filter query here saying the systemd unit field equals sshd. Then the corpus of documents would be limited to just the ones whose systemd unit is sshd. It's a trade-off of efficiency: if you know you don't need certain data, you can narrow it down. And this is why I mentioned it's important to be familiar with your data beforehand. Those mappings and indices can grow pretty big over time, and it's always a good idea to know what you're going to be searching for and to eliminate all the extra stuff. It's like a math test in high school: they try to trick you with extra information and get kind of sneaky.

The second part is the timestamp field. We can't have an anomaly without a span of time and some kind of calculation, some numerical data that we can detect an anomaly in. So once you've specified a data source, the next thing to do is select a timestamp field. Fluent Bit adds this in by default; it very nicely throws a timestamp into each message, and I think you saw it scroll by earlier. That's what we're going to use as our timestamp field. Every time a message comes in, the detector notes that timestamp and uses it to detect anomalies. We'll set some other configuration variables for that in a minute, and it'll make a bit more sense when I get there.

We also have to set the detector interval and the window delay. These anomaly detectors in OpenSearch are processes that fire off at every detector interval. So if you want to check your index for anomalies every 10 minutes, you set your detector interval to 10 minutes. Then there's the window delay. Some customers with very large setups have logs coming in from lots of places, perhaps even geographically dispersed locations, and sometimes it takes a little while for those logs to arrive. So we can specify a delay, measured in minutes, to wait for all of your logs to be fully up to date. If not all of your logs make it in time, you might be doing analysis on incomplete data. So if your logs take a while to make their way to OpenSearch, it helps to set that delay; there's always the potential for lag and interruptions over the internet or whatever automated system you're sending your data through. In short, we specify an interval and a delay: how often we want to search for anomalies, and how long we expect our data to take to reach us in its entirety.

The final part, or close to the final part, is configuring your model. We've talked a lot generically about indices and the interval and the timestamp and all of the messages we're sending, but we actually have to pick something to detect anomalies on. That's what we're going to do next. We call them features. I don't know why we call them features; a feature is built on a field in your index. A feature could be a calculation such as the average of a particular field in your index, or the sum of a particular field, over that interval. This is where my brain melted a little, because sshd, or systemd in general, doesn't give you statistical data like that; it's just logs. So what I ended up doing, and you'll see the UI for this in a minute, I won't leave you in the dark, I promise, was using one of the aggregations we can compute very quickly: counting the number of log lines. I had all of this data but no real statistic to calculate, and that was weird, so I decided to just count the number of messages. That added up to something interesting to me, although maybe not practical for everyone, but like I said, I have my own way of learning these things, and this is how it all made sense to me.
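In query terms, "count the messages" is just a metric aggregation over the detector interval. A minimal sketch, assuming a MESSAGE field with a keyword subfield in the mapping (that field name is from my demo, not a requirement), looks like this; you'll see it again in a moment inside a full detector definition:

    {
      "message_count": {
        "value_count": { "field": "MESSAGE.keyword" }
      }
    }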
We pick up to five features to detect anomalies on; I'm only picking one. We also pick categories, up to two of them, and this changes your anomaly results into a bit of a heat map, where one category runs along one axis and the other along the other, and it aggregates across them based on anomaly grade and anomaly confidence, two things I'll get to in a minute. There's something to take into account here. We had a trade-off of efficiency before, and we have another one here: the more features you choose, the higher the cardinality of the results we have to search through. If you're trying to aggregate five different fields and find out what's anomalous between them, it's more difficult than just taking two. The sample space of data you have to look at grows exponentially with every field you add. So that's another part of it: make sure you choose only the fields you need to detect anomalies on, otherwise you might be spinning your CPU more than you need to.

Like I said, we can't detect anomalies on data that isn't time series. We have to have a timestamp and a value of some kind. systemd doesn't give us a value; it basically just gives us textual information. So I'm counting one of the fields that I'm certain is in every single message. It's like counting the number of rows in a database: just aggregating on the count of a particular field.

This is what it looks like. We have to give our feature a name; boy, that's kind of small, sorry about that. I'm calling my feature "message count," and I'm enabling it. There's a checkbox here for enabling it, and that was very helpful to me, because as I was clicking around in here, basically screwing around trying to learn this, I didn't know what it was for. Why would you want to create a feature and not enable it? Well, it's mostly for testing. It was helpful for me because I click a little fast and move a little fast, and I wanted to be able to make these on the fly and see whether they gave me back data that made sense. If you want to make a bunch of features and then enable or disable them, I found that particularly useful for testing things, because it all takes a bit of experimentation and a bit of arbitrary creativity.

I mentioned categorical fields to create a heat map. In this case, if it's legible there, the two fields I've chosen are the host name and the systemd unit. That resolves to a host name or an IP address, plus the systemd unit that's producing all of these log lines. You can't change these after you create the detector, so if you wanted different categories, I think you'd have to create another detector; you can create as many as you want, and that's just how she goes. To me it makes sense, because if you want to detect anomalies, the results have to be readable. You have to be able to skim them, and the more stuff that's in there, the more things it touches, the more difficult it is to really make sense of it all.
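Pulling the pieces together, here's a hedged sketch of what the same detector looks like created through the REST API rather than the UI: data source, timestamp field, the message-count feature, the two category fields, and the interval and delay. The endpoint and body shape reflect the anomaly detection plugin's API as I understand it, and the journal field names come from my demo mapping; category fields generally need to be keyword or ip typed, so adjust to whatever your own mapping shows:

    POST _plugins/_anomaly_detection/detectors
    {
      "name": "AD-systemd",
      "description": "count of journal messages per host and unit",
      "indices": ["systemd"],
      "time_field": "@timestamp",
      "feature_attributes": [
        {
          "feature_name": "message count",
          "feature_enabled": true,
          "aggregation_query": {
            "message_count": {
              "value_count": { "field": "MESSAGE.keyword" }
            }
          }
        }
      ],
      "category_field": ["HOSTNAME", "SYSTEMD_UNIT"],
      "detection_interval": {
        "period": { "interval": 10, "unit": "Minutes" }
      },
      "window_delay": {
        "period": { "interval": 2, "unit": "Minutes" }
      }
    }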
The last thing that happens when you create an anomaly detector in OpenSearch is initialization. Init never changes: it's just hurry up and wait, grab a coffee and sit there. Hurry up and wait. But the cool thing is that when it's done, you get to revisit your anomaly detection panel and see what's coming in. The panel, revisited.

If you go to the anomaly detection panel in OpenSearch (I think a couple of minutes had passed since I made my detector), you can see some anomalies coming in. This represents an anomaly, and these are the heat map categorized ones. It's a little small, but you can see the local IP host name right here and then the SSH service. You'll notice that it took very little time to see an anomaly in sshd. If you picture a freshly spun-up EC2 instance running the Docker image of OpenSearch, with port 22 open but key authentication enabled, it doesn't take long for some 13-year-old script kiddie to come by and try to steal my credit card information. So I kind of expected the sshd service to light up pretty quickly. It's not going to tell you exactly what's happening, but what I can see is: well, these services are a little busy, maybe I should have a look. That was useful to me, and like I said, your mileage may vary; it may or may not be useful to you, but this is how I imagined it in my head, and it makes sense to me.

After a while, you start pulling in more and more data, you get more and more diverse results, and your panel ends up looking something like this. We've pulled in documents for all these different services: dbus, cron, ssh, init, snapd. As the number of log lines increases and varies, we start seeing these heat maps, and the color from white to red represents the anomaly grade, or how serious the anomaly is. That's important to know, because we have two measures, called the anomaly grade and the confidence.

I think this is a pretty good analog: if I don't know anything about blood pressure or pulse rates and I count my heart rate once, I've only got one sample. I don't know if that's weird; I've never checked it before. The more times I check my pulse, the more familiar I get with the highest it's ever been, the lowest it's ever been, and what the average number of beats per minute is. You can actually see this in action here. It's just barely visible, but this green line with the dots represents our confidence, and as we get more information, it increases. If you were to widen the span of time of the data you're reviewing, you would see that confidence going up over time. The more times I check my pulse, the more I know what it's like: when I run, it beats really fast; when I'm tired, it beats really slow; and I become more confident about what's different. If I suddenly measure 400 beats per minute, well, I'm calling an ambulance. I wouldn't know to do that if I hadn't checked a whole bunch of times before. Was that a good analog? Did that make sense? Everyone, thumbs up, thumbs down? Awesome, rad.

Anyway, that's what the panel looks like when you click through on one of those heat map categories. You can see the anomaly, and it's on a scale. We also have the anomaly grade, which is just how far off the value is from what's expected; that gives you a severity, or some kind of gauge of how far off the number is. This one showed some kind of anomaly at that time, and I'm willing to bet that since this is for an SSH service, there was probably an invalid login or someone trying some kind of attack. But it just means there are lots of extra log lines there that weren't there before. Something's going on.

There's a final component that's related to anomaly detection, but only in the sociological sense.
Is anyone here in security, or anything like that, or using some other kind of anomaly detection? If you're in security, you don't sit at your desk waiting for someone to try to break in and then send an alert. So we have an alerting plugin that goes hand in hand with anomaly detection. Oh, there's nothing on the screen. There we go: an alerting plugin that goes hand in hand with anomaly detection. It's a plugin in and of itself; you don't have to use anomaly detection to set up alerts, but it is helpful, because I like to know when an anomaly happens.

Alerts require a channel. What I struggled with before I came here is that we have a lot of different output types for alerts. We can send a Slack message, Chime, a custom webhook, or email. All of those require different parameters: ARNs, IAM profiles and whatnot; Chime has its own webhook URL, and there's Amazon SNS, where of course you'll have to provide some kind of Amazon credential if you want to use the Simple Notification Service. But the idea is that you don't want anomalies to exist in a vacuum; you want to know when they happen. I do. And I know I said no one sits around just waiting for anomalies. I actually kind of did when I was making this, because all the data came in quickly, and I knew someone would try to break into my EC2 instance pretty fast. So I didn't have to sit too long, but I did kind of sit around waiting for it.

There's a specific configuration you have to use when you're going for an anomaly detector alert, and it's a per query monitor. If you select per query monitor, that's the only place where you see anomaly detector show up. An anomaly detection result is nothing more than another document in an index with a particular format, so it really is a per query monitor: you're just querying for anything that looks like an anomaly detection result. For the monitor defining method, anomaly detector is the one you want to use, and in the detector dropdown you pick the detector you made in the beginning; we named it AD-systemd.

Then, probably most relevant, is the schedule: how often do you want to be bothered by your anomalies? Well, it really depends on what it is you're checking for anomalies in. If my credit card has an anomaly in the number of dollars spent in a 24-hour period, I'd want a phone call. Or if your doctor measured your blood pressure and it was anomalous, you'd want to be told instantly. So depending on your workflow, you really want to set the schedule of your notifications to match how often you want to be bothered. Or notified; it's not necessarily a bother.

When you set up your channel, you can set up the trigger that sends the notification. This is where you actually get to tell it: if the anomaly grade is above a certain threshold and the confidence is above a certain threshold, that's what triggers it, and based on those two things we give it a severity. You can also use another format besides that; under trigger type here, the trigger type is anomaly detector grade and confidence, but there's another option there. It was way too complicated for me to figure out what I was supposed to put there, so I didn't take any pictures of it. Sorry about that.
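To give a feel for what the monitor is actually querying, here's a hedged sketch of an anomaly result document as I understand the format; the result index is typically named something like .opendistro-anomaly-results, and every value below is invented. The trigger condition is essentially comparing the anomaly_grade and confidence in documents like this against your thresholds:

    {
      "detector_id": "abc123",
      "anomaly_grade": 0.82,
      "confidence": 0.94,
      "data_start_time": 1684937400000,
      "data_end_time": 1684938000000,
      "entity": [
        { "name": "HOSTNAME", "value": "ip-172-31-5-10" },
        { "name": "SYSTEMD_UNIT", "value": "sshd.service" }
      ]
    }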
So I appreciate you letting me come out and core dump at you. I'd like to remind you that the only difference between screwing around and science is writing it down. I went through a somewhat lopsided process to learn anomaly detection, but I recorded my steps, and now I've got a story to tell about how I learned it. It really demystified it for me, because I saw exactly what was going in, and I clicked through it at my own pace. So I hope that helps. I don't like saying that I'm slow on the uptake, but I take a while to get things solidified in there, and I have to play, I have to screw around. Sometimes that leads to pretty good stories where I actually do learn things, and sometimes it leads to embarrassment. This one was a good one for me.

I wanted to give you a call to action before I sign off. We have a playground where you can come and play with OpenSearch if you like: playground.opensearch.org. We also have a QR code that will take you there. I humbly encourage you to go click around; if you don't know what something does, all the more reason to click on it. Everyone get the QR code? Wonderful. That's all the blather I had for you. I'd like to offer the time up for some questions, or if anyone has any comments or criticisms.

Has anyone set up an anomaly detector before? Yeah? What software suite did you use, or how did you do it? If you don't mind speaking, it's okay. Yeah, and that's the same principle: you're looking for thresholds, for data that's out of the ordinary. You can define an anomaly as just anything out of the ordinary, sure. Well, we're very similar to the ELK Stack, so I think both solutions would be just fine for detecting anomalies. I would probably respond to that by saying we have a community-driven product here. If there's some feature you need that it does not currently have, we have bi-weekly community meetings, we've brought all of our backlog and triage into the public over Zoom, and you're welcome to file an issue on GitHub for just about anything. I always like to say that what makes us stand out is that we're community-driven and fully open source.

Sure, go ahead. The algorithm that's used is called Random Cut Forest, and it analyzes a point along with all of the features and data around it. I can't really think of an analog for that kind of algorithm or a way to abstract it, but I'm sure if you looked up Random Cut Forest on the internet, you would get a pretty good answer. I'm happy to put you in contact with someone who can give you an awesome answer there, someone much smarter than me.

Go ahead. I believe there is. You can set up an alert based on a bucket, so as you aggregate your results, you could create a bucket and send an alert based on the aggregate there.

Well, yes, historical data is a good point. If you have an index already filled with data, when you start the detector you can tell it to analyze the historical information there, and instead of working at regular intervals, it'll take the entirety of what you have and run the anomaly detection process over it. What that does is provide more confidence than it would normally have, because it has some data to work on. Like I said earlier, throughout my life I've checked my heart rate many times, and from doing it so often I know what ranges to expect. If I had kept a record of that, I would have an even better view of what is normal or abnormal, or out of the ordinary, however you'd like to phrase it.
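As a rough illustration of that historical analysis, starting a detector over an existing range of data looks something like this through the REST API; the endpoint and body shape are my best understanding of the anomaly detection plugin, and the detector ID and epoch-millisecond timestamps are made up:

    POST _plugins/_anomaly_detection/detectors/abc123/_start
    {
      "start_time": 1684800000000,
      "end_time": 1684886400000
    }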
Yeah, sure. Yes, we have a RESTful API that you can use both to create detectors and to manipulate them.

Sure, go ahead. Gotcha. I think the main problem that would introduce is lag from all of those disparate logs making their way into OpenSearch. If I'm reading the question right, what you might want to take advantage of there is the window delay. If you know it's going to take five minutes for all ten minutes' worth of logs, or whatever your interval is, to arrive, you want to set the delay long enough that you're sure the whole interval's worth of logs will make it in. That way you're not detecting anomalies on an incomplete data set.

How's everyone doing? Oh, one more, awesome. Yeah, it analyzes the last interval's worth of data, so it's not a single point in time. Although there is something to take into consideration there: if you have a feature or some calculated statistic that bursts way up and then falls right back down, it might happen too fast to be detected. In that case, you'd have to adjust your interval, how often the detector fires off, to catch those. That's another one of those interesting trade-offs: the more frequently you check for anomalies, the more CPU you use, because those processes are firing off quickly, and the more detectors you add, the more that problem compounds, because you're firing off multiple anomaly detection processes checking for multiple different features, and it gets hairy. Yeah, that's correct.

Wonderful. Well, I don't see any more curious faces out there. I'd like to thank everyone for having me out here and for coming out at the very end of Open Source Summit. I know it's hard, with everything ending at three today, and everyone's probably tired and jet-lagged. So you have my humble thanks for coming so late on the last day to see my face and hear me ramble about anomaly detection. Thank you very much.