Alright, as I said, thank you for coming today, and welcome to Machine Learning for Kubernetes Logs and Metrics. I'm Libby Schultz and I'll be moderating today's webinar. We'd like to welcome our presenter, Larry Lancaster, founder and CTO of Zebrium. A few housekeeping items before we get started: during the webinar you're not able to talk as an attendee, but there's a Q&A box at the bottom of your screen, so please feel free to drop your questions in there and we'll get to as many as we can at the end. This is an official webinar of the CNCF and as such is subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would violate that code, and please be respectful of all fellow participants and presenters. Please also note that the recording and slides will be posted later today to the CNCF webinar page at www.cncf.io/webinars. With that, I will hand it over to Larry.

Thank you so much. Hey everybody. Today we get to talk about something that's near and dear to my heart: machine learning on logs and metrics, in particular when we're deploying in Kubernetes. So with that, let's get started.

Machine data is my life. I've wasted far too many years dealing with machine data, at a number of companies and in a number of roles. What I've found, at the end of the day, is that there's a lot of value that can be gotten out of telemetry, but it's a long mile to walk to pull that value out. The vision that I thought became realistic maybe five or six years ago was this: is it possible to do a better job with machine learning, so that there isn't so much work to do to get the value out of the telemetry? That's really what we'll be talking about today.

Let me frame the problem space as I see it, and how things are converging. Twenty years ago there was shrink-wrap software. If you had an incident, you would have one affected user and one monolithic application, and, stretching it, maybe ten different log files to look at. If you were digging into root cause, you would index those files if you were lucky; often they were small enough that you could just search them, so for any given incident the volume of data wasn't even a problem. And it has always been true, as long as I've known it, that for detection of incidents, metrics are sometimes the best way to go; but if you want to get to the root cause of a new incident, a new problem, chances are you're going to end up in a log file at some point. So logs are an important piece of the workflow of incident management and triage. Compare that to today: it's a SaaS world. One operational incident might affect 100,000 users, maybe 100 services could be taking part, and you have 1,000 log streams of telemetry coming through, so using the telemetry becomes inherently more difficult.
But if you look at how people are using that telemetry today, it's often still the same approach. So the question is: do we need something better, and if so, what is it?

To step back and look at it, there's a State of DevOps report done annually that tries to survey the trends in the field. One thing that got called out in 2019 is that, because of the complexity of a typical deployment today, MTTR has plateaued. In other words, even the elite shops have reached the limit of what they can automate by scripting rules and so on. What's typically driving MTTR now is new problems, and the reason for that is the complexity of today's software. So something needs to change. The same report reflects this: whether it's a small shop or a large shop with an elite team, the new, unknown problem has become the driver of downtime today.

Our vision is that autonomous root cause analysis will have to save the world from this problem. When I say autonomous, what I mean is that all the slogging through data, trying to find what you're looking for when you don't have a rule in advance to find it, needs to be automated, and you can't do that with handwritten rules you've constructed. To me, that's why machine learning on telemetry is very important and is probably going to become a lot more important in the coming years.

So if I think about what I would want from a tool that's going to help me in this area, what are the facets of that? For one thing, it would be nice if it automatically detected new problems, without setting up alert rules; we can talk more about that later. We often have alert rules already, or we have tools that are monitoring and catching problems, in which case those tools can do detection for us. But even in those cases, I want help finding the root cause without manually searching when I don't know what I'm looking for.

So what kind of requirements would be reasonable for such a thing today? This is where I think Kubernetes has presented a whole new wealth of opportunities for everyone; it's an amazing ecosystem. Someone can come in and deploy our software and collectors with a couple of Helm charts, or with a couple of kubectl commands. It's amazing how little configuration can be required to deploy an application. So while a lot of complexity has come along with the microservices environment and the decentralization of software, there has also been a birth of new flexibility, coming mostly out of the metadata that these deployment systems contain.
I'm going to want to be able to monitor an arbitrary application. It would be nice if all of our software were running in a JVM somewhere, but that's not always the case: arbitrary runtimes need to be supported, and arbitrary infrastructure too. It could be that I need to monitor a Linux instance in AWS, or some bare-metal server somewhere, so there are a lot of combinations now that form the complexity we talked about a minute ago. Taking these things into account becomes important if you're going to do effective automation of root cause.

Another interesting thing comes up when you take one application stack and deploy it in one environment. Say, for example, an Atlassian stack, which a lot of people use. There are still a lot of people running it on-prem, or at least in their own VPC, and there will be for a long time, because it has source code in it and some people just aren't comfortable putting that in a cloud environment. There are a number of applications like this, sensitive databases and so on, that can run in all kinds of different places. And the way you're using them matters: if I have a database and all it's doing is serving reports to a business intelligence group, that's one way to use it; another way is backing some web-facing services with it. That's what I mean when I say an arbitrary environment. The patterns of usage, the kinds of events that come through in the logs, the way they come through, the patterns they form, the regular heartbeat of that stack, will all be very different. So it's really tough to have a model where I go and, quote unquote, learn about Postgres logs and what's normal and not normal in environment A, and then take that learning and distribute it to a thousand users who are using Postgres in completely different ways. The implication, from the perspective of doing machine learning, is that a lot of custom learning has to happen in whatever environment the solution is deployed in.

I talked a little about zero required configuration for setup, and how useful Kubernetes has been for deployment in that respect. There are lots of other things you don't want to require. You don't want to require an end user to sit down and train the system: "here are a thousand log events; is this one interesting, yes or no?" We've found that people aren't willing to do that kind of work. That bounds your solution space quite a bit. There are a lot of things you can't require of a user to get started; they may very well end up wanting to do some of them later, but you can't require them if you're going to call your application autonomous. So in that sense, is it really too much to ask? I think it has gotten to the point where it's not too much to ask to be able to see real value immediately.
There's always going to be a period of adjustment, a period of going in and giving some amount of coarse-grained feedback; you're always going to want that, and you're going to want to let people bring their alert rules. I'm not saying these things aren't valuable. In some cases they're critical. But my contention is that your system can't require all that to show value. Why do we need to be so flexible? Because if I require a person to go in and do training, that's probably not going to scale indefinitely, and any assumptions I make about a stack may not hold.

So given that, where do we start? Let's say we're doing a very general machine learning task on a set of telemetry. What set of telemetry would you start with? Everyone may have an opinion on that. My opinion is that if you're targeting root cause analysis, then you have to start with logs. Logs are difficult to work with, but there are reasons they're so valuable when you're root-causing a new incident type. For one thing, logs are self-describing, at least the free-text kind. (This is Gavin: don't forget to hand over for the demo.) Oh yes, thanks Gavin, I'm going to do that right after this slide. A free-text log tells a real story about what's happening. If you look at the log lines I've put here as an example, you'll notice that, at least if you have some experience, when you read them they tell you what happened, without a lot of rules having to sit behind them; you don't need a lot of metadata to tell that story out of the text. Actually, you're right Gavin, I skipped the spot where I was going to stop a couple of slides ago. So, really quick, I'm going to let Gavin jump in. Gavin is our head of product and he's going to start up the Kubernetes demo, and then at the end we'll look at how things panned out. Hold on, let me stop sharing. Gavin, apologies for that.

Thanks Larry. As you just mentioned, I'm going to run this demo in two parts. In the first part, which I'll do now, I'll show you the demo environment I'm using, and then I'm going to break it. In the second part I'll come back and show you what the Zebrium machine learning picks up. So if I share my screen now: I'm going to take you into Google Cloud's microservices demo app. It's a little web app running a shop. I'm going to purchase a barista kit as we speak, just to show you what it looks like running. And there we go. I've also got a whole bunch of services running: Istio, Prometheus, and Kiali. In Kiali we can see what the data flows look like and what's going on in the service mesh. You can see there's a lot of activity; we have a load generator that's doing quite a lot. And then, just prior to this, I also installed the Zebrium log and metrics collectors. The reason I did that before the webinar is that I wanted to give it a little bit of time to learn the basic patterns in these logs.
The way you sign up for Zebrium is to go to our zebrium.com page, click Get Started Free, fill in your name and so on, and set a password. That takes you into a screen that looks something like this. Then you install our collectors, and there are pre-built commands to do that with your auth token. For Kubernetes, for the log collector, with Helm 3 you create a namespace and then install the collector, and really the only thing I need to set is a deployment name. Similarly for the metrics collector: you run two Helm commands to install it, and I would use the same deployment name. That essentially sets up Zebrium to receive your logs and metrics.

So what I'll do now, for the grand finale before I hand back to Larry, is break my application. If you look here, these are the pods running for the app. I'm going to kill the product catalog service pod by scaling it down to zero replicas. It should die in a moment, it's busy terminating now, and it should disappear. If I go back to my web app now and try to do something, it's not working particularly well. And in a moment Kiali should start to turn red as it detects things failing. So essentially my app is completely broken now. What Zebrium should see is a change in the patterns coming through, and we have built absolutely no rules to detect this kind of problem. I know it's a fairly trivial problem, but bear that in mind. Let me hand back to Larry, and I'll come back towards the end to show you the incident Zebrium should pick up from that. Larry, back to you.

All right, that was fantastic, Gavin, thanks. What's interesting about that demo Gavin started: I think it was a week or two ago, a service provider just went in, signed up, and decided to see what happened if they did exactly this. We thought it was so simple and cool to show that we ended up appropriating it.

So let me jump back into logs, because you can imagine all the logs going through that system right now. A free-text log tells a story, and that's part of what makes logs so useful for root cause analysis. I have to wonder why, in general, people don't use them much for monitoring. And I should say it's not true that people don't use them at all for monitoring; there are plenty of alert rules built on logs all over the place. It's just that, in general, the direction for monitoring, at least for finding out when things are broken, is to use metric alerts, and part of the reason is that logs are generally higher volume, and there are some other conceptual problems with logs that make them difficult to work with. If we think about what log monitoring tools look like today: generally you sit down and build all of this automation around them, and it's a tedious and manual process. But we know that just under the surface, when the manual work is applied, there's a lot of value there.
So what makes logs difficult to work with? Let's say I'm trying to root-cause some issue that just happened. It's slow and painful when I'm having to search for keywords and I don't know what they should be. Maybe I'll look for a spike in log volume; maybe that will tell me where to look first. But then you find out the yum update ran in the background and the spike was from something else entirely, and I still don't know where to look. So then it's: let me type in things like "fail", "bad", "abort", whatever, just trying to find at least a place to start looking. It's a real pain.

Log rules are also fragile, because formats change. I don't know how many times this has happened to me: you set up a rule on some logs and go away happy with yourself, because the next time that event happens you're going to catch it and do something special with some value that's a parameter in that log event. Then someone upstream, who has no idea your rule exists, decides to do something genuinely helpful and nice, like fixing a spelling mistake. The next thing you know, your little rule silently breaks. These are the kinds of frustrations log monitoring surfaces.

And finally, it gets noisy. If you try to set simple rules, say "do this every time I get an error", then you deploy some new part of your stack, or a new version of something, or something completely irrelevant happens, and something starts spewing hundreds or thousands of error events into the log that really don't matter. Now I have to write rules to suppress that, or buy an AIOps tool to try to collapse them into one thing, so my pager doesn't ring all night. It's a real pain. To me, logs have been stuck in this rut for a while, and it boils down to being stuck in the index-and-search mentality. Index and search are just ways to speed up the manual work, and as long as you're doing the manual work, you're going to be required to maintain that work, and you're subject to the limitations of your own processes as you look around for what's going on. It's a self-limiting approach.
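To make that fragility point concrete, here is a minimal sketch. The log lines, the rule, and the "replicaton" typo are all hypothetical and not taken from any real system; it just shows a hand-written alert rule keyed to exact message text, and how an innocent wording fix upstream makes it silently stop firing.

```python
import re

# Hypothetical alert rule keyed to the exact (misspelled) message text.
ALERT_RULE = re.compile(r"replicaton lag (\d+)s exceeded")

def check(line: str):
    """Return the lag value we planned to alert on, or None if the rule doesn't fire."""
    m = ALERT_RULE.search(line)
    return int(m.group(1)) if m else None

old = "WARN replicaton lag 42s exceeded on standby-2"   # original message: rule fires
new = "WARN replication lag 42s exceeded on standby-2"  # upstream fixes the spelling

print(check(old))  # 42
print(check(new))  # None -> the rule silently stops firing
```

Nothing errors out and no one is paged about the rule itself; it simply never matches again, which is exactly the silent breakage described above.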
We touched on this a little already, but there's another issue: applications are bespoke, and this one is actually important. Let's say someone gives me a package that looks for errors in Postgres, and since I have Postgres in my application, I decide to use this machine learning package on it. That's great, but my application itself is bespoke. We've written it ourselves, it's in Go, it has its own logs, and what those logs mean, what's normal and abnormal, is completely custom to us. So that sense of normal has to be learned on my data, and there isn't going to be some giant multi-petabyte training data set to do it with. All of these things make it difficult to apply machine learning, in general, to actual monitoring and root cause problems.

So at this point I've thrown up my hands in despair. Let me step back and think about the very simple essence of what I actually want to do with these logs, and whether it's possible. The way I think about it is the junior SRE problem. It's day one; a junior SRE walks into a shop. There are a few things in the stack they're familiar with, and a giant wad of stuff they're completely unfamiliar with, having never worked with this application or stack before. Over time they start to learn what's normal. In very simple terms, and I should point out this is my approach to the junior SRE problem, their experience starts to crystallize around two very important recognition tasks.

One: I need to be able to recognize when something is bad. I know that sounds trivial and silly, but it isn't. There can be errors and warnings being spewed that don't matter at all; as I get to know my application better, I start to recognize when something bad is actually happening that I'm going to care about. Part of that recognition ends up being how widespread the observed badness is. Say I'm looking at one log from one container or one service, and it has some regular cadence of errors, but then I see a few fatals, and at the same time errors cropping up somewhere else in another service. That's going to be a cue that something bad is happening; you get that correlation across log streams, and it's important.

Two: something else that's useful in getting to the bottom of things is having a sense of what's rare. "Hmm, I've never seen that happen before." That sort of thing is equally critical to root-causing a new issue. Just these two concepts: what's bad, and what's rare. If you can get a handle on those and figure out when they're happening around the same time, there's a chance of surfacing root cause information that will be useful. That's the approach we're trying to take, as kindergarten as it sounds. I'll talk about how we're doing it, but there may be other approaches, and I'll discuss some of those after I get through ours, because right now there really is no one approach to this problem. We think we have the best one, but that's us; there are other people tackling this problem, and I want to talk about that too.
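As a rough illustration of those two signals, here is a toy sketch. It is not Zebrium's actual model; the event types, severity weights, and scoring formula are invented for the example. It scores a window of events by rarity (inverse of historical frequency) and badness (severity), so never-before-seen fatals float to the top.

```python
from collections import Counter

# Illustrative severity weights: "badness" of an event.
SEVERITY_WEIGHT = {"DEBUG": 0, "INFO": 0, "WARN": 1, "ERROR": 3, "FATAL": 5}

def interest_scores(events, history_counts: Counter):
    """events: list of (event_type, severity) seen in the current window.
       history_counts: how often each event type has been seen before."""
    total_history = sum(history_counts.values()) or 1
    scores = {}
    for etype, sev in events:
        # Rarity: inverse of historical frequency (never-seen types score highest).
        freq = history_counts.get(etype, 0) / total_history
        rarity = 1.0 / (freq + 1e-6)
        badness = SEVERITY_WEIGHT.get(sev, 0)
        scores[etype] = max(scores.get(etype, 0.0), rarity * (1 + badness))
    return scores

history = Counter({"conn_accepted": 90_000, "cache_miss": 9_000, "io_retry": 12})
window = [("conn_accepted", "INFO"), ("io_retry", "ERROR"), ("disk_offline", "FATAL")]

for etype, score in sorted(interest_scores(window, history).items(), key=lambda kv: -kv[1]):
    print(f"{etype:>14}  {score:12.1f}")
# disk_offline (never seen, FATAL) ranks first; conn_accepted (common, INFO) ranks last.
```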
Our approach is a complete relational structuring of logs, and we do it at ingest. It's not a batch process that runs later, trying to figure out what the different event types are and what the parameters in the events are out of my piles of log data; it has to be done at ingest. There are a number of reasons for that, but probably the most important is that the most important events are the ones I haven't seen very often. Back to our rareness criterion: when I see something new, that's when it matters the most. So structuring these logs is all well and good, but it had better do something reasonable with the first or second occurrence of an event.

The idea is very straightforward. The particular log I'm showing here is just pulled from a mocked-up JSON example; I've never seen a JSON log this simple, though I'm sure they exist. It's there to get the concept across. In fact, the more free-text a log is, the easier I've found it is, for our approach as well as other common approaches, to structure it naively with machine learning. The reason, ironically, is that the text of the log message contains a lot of locality that gives you information: this clump of tokens, say "something bad happened", means a lot as a whole phrase, and the next thing will be a parameter, and that parameter will always come after that phrase. If you can pull out those pieces of the events, you have a good chance of getting meaningful event types. The way we do that is by creating columns for each parameter: grab the stuff over here, grab the stuff over there, and put it into these columns. But the most important thing coming out of this is that we know what kind of event this is. When I see a log like this, there is one event type it belongs to. That's probably the most critical thing to walk away understanding. If you're dealing with an already-structured log, one of the first things you have to find out is whether there's some context, probably an attribute, that embodies what kind of event this is, and that's what you need to hone in on.

Given that, let's apply the requirements from earlier: we don't want to assume we know the prefix formats of the logs, we don't want to assume we know the grammars, and we don't want to assume we know keywords, because, again, this could be your own bespoke application, where there aren't going to be any known prefix formats or event grammars. The system has to learn all of that. If you can do that, you can embrace free-text logs, and once you've structured the data strictly, you can do anomaly detection on it really well.
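Here is a deliberately naive sketch of that idea. It is a far cry from a production multi-stage pipeline, and the parameter patterns are just illustrative; it shows how a free-text line can be split at ingest time into a constant event-type signature plus extracted parameter columns.

```python
import re

# Illustrative parameter patterns: tokens that look like values become columns.
PARAM_PATTERNS = [
    (re.compile(r"^\d+(\.\d+)?$"), "<num>"),
    (re.compile(r"^0x[0-9a-fA-F]+$"), "<hex>"),
    (re.compile(r"^\d{1,3}(\.\d{1,3}){3}(:\d+)?$"), "<ip>"),
    (re.compile(r"^/[\w./-]+$"), "<path>"),
]

def structure(line: str):
    """Split a free-text line into (event_type_signature, extracted_parameters)."""
    signature, params = [], []
    for tok in line.split():
        for pat, placeholder in PARAM_PATTERNS:
            if pat.match(tok):
                signature.append(placeholder)
                params.append(tok)
                break
        else:
            signature.append(tok)          # constant word: part of the event type
    return " ".join(signature), params

etype, params = structure("ERROR failed to mount /var/lib/data on 10.0.3.7:9042 after 3 retries")
print(etype)   # ERROR failed to mount <path> on <ip> after <num> retries
print(params)  # ['/var/lib/data', '10.0.3.7:9042', '3']
```

The signature string is what lets you say "this is one event type" on the very first occurrence, and the extracted parameters become the relational columns.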
The key is being able to tell that this class of events is one thing, that this is another kind of event, and that this thing that just came in, of which I only have one or two examples, is a separate kind of thing. I may still know enough to know it's different from something else even though I haven't figured out every parameter, and I may be able to cluster those. So there's a gradient from heuristics, to clustering, to classification, and depending on how many examples you've got of an event type, each of those phases will have a greater or lesser impact on the structure you've determined.

Once you've done that, you can start doing really interesting things. Back to our kindergarten observation: "boy, I haven't seen this in a while." Well, that's because you haven't seen that event type in a while; or you may see two relatively rare events happen more closely in time than they ever have before. Imagine this for every log stream: you've got a set of event types that every event could belong to, and you can conceptualize each one as a point process. Usually the rate of this kind of event in this stream is one thing, and the rate of that kind of event in that stream is another, and all of a sudden maybe I have upticks in those rates, or a very tight correlation in the timing of those events that I wouldn't usually have. Then you can expand that to include not just event types but also severities and errors: all of a sudden the rate of errors here went up, and not only that, those events are very correlated with events I'm seeing in the stream over there. Once you've taken that first fundamental step of structuring the events into event types, you can start to do this kind of correlation analysis, and to us this has proven to be a transformational step. If you do well enough on it, you can start to identify what I would call incidents, or at the very least clumps of stuff that will get you to a root cause indicator.

I think we've already talked about this, but there's a bunch of stuff we simply can't require in order to do that correlation. From our perspective, and maybe to a fault, you can't have any rules built into the system that know to look for a particular keyword. If you do that sort of thing, it may work in one instance, but it's not going to generalize. If we can make this approach work completely naively, with zero understanding of the semantics of the data it's ingesting and building models on, then we can have confidence it will work on any application or stack. That's been our approach, and I think that discipline is important. Otherwise, the next thing you know, you have a database of a thousand alert rules, and someone who buys that is no better off than they would have been with their own database of a hundred alert rules. That's one thing it's important to avoid.
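Here is a toy sketch of that cross-stream idea. It assumes fixed one-minute buckets and a crude mean-rate baseline, neither of which is claimed to be Zebrium's actual method; it flags minutes where an event type's count far exceeds its usual rate and then keeps only the minutes where several streams are anomalous together.

```python
from collections import defaultdict

def baseline_rates(history):
    """history: {(stream, event_type): [count_per_minute, ...]} -> mean rate per key."""
    return {k: sum(v) / max(len(v), 1) for k, v in history.items()}

def anomalous_minutes(window, rates, factor=5.0, min_rate=0.2):
    """window: {(stream, event_type): {minute: count}} -> {minute: [anomalous keys]}."""
    hits = defaultdict(list)
    for key, per_min in window.items():
        expected = max(rates.get(key, 0.0), min_rate)   # floor so rare types aren't trivially flagged
        for minute, count in per_min.items():
            if count > factor * expected:
                hits[minute].append(key)
    return hits

def correlated_incidents(hits, min_streams=2):
    """A 'clump': a minute where anomalies show up in several streams at once."""
    return {m: keys for m, keys in hits.items()
            if len({stream for stream, _ in keys}) >= min_streams}

history = {("cart", "timeout"): [0, 1, 0, 0], ("frontend", "http_500"): [0, 0, 0, 0]}
window = {("cart", "timeout"): {12: 40},
          ("frontend", "http_500"): {12: 15},
          ("cart", "gc_pause"): {7: 1}}

print(correlated_incidents(anomalous_minutes(window, baseline_rates(history))))
# {12: [('cart', 'timeout'), ('frontend', 'http_500')]}  <- two streams spike together
```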
I want to talk a little about other attempts that have been made to structure logs for root cause, or just for detection and monitoring. One set of approaches is deep learning. There are a number of papers on this, and an academic community that has been really interested in it. A couple of things I'd say about that. One: if someone is going to send their log data to a SaaS service, cost is going to matter to them; you're not going to rack some refrigerator-sized appliance to do actual deep learning on every data set you get. At the same time, and I've touched on this already, it's very difficult to take what's normal in one stack and environment and generalize it to another; depending on exactly what the stack is made of and exactly how it's used, normal can be a very different thing. So there are a number of conceptual challenges like that in the way. I do think the day will come when deep learning approaches are very successful at tackling telemetry from a naive standpoint, but I don't think we're quite there yet. In fact, some of the natural language models are probably closer than the deep learning models that have been more popular in the literature over the last few years.

Another thing you'll see is the use of a particular algorithm, usually LCS. For those in the audience who care about such things, that's longest common subsequence, with different implementations, some online, some batch. Essentially the idea is that this algorithm decides what your catalog of event types is. There are a couple of weaknesses there. One is that it doesn't really have an innate sense of types: depending on your implementation, you have to build in something to tell that these are different things, because this field is always an integer and that field is always a file. That's important, but I think a bigger, more conceptual barrier is that you don't get good structuring out of LCS with few examples, and you see this in a number of machine-learning-for-logs packages on the market today. It takes a lot of examples of a given event type to do a good job with it; otherwise it gets put into an "other" bucket. And because of the Pareto nature of logs, you always end up with some massive swath of your log data, the densest part, sitting in that other bucket, not yet effectively categorized. So in practical reality, I think it's important to have a continuum of approaches that you bring to bear depending on the cardinality of each event type you're seeing.

To summarize again: you've got to structure first, and you've got to do it inline, at ingest time, otherwise you can't respond to an incident. And you need a multi-stage structuring pipeline, at least in our view, to respect the Pareto distribution of event types in real-world logs.
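For reference, here is a small sketch of that LCS-style merging, using the subsequence variant on whitespace tokens. It also shows the weakness mentioned above: with only two examples, typed fields (a file path and an integer) both collapse into the same untyped wildcard.

```python
def lcs_template(a: str, b: str) -> str:
    """Merge two log lines into a template: keep the longest common subsequence
    of tokens, replace everything else with an untyped <*> wildcard."""
    a, b = a.split(), b.split()
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            dp[i][j] = dp[i + 1][j + 1] + 1 if a[i] == b[j] else max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        else:
            if not out or out[-1] != "<*>":
                out.append("<*>")
            if dp[i + 1][j] >= dp[i][j + 1]:
                i += 1
            else:
                j += 1
    if (i < n or j < m) and (not out or out[-1] != "<*>"):
        out.append("<*>")
    return " ".join(out)

print(lcs_template("opened file /tmp/a.log in 12 ms",
                   "opened file /var/log/syslog in 340 ms"))
# opened file <*> in <*> ms   <- path and duration become the same anonymous wildcard
```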
The good thing about using a correlation model, once you get past the structuring and start doing incident detection and cause report generation, is that more data helps: the more streams of logs and metrics with anomalies I can detect, the better job I can do of cross-correlating and picking out a point in time, and the better resolution I get. That's an important dimension of an effective solution. I haven't talked much about metrics yet, but you'll see a little of that in a minute.

You'll also see an example of this in a bit: a lot of you may have heard of GPT-3. It's a natural language model; there are competing models, and there are free downloads of similar models that you can get pre-trained or train yourself. It's a very exciting area of research right now. If you're going to pass a prompt into one of these models and get back something meaningful, the data you pass in has to be concise. It can't be tens or hundreds of kilobytes of data; you have to pass in a few things, with enough natural-language keywords in them, that the model can take a stab at responding to the prompt. One thing we've learned is that it's important to produce a root cause analysis report that's small enough for a human to read and digest, and also small enough for a machine to do the same, if you want to automate the summarization part.

With that, I'm going to walk through a couple of pictures and hand it back to Gavin. Here's an example where we had a stack and some incidents were detected. The color represents the severity of the event: green is debug, blue is info, and then you've got your warnings and errors in yellow and red, across these different services, minute by minute. The size of each marker represents how rare the event was; it's a representation of the inverse of the typical frequency of that event. So what you're seeing is the thing I was talking about a minute ago: you've got a lot of rare stuff, but you've also got bad stuff here too, and that doesn't usually happen. Pulling those things together is how we got a root cause report for this particular stack, and this is a typical detection profile.

It's interesting, once you get autonomous monitoring working on logs: you can deploy something like Litmus, the chaos engineering tool, run a test, and it will break some things. The interesting thing is that the tool itself logs what it's doing; it says "I'm initializing this", "I'm selecting a pod to kill", and so on. Well, guess what: if your autonomous monitoring solution is doing its job, it's going to realize that that is actually the root cause of the outcome. If you think about it that way, if the thing is really doing a good job, the root cause is the fact that you ran a chaos test, and here it is. I thought this was hilarious, because it actually makes it harder to replicate realistic scenarios, but we'll show an example in a minute.
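Before those examples, coming back to the conciseness point about language-model prompts: here is a minimal sketch of the shape of that idea. The event lines are invented, and the model call is left as a stub rather than tied to any particular API; the point is that only the handful of correlated events that make up the incident get rendered into a short prompt.

```python
def build_prompt(incident_events, max_events=8):
    """incident_events: list of (timestamp, stream, log_text) for one detected incident.
    Keep the prompt small: a few anomalous lines, not kilobytes of raw logs."""
    lines = [f"[{ts}] {stream}: {text}" for ts, stream, text in incident_events[:max_events]]
    return ("The following correlated log events were flagged as a likely incident. "
            "Describe the probable root cause in one short paragraph.\n\n"
            + "\n".join(lines) + "\n\nRoot cause:")

def summarize(prompt: str) -> str:
    # Placeholder: send the prompt to whatever language model you use (GPT-3 or similar)
    # and return its completion. Deliberately not tied to a specific client library here.
    raise NotImplementedError

incident = [
    ("10:41:02", "kube-events", "Scaled down replica set productcatalogservice to 0"),
    ("10:41:05", "frontend", "ERROR could not retrieve products: rpc error: connection refused"),
]
print(build_prompt(incident))
```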
Here are a couple of examples, pulled from a blog on our website, just some recipes for testing this and seeing what happens. What we're doing is pulling out the rare stuff here, pulling out the bad stuff here, and pulling features out of the metrics that give you more flavor, more confidence that you're on the right track to root-causing the problem. Here's an example where we do the same thing, except we've grabbed a description of the problem from GPT-3: we pass in our root cause report, get back a description, and put it in here. You can see the first thing that happened was that the OOM killer was invoked, and then all hell broke loose; and yes, in fact, this was an out-of-memory problem, and that's what got put in here. And this is another example of that same incident, in a more modern interface, which I think Gavin is going to show: you can see the swap free bytes from Prometheus dropping, and we pull out that anomaly and feed it directly into the same correlation model. It just goes in with the log anomalies, so you get everything stitched into one correlated report. With that, I'll pass back to Gavin.

Thanks Larry, I'm going to continue where I left off. Let me share my screen. If you remember, we have our broken microservices app over here. We know why we broke it, but as I mentioned before, we didn't build a rule for that, and the machine learning doesn't know anything about this environment; it's only seen a couple of hours of logs. So let's look at what Zebrium detected. You can see here that two incidents were actually detected, both for the same thing, and I'll focus on this one. Some pretty cool stuff: we just mentioned the GPT-3 natural language engine, and this is what it came back with. It pretty much nailed exactly what the problem was, which is really cool, and that was because we gave it very tight input, which was the whole incident. Then you see the hosts and logs where the correlated sets of anomalous patterns occurred. The first event in this case is kind of beautiful, because it exactly nails the root cause: it picked up the Kubernetes message about the product catalog service being scaled down to zero replicas. And then we see the dead container over here.

If I drill into the incident, this gives a bit more detail about what happened and what made up the incident from the machine learning's perspective. There are four log events that were picked up as anomalous in a correlated way: the scale-down, the successful delete, killing the container, and finally the container not running, so it's pretty clear what's happening. Then we see the anomalous metrics picked up at the same time, and these are really symptomatic of the problem: the CPU on one of the containers, which had been buzzing along, suddenly took a dive at exactly the time we killed the pod, so we picked that up as correlated metrics for this incident. It's just a really quick demo, but to summarize: I did nothing unnatural here. I installed our log collector and metrics collector, which sent this telemetry to our app; we learned the patterns, and then we found this anomalous pattern in the jumble of everything else.
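As a toy illustration of the metrics side that Gavin and Larry both described, here is how a sudden dive in a container CPU series could be flagged with a timestamp, so it can be dropped into the same time-correlation window as the log anomalies. The z-score test and thresholds are illustrative, not the actual Zebrium model.

```python
import statistics

def metric_anomaly(history, value, timestamp, threshold=4.0):
    """history: recent samples for one series (e.g. container CPU or swap free bytes).
    Returns an anomaly record (timestamp + score) if the new sample is far outside
    the usual range, else None."""
    if len(history) < 10:
        return None                                   # not enough baseline yet
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history) or 1e-9
    z = abs(value - mean) / stdev
    return {"t": timestamp, "z": round(z, 1)} if z > threshold else None

cpu = [0.62, 0.60, 0.63, 0.61, 0.59, 0.64, 0.62, 0.60, 0.61, 0.63]  # a busy container
print(metric_anomaly(cpu, 0.02, "10:41:06"))   # sudden dive when the pod is killed -> anomaly
print(metric_anomaly(cpu, 0.61, "10:42:00"))   # normal sample -> None
```

The emitted timestamp is the useful part: a metric anomaly at 10:41:06 lands in the same correlation window as the scale-down and container-killed log events, which is how the two kinds of telemetry end up stitched into one report.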
That's it. If anybody's interested, and I think Larry has the link in his presentation, we've documented how to bring up this demo app in a minikube environment, so you can test it yourself, break it any way you choose, and see what Zebrium picks up. Thank you very much, and I'll pass it back to Larry.

Okay, thanks Gavin. Let me bring this back up. We've had some recent validation in the market. MayaData, who make the Litmus toolset I mentioned and also OpenEBS, which is essentially storage software, went in and said, okay, we're going to replicate the outages our real customers had. They picked six or seven of them, replicated them in their environment, and we picked those up with root cause indicators, so that was cool. DZone wrote a great article about us: the author basically put in the charts, spun up the software, did something to break something, and then saw it show up, which was also cool. And one of our customers right now, a music equipment retailer, concluded that we've dropped their root cause time for new incidents from three hours to 15 minutes, which has made a huge dent in their ability to deliver high-quality service to their users.

Everyone is encouraged to join us on this journey. I'm always happy to have discussions with anyone about log machine learning and anomaly detection in general, and any ideas people have; I love those discussions. We have a Slack community as well that you can participate in. With that, I want to thank everyone here for your time and your interest in machine learning on logs in Kubernetes.

Alright, thanks everyone. We have about ten minutes left and two questions so far in the Q&A box, so if there are any more, go ahead and pop them in there. I'll hand it back over to you, Larry, to get started on these. Okay, am I supposed to look in the chat right now? I can read them to you if you want. Okay, just pick some off.

First: do we have ingest points around the globe? By default, when you become a user, you go into our AWS deployment, which is in the US West region, but we have spun up instances in other geos on request, so don't be shy; we don't mind doing it.

Next: how long does it take for training to complete, how frequently is the model trained, and are GPUs required? GPUs are not required. And there really is no set training period. Basically there are a bunch of parameters being estimated; those estimates are poor ten minutes after you install the software, and after a day or two they're probably really good. So there's a continuum over the first few hours where it becomes more and more useful, and sometimes you'll see spurious stuff popping up at the beginning. That's the approach we've taken, because a lot of people just want to spin it up, run some chaos tests,
and be done, so we let them do that; but if you do, you're going to get some noise, and some things may not get picked up exactly right.

Okay, here's another one: any thoughts or comments on using nearest-neighbor analysis for correlation? Hmm, can you be more specific about what you mean by nearest-neighbor analysis? Or should I go to the chat? Let me see if I can get into the Q&A here, under the Q&A banner. Okay, perfect. That was all that was written; go ahead and add some more detail and I'll come back to it.

Next: any plans for running this on-prem as well? Yes, actually, that's a very good question. As it turns out, a lot of people really want this run on-prem when it has to do with logs; they're a little afraid of PII and that sort of thing, and it depends on the user. We have a project that we kicked off this week to build exactly such a thing, so get in touch with us if you're interested in being one of the first pilot users of the on-prem version. It's packaged as a kind of virtual appliance.

Next: what API do you need access to in order to collect the logs from the cluster? Rod, do you understand this question? I'm not sure what's being asked. I think the simple way to answer it is that you don't need to worry about it: the collector deploys as a DaemonSet, it picks up container logs automatically from the container log outputs, and it uses the Kubernetes API to pick up events. If you're interested you can examine our documentation on GitHub, or ask us for more details. The short answer is you shouldn't have to worry about it; it's just a Helm chart, and the DaemonSet automatically collects everything. Right, and maybe the question was about what gets deployed when you put in the chart: it's a Fluentd-based collector for logs and a Prometheus scraper for metrics.

Next: there are other tools for log anomaly detection; how do you compare and differentiate from them? Ajay, do you want to address this one? Sure. Log anomaly detection is an area of interest and has been for about a decade; Larry mentioned some of the academic research in the space, LCS, deep learning and so on. There are projects, and even commercial products, that have attempted anomaly detection to various degrees, but they haven't gone as far as we go in terms of correlating anomalies and detecting incidents. We have one such comparison with the Elastic machine learning pack on our website, and we even have a short video comparing them side by side; that might be the best place to start. It's under the blogs page; just look for Elastic, and you'll see a side-by-side comparison of our machine learning with Elastic's.

Okay, now I understand that earlier nearest-neighbor question. I mentioned we have a multi-stage pipeline for structuring. The first stage is actually heuristic, for when I've only seen something once or twice; there's a bit of a chicken-and-egg problem, but if I don't already have a parse for this thing, I'll start by looking at it with heuristics, as well as a couple of default tokenizations. Then I'll do an attempted clustering of it with the other things I haven't yet bucketed. That clustering is done using a kind of reachability clustering: I'll pick a couple of tokenizations and ask which tokenization gets me the most compatible-looking set. By reachability clustering I mean that with a given tokenization I have, say, 18 tokens, and if I can build the set by reaching from one example to the next with only two parametric tokens differing, then I'll use that as a cluster. So it's not a trivial nearest-neighbor, bag-of-words kind of thing, but it is a clustering approach. Hopefully that helps; shoot me an email if you want to talk more.
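Here is a rough sketch of that reachability idea as described in the answer above. The function names and the two-token threshold are illustrative, not Zebrium's implementation; it grows a cluster by hopping from one tokenized example to the next as long as they have the same length and differ in at most two positions, the would-be parameters.

```python
def differing_positions(a, b):
    """Indices where two equal-length token lists disagree."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

def reachability_clusters(lines, max_diff=2):
    """Cluster unbucketed lines: a new line joins a cluster if it can 'reach' the
    cluster's most recent member with the same token count and <= max_diff differences."""
    clusters = []
    for toks in (line.split() for line in lines):
        for cluster in clusters:
            probe = cluster[-1]                      # reach from the most recent member
            if len(probe) == len(toks) and len(differing_positions(probe, toks)) <= max_diff:
                cluster.append(toks)
                break
        else:
            clusters.append([toks])                  # no reachable cluster: start a new one
    return clusters

lines = [
    "connection from 10.0.0.4 closed after 120 s",
    "connection from 10.0.0.9 closed after 7 s",
    "checkpoint complete wrote 8 buffers",
]
for c in reachability_clusters(lines):
    print(len(c), " ".join(c[0]))
# 2 connection from 10.0.0.4 closed after 120 s   <- two members, differing only in IP and duration
# 1 checkpoint complete wrote 8 buffers
```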
All right, three minutes left, and it looks like the two remaining questions are getting answered. Is there anything else we want to cover? Oh wait, one just came in, and we've got literally 60 seconds.

Right, so this is about the approach used in tracing, where you might see logs show up with a trace ID, and the question is: what's the difference? The difference is that I don't need support for it throughout my stack, and I don't need to go trace anything. Remember, the whole idea was to create something that can work without alert rules, and likewise we don't want to require tracing through every relevant code path. To us that felt like a self-defeating prospect, just trading one set of work for another, or one set of limitations for another, and that's why we tried to avoid requiring it. I hope that made sense.

Okay. Thank you everyone for coming and for participating, thank you for the robust Q&A and all the back and forth, and thanks to the panelists for helping answer questions through the chat. Just a reminder, this will be up on the website later today. We look forward to seeing everyone at a future CNCF webinar. Thanks so much, Larry. Thanks, everyone else.