Hi, everybody. Thank you for joining us today for the Virtual Vertica BDC 2020. Today's breakout session is entitled Autonomous Monitoring Using Machine Learning. My name is Sue LeClaire, Director of Marketing at Vertica, and I'll be your host for this session. Joining me is Larry Lancaster, Founder and CTO at Zebrium. Before we begin, I encourage you to submit questions or comments during the virtual session. You don't have to wait. Just type your question or comment in the question box below the slides and click submit. There will be a Q&A session at the end of the presentation, and we'll answer as many questions as we're able to during that time. Any questions that we don't address, we'll do our best to answer offline. Alternatively, you can also visit the Vertica forums to post your questions after the session. Our engineering team is planning to join the forums to keep the conversation going. Also, just a reminder that you can maximize your screen by clicking the double arrow button in the lower right corner of the slide. And yes, this virtual session is being recorded and will be available for you to view on demand later this week. We'll send you a notification as soon as it's ready. So let's get started. Larry, over to you. Hey, great. Thanks so much. So hi, my name is Larry Lancaster, and I'm here to talk to you today about something whose time I think has come, and that's autonomous monitoring. So with that, let's get into it. So machine data is my life. I know that's a sad life, but it's true. So I've spent most of my career taking telemetry data from products, either in the field, as we used to call it, or nowadays deployed, and bringing that data back, like log files and stats, and then building stuff on top of it: tools to run the business, or services to sell back to users and customers.
And so, you know, after doing that a few times, it kind of got to the point where, you know, I was really sort of sick of building the same kind of thing from scratch every time. So I figured why not go start a company and do it so that we don't have to do it manually ever again. So it's interesting to note I've put a little sentence here saying companies where I got to use Vertica. So I've been actually kind of working with Vertica for a long time now, pretty much since they came out of Alpha, and I've really been enjoying their technology ever since. So our vision is basically that I want a system that will characterize incidents before I notice. An incident is, you know, we used to call it a support case or a ticket in IT or a support case in support. Nowadays, you may have a DevOps team or a set of SREs who are monitoring a production sort of deployment, and so they'll call it an incident. So I'm looking for something that will notice and characterize an incident before I notice and have to go digging into log files and stats to figure out what happened, right? And so that's a pretty heady goal, and so I'm going to talk a little bit today about how we do that. So if we look at logs in particular, logs today, if you look at log monitoring, it's kind of that whole umbrella term that we use to talk about how we monitor systems in the field that we've shipped or how we monitor production deployments in a more modern stack. And so basically there are log monitoring tools, but they have a number of drawbacks. And for one thing, they're kind of slow in the sense that if something breaks and I need to go to a log file, actually chances are really good, right, that if you have a new issue, if it's an unknown, unknown problem, you're going to end up in a log file. And so the problem then becomes basically you're searching around looking for what's the root cause of the incident, right? And so that's kind of time consuming. 
So they're also fragile, and this is largely because log data is completely unstructured, right? So there's no formal grammar for a log file, right? And so you have this situation where if I write a parser today and that parser is going to do something, it's going to execute some automation, it's going to open or update a ticket, it's going to maybe restart a service or whatever it is that I want to happen. What will happen is later upstream someone who's writing the code that produces that log message, they might do something really useful for users and they might go fix a spelling mistake in that log message. And then the next thing you know, all the automation breaks, right? So it's a very fragile source for automation. And finally, because of that, people will set alerts on, oh, well, tell me how many thousands of errors are happening every hour or some horrible metric like that. And then that becomes the only visibility you have in the data. So because of all this, it's a very human driven, slow, fragile process. And so, you know, basically we've set out to kind of up level that a bit. Yeah, so I touched on this already, right? So the truth is if you do have an incident, you're going to end up in log files to do root cause. It's almost always the case. And so you have to wonder if that's the case why do most people use metrics only for monitoring? And the reason is related to the problems I just described, they're already structured, right? So for logs, you've got this mess of stuff, and so you only want to dig in there when you absolutely have to. But ironically, it's where a lot of the information that you need actually is, right? So we have a model today, and this model used to work pretty well. And that model is called index and search. And it basically means you treat log files like they're text documents. And so you index them. And when there's some issue you have to drill into, then you go searching, right? So let's look at that model, right? 
So 20 years ago, we had sort of a shrink-wrap software delivery model. You had an incident, and with that incident, maybe you had one customer and you had, you know, a monolithic application and a handful of log files. And so it was perfectly natural. In fact, usually you could just vi the log file and search that way. Or if there were a lot of them, you could index them and search them that way. And that all worked very well. It scaled because the developer or the support engineer had to be an expert in those few things and those few log files and understand what they meant. But today, everything has changed completely, right? So we live in a software-as-a-service world. What that means is, for a given incident, first of all, you're going to be affecting thousands of users. You're going to have potentially 100 services deployed in your environment. You're going to have a thousand log streams to sift through. And yet, you're still stuck in the situation where, to go find out what's the matter, you're going to have to search through the log files. So this is kind of the unacceptable position we're in today. So for us, the future will not be index and search. And that's simply because it cannot scale. And the reason I say it can't scale is because it's all bottlenecked by a person and their eyeball. So you continue to drive up the amount of data that has to be sifted through, and the complexity of the stack that has to be understood, and at the end of the day, for MTTR purposes, you still have the same bottleneck, which is the eyeball. And so this model, I believe, is fundamentally broken. And that's why I believe that in five years you're going to be in a situation where most monitoring of unknown-unknown problems is going to be done autonomously, and those issues will be characterized autonomously, because there's no other way it can happen. So I'm going to talk a little bit about autonomous monitoring itself.
So autonomous monitoring basically means: if you can imagine a monitoring platform, right, you watch the monitoring platform, maybe you watch the alerts coming from it, or, more importantly, you kind of watch the dashboards and try to see if something looks weird, right? So autonomous monitoring is the notion that the platform should do the watching for you, only let you know when something is going wrong, and give you a window into what happened. So if you look at this example I have on screen, just to take it really slow and absorb the concept of autonomous monitoring: the idea is that here in this example we've stopped the database. And as a result, down below you can see there was a bunch of fallout. This is an Atlassian stack, so you can imagine you've got a Postgres database and then you've got Bitbucket and Confluence and Jira and these various other components that need the database operating in order to function. And so what this is doing is it's calling out, hey, the root cause is the database stopped, and here are the symptoms. Now you might be wondering, so what, I mean I could go write a script to do this sort of thing. So here's what's interesting about this particular example, and I'll show a couple more examples that are a little more involved. But here's the interesting thing. In the software that came up with this incident, opened it, and put this root cause and these symptoms in there, there's no code that knows anything about timestamp formats, severities, Atlassian, Postgres, databases, Bitbucket, or Confluence. There are no regexes that talk about starting, stopped, rdbms, swallowed an exception, and so on and so forth.
So you might wonder how it's possible, then, that something which is completely ignorant of the stack could come up with this description, which is exactly what a human would have had to do to figure out what happened. I'm going to get into how we do that, but that's what autonomous monitoring is about. It's about getting into a set of telemetry from a stack with no prior information and understanding when something breaks. And I can give you the punchline right now, which is: there are fundamental ways that software behaves when it's breaking, and by looking at hundreds of data sets containing incidents that people have generously allowed us to use, we've been able to characterize that behavior and generalize it to apply to any new data set and stack. So here's an interesting one right here, right? So there's a fellow, David Gildey, just a genius in the monitoring space, who's been working with us for the last couple of months. So he said, you know what I'm going to do? I'm going to run some chaos experiments. For those of you who don't know what chaos engineering is, here's the idea. Basically, let's say I'm running a Kubernetes cluster, and what I'll do is I'll use a chaos injection tool, something like Litmus, and basically it will inject issues, it'll break things in my application randomly, to see if my monitoring picks it up. That's what chaos engineering is built around: generating lots of random problems and seeing how the stack responds. So in this particular case, David went in and ran one of the tests presented through Litmus, a pod-delete experiment, and that's going to basically take out some containers that are part of the service layer, and then you'll see all kinds of things break. And what you're seeing here is interesting. This is why I like to use this example, because it's actually kind of eye-opening.
So the chaos tool itself generates logs, and of course through Kubernetes, all the log file locations on the host and the container logs are known, and those are all pulled back to us automatically. So one of the log files we have is actually from the chaos tool that's doing the breaking, right? And what the tool said here, when it went to determine what the root cause was, was that it noticed there was this process that had these messages: initializing deletion lists, selecting a pod to kill, blah, blah, blah. It's saying that the root cause is the chaos test, and it's absolutely right. That is the root cause. But usually chaos tests don't get picked up themselves. You're supposed to be just picking up the symptoms. But this is what happens when you're able to tease out root cause from symptoms autonomously: you end up getting a much more meaningful answer. So here's another example. Essentially we collect the log files, but we also have a Prometheus scraper. So if you export Prometheus metrics, we'll scrape those and collect those as well, and we'll use those for our autonomous monitoring too. What you're seeing here is an issue where, I believe, we ran something out of disk space. So it opened an incident, but what's also interesting here is you see that it pulled in that metric to say that the spike in this metric was a symptom of running out of space. And again, there's nothing that knows anything about file system usage, memory, CPU, any of that stuff. There's no hard-coded logic anywhere to explain any of this. So the concept of autonomous monitoring is looking at a stack the way a human being would. If you can imagine how you would walk in and monitor something, how you would think about it: you go looking around for rare things, things that are not normal.
Then you would look for indicators of breakage, and you would see whether those seem to be correlated in some dimension. That is how the system works. So as I mentioned a moment ago, metrics really do complete the picture for us. So we end up in a situation where we have a one-stop shop for incident root cause. So how does that work? Well, we ingest and we structure the log files. So if we're getting the logs, we'll ingest them and we'll structure them. I'm going to show a little bit what that structure looks like and how it goes into the database in a moment. And then of course we ingest and structure the Prometheus metrics. But here, structure really should have an asterisk next to it, because the metrics are mostly structured already. They have names. If you have your own scraper, as opposed to going into the Prometheus time series database and pulling metrics from there, you can keep a lot more metadata about those metrics from the exporter's perspective. So we keep all of that too. Then we do our anomaly detection on both of those sets of data, then we cross-correlate metric and log anomalies, and then we create incidents. So this is, at a high level, what's happening, without any sort of stack-specific logic built in. So we had some exciting recent validation. So MayaData is a pretty big player in the Kubernetes space. Essentially they do Kubernetes as a managed service. They have tens of thousands of customers whose Kubernetes clusters they manage for them. And they are also involved both in the OpenEBS project as well as in the Litmus project I mentioned a moment ago. That's their tool for chaos engineering. So they're a pretty big player in the Kubernetes space. So essentially they said, okay, let's see if this is real. So what they did was they set up our collectors, which took three minutes in Kubernetes.
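The ingest-structure-detect-correlate pipeline described above can be sketched in a few lines. This is a hypothetical illustration of the concept, not Zebrium's implementation; the function names, the "never seen before" rarity heuristic, and the sample data are all mine.

```python
def detect_rare_events(event_counts, history):
    # Flag event types that have never been seen in the history.
    # A real system would model per-event-type occurrence distributions;
    # this sketch just treats "never seen before" as anomalous.
    return [etype for etype in event_counts if history.get(etype, 0) == 0]

def create_incident(log_anomalies, metric_anomalies):
    # Cross-correlate: open an incident only when anomalous log events
    # and anomalous metrics show up together in the same time window.
    if log_anomalies and metric_anomalies:
        return {"root_cause_candidates": log_anomalies,
                "symptoms": metric_anomalies}
    return None

# Hypothetical data: familiar event types vs. a brand-new fatal message,
# plus a pretend metric anomaly from the same window.
history = {"db connection ok": 10000, "checkpoint written": 500}
current = {"db connection ok": 3, "FATAL: database stopped": 1}
incident = create_incident(
    detect_rare_events(current, history),
    ["postgres_up dropped to 0"],
)
```

The point of the sketch is only the shape of the logic: no stack-specific knowledge appears anywhere, yet the rare log event and the metric anomaly combine into a root-cause-plus-symptoms incident.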
And then, using Litmus, they reproduced eight incidents that their actual real-world customers had hit, and they were trying to remember the ones that were the hardest to figure out the root cause of at the time. And we picked up and put a root cause indicator that was correct in 100% of these incidents, with no training, configuration, or metadata required. So this is kind of what autonomous monitoring is all about. So now I'm going to talk a little bit about how it works. So like I said, there's no information included or required about the stack. So if you imagine a log file, for example, right? Commonly, over to the left-hand side of every line, there will be some sort of a prefix. And what I mean by that is you'll see a timestamp and a severity, and maybe there's a PID, and maybe there's a function name, and maybe there's some other stuff there. So basically it's common data elements for a large portion of the lines in a given log file, but of course the content changes. So today, if you look at a typical log manager, they'll talk about connectors. What connectors means is that an application will generate a certain prefix format in a log, and the connector describes the format of the timestamp and what else is in the prefix, and this lets the tool pick it up. And so if you have an app that doesn't have a connector, you're out of luck. Well, what we do is we learn those prefixes dynamically with machine learning. You do not have to have a connector, right? And what that means is that if you come in with your own application, the system will just work for it from day one. You don't have to have connectors. You don't have to describe the prefix format. That's so yesterday, right? So really what we want to be doing is up-leveling what the system is doing to the point where it's working like a human would, right? You look at a log line. You know what's a timestamp. You know what's a PID.
You know what's a function name. You know where the prefix ends and where the variable parts begin. You know what's a parameter over there in the variable parts. And sometimes you may need to see a couple of examples to know what was a variable, but you'll figure it out as quickly as possible, and that's exactly how the system learns it. As a result, we embrace free-text logs, right? So if you look at a typical stack, most of the logs generated are usually free text. Even structured logging typically will have a message attribute, which then inside of it has the free-text message, right? So for us, that's not a bad thing. That's okay. In fact, I'd prefer that people just write logs naturally; the purpose of a log is to inform people. And so there's no need to go rewrite the whole logging stack just because you want a machine to handle it. Why can't machines go figure it out for themselves, right? So you give us the logs and we'll figure out the grammar, not only for the prefix, but also for the variable message part. So I already went into this, but there's more that's usually required for configuring a log manager with alerts. You have to give it keywords. You have to give it application behaviors. You have to tell it some prior knowledge. And of course the problem with all of that is that the most important events that you'll ever see in a log file are the rarest, right? Those are the ones that are one in a billion. And so you may not know in advance what's going to be the right keyword to pick up the next breakage, right? So we don't want that information from you. We'll figure that out for ourselves. So as the data comes in, essentially we parse it and we categorize it, as I mentioned. And when I say categorize, what I mean is: if you look at a given log file, you'll notice that some of the lines are kind of the same thing, right?
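The prefix-learning idea, figuring out where the fixed prefix (timestamp, severity, PID) ends without any connector, can be illustrated with a crude heuristic: reduce each whitespace-separated token to a digit/letter "shape" and extend the prefix while the shape stays constant across sample lines. This is a simplified sketch of the concept, not Zebrium's actual algorithm; a real system would use more evidence (value cardinality, statistics across files), since shape agreement alone can over-extend into the message. The sample log lines are invented.

```python
import re

def token_shape(tok):
    # Reduce a token to a coarse shape: letter runs -> 'A', digit runs -> 'D'.
    # "2020-03-30" -> "D-D-D", "INFO" -> "A", "4221" -> "D"
    return re.sub(r"\d+", "D", re.sub(r"[A-Za-z]+", "A", tok))

def learn_prefix_width(lines):
    # Guess how many leading tokens form the fixed prefix by finding the
    # longest leading span whose token shapes agree across all samples.
    tokenized = [line.split() for line in lines]
    width = 0
    for i in range(min(len(toks) for toks in tokenized)):
        if len({token_shape(toks[i]) for toks in tokenized}) == 1:
            width = i + 1
        else:
            break
    return width

samples = [
    "2020-03-30 12:00:01 INFO 4221 connected to backend",
    "2020-03-30 12:00:05 ERROR 4221 7 retries exhausted",
    "2020-03-30 12:01:11 INFO 4222 checkpoint written",
]
width = learn_prefix_width(samples)  # date, time, severity, PID -> 4
```

Nothing here knows what a timestamp or a PID is; the regularity of the prefix alone is what gives it away, which is the "work like a human would" point above.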
So this one will say x happened five times, and then maybe a few lines below it will say x happened six times, but that's basically the same event type. It's just a different instance of that event type, and it has a different value for one of the parameters, right? So when I say categorization, what I mean is figuring out those unique types. And I'll show an example of that next. Anomaly detection, we do on top of that. So anomaly detection on metrics in a very sort of time series by time series manner with lots of tunables is a well-understood problem. So we also do this on the event type occurrences. You can think of each event type occurring in time as sort of a point process, and then you can develop statistics and distributions on that, and you can do anomaly detection on those. So once we have all of that, we've kind of extracted features essentially from metrics and from logs. We do pattern recognition on the correlations across different channels of information. So different event types, different log types, different hosts, different containers, and then of course across to the metrics. And based on all this cross correlation, we end up with a root cause identification. So that's essentially at a high level how it works. What's interesting from the perspective of this call particularly is that incident detection needs relationally structured data. It really does. You need to have all the instances of a certain event type that you've ever seen easily accessible. You need to have the values for a given sort of parameter easily, quickly available so you can figure out what's the distribution of this over time, how often does this event type happen? You can run analytical queries against that information so that you can quickly in real time do anomaly detection against new data. So here's an example of what this looks like and this is kind of part of the work that we've done. So at the top you see some examples of log lines. 
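Categorization, collapsing "x happened five times" and "x happened six times" into one event type, can be sketched by masking the likely variable parts of a message. This is an illustrative simplification (real systems learn which positions vary from examples rather than relying on fixed masks); the masks and sample messages here are my own.

```python
import re
from collections import Counter

def event_type(message):
    # Collapse a free-text message to its event type by masking likely
    # variable parts: hex ids first, then IPv4 addresses, then numbers.
    t = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", message)
    t = re.sub(r"\b\d{1,3}(\.\d{1,3}){3}\b", "<IP>", t)
    t = re.sub(r"\d+", "<NUM>", t)
    return t

messages = [
    "x happened 5 times",
    "x happened 6 times",
    "connection from 10.0.0.1 refused",
]
# Counting occurrences per event type is the raw material for the
# point-process-style anomaly detection described above.
counts = Counter(event_type(m) for m in messages)
```

Once every line maps to an event type, each type's occurrences over time form the series you can build distributions on and flag anomalies against.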
So that's kind of a snippet of three lines out of a log file. And you see the one in the middle there that's highlighted with colors. And this is, I mean, it's a little messy, but it's not atypical of the log files that you'll see pretty much anywhere. So there you've got a timestamp and a severity and a function name, and then you've got some other information. And then finally you have the variable part, and you can tell that "checkpoint from memory scrubber" is probably written in English just so that the person reading the log file can understand it, and then there are some parameters put in, right? So now if you look at how we structure that, the way it looks is there are going to be three tables, corresponding to the three event types that we see above. And we're going to look at the one that corresponds to the line in the middle. So if we look at that table, you'll see a table with columns: one for severity, for function name, for time zone and so on, and date and PID. And then you see, over to the right with the colored columns, the parameters that were pulled out from the variable part of that message. So they're put in, they're typed, and they're in integer columns. So this is the way structuring needs to work with logs to be able to do efficient and effective anomaly detection. And as far as I know, we're the first people to do this inline. So let's move forward here. All right, so let's talk now about Vertica and why we take those tables and put them in Vertica. So Vertica really is an MPP column store, but it's more than that, because nowadays when you say column store, people sort of think, oh yeah, for example Cassandra is a column store or whatever. But it's not, right? Cassandra is not a column store in the sense that Vertica is.
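The per-event-type table with typed parameter columns can be sketched as follows. The template, the column names (`param_0`, `param_1`), and the sample lines are hypothetical, loosely modeled on the "checkpoint from memory scrubber" example from the slide; this only illustrates the idea of turning each instance of an event type into a row of typed columns.

```python
import re

# Hypothetical template for one event type; the two (\d+) groups are the
# variable parameters that become typed integer columns.
TEMPLATE = re.compile(r"checkpoint (\d+) from memory scrubber took (\d+) ms")

table = {"param_0": [], "param_1": []}  # integer columns for this event type
for line in [
    "checkpoint 12 from memory scrubber took 34 ms",
    "checkpoint 13 from memory scrubber took 29 ms",
]:
    m = TEMPLATE.match(line)
    if m:  # each instance of the event type becomes one row
        table["param_0"].append(int(m.group(1)))
        table["param_1"].append(int(m.group(2)))
```

With the parameters in typed columns, "what's the distribution of this value over time" becomes an ordinary analytical query instead of a log search.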
So Vertica was kind of built from the ground up to be, you know, the original column store, right? So back in the C-Store project, right, that Stonebraker was involved in, he said, let's explore what kind of efficiencies we could get out of a real columnar database. And what he and the grad students who started Vertica found was that they could build a database that gets orders of magnitude better query performance for the kinds of analytics I'm talking about here today, with orders of magnitude less data storage underneath. So when we look at that, right, building on top of machine data, as I mentioned, is hard because it doesn't have any defined schemas. But we can use an RDBMS like Vertica, once we've structured the data, to do the analytics that we need to do. So I talked a little bit about this, but if you think about machine data in general, it's perfectly suited for a column store. Because if you imagine laying out all the attributes of an event type, right, there may be only, say, three or four function names that will ever occur across all the instances of a given event type. And so if you were to sort all of those event instances by function name, what you would find is that you'd have long, million-long runs of the same function name over and over. So what you have in general in machine data is lots and lots of slowly varying attributes, lots of low-cardinality data that gets almost completely compressed out when you use a real column store. So you end up with a massive footprint reduction on disk. And that propagates through the analytical pipeline, because Vertica does late materialization, which means it tries to carry that data through memory with that same efficiency, right?
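The compression effect described here, long runs of the same value collapsing to almost nothing once a low-cardinality column is sorted, can be demonstrated with a toy run-length encoder. This is a minimal sketch of the principle, not Vertica's actual encoding (Vertica uses several encodings, chosen per column); the column data is invented.

```python
from itertools import groupby

def run_length_encode(column):
    # Encode a column as (value, run_length) pairs -- the basic trick a
    # column store applies to sorted, low-cardinality data.
    return [(value, len(list(run))) for value, run in groupby(column)]

# A 'function_name' column: thousands of rows, only three distinct values.
column = ["recv", "send", "recv", "flush", "send", "recv"] * 1000
encoded = run_length_encode(sorted(column))
# 6000 row values collapse to just 3 (value, count) pairs once sorted
```

Sorting is what turns "slowly varying" into "long runs": unsorted, this column would need many more pairs, which is why the sort order of projections matters so much in a column store.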
So the scale-out architecture, of course, is really suitable for petabyte-scale workloads. Also, I should point out, I was going to mention it in another slide or two, but we use the Vertica Eon architecture, and we have had no problems scaling it in the cloud. It's a beautiful rewrite of the entire data layer of Vertica. The performance and flexibility of Eon is just unbelievable, and I've really been enjoying using it. I was skeptical you could get a real column store to run in the cloud effectively, but I was completely wrong. So finally, I should mention that if you look at column stores, to me, Vertica is the one that has the full SQL support. It has the ODBC drivers. It has the ACID compliance, which means I don't need to worry about these things as an application developer, right? So I'm laying out the reasons that I like to use Vertica. Right, so I touched on this already, but what's amazing is that Vertica Eon is basically using S3 as an object store. And of course, there are other offerings, like the one that Vertica does with Pure Storage, that don't use S3. But what I find amazing is how well the system performs using S3 as an object store, and how they manage to keep an actually consistent database, and they do. I mean, we've had issues where hosts have been shut down on us and we've had to restart the database, and we don't have any consistency issues. It's unbelievable, the work that they've done. So essentially, another thing that's great about the way it works is you can use S3 as a shared object store. You can have query nodes querying from that set of files largely independently of the nodes that are writing to them. So you avoid this bottleneck issue where you've got contention over who's writing what and who's reading what and so on.
So I've found the performance using separate subclusters for our UI and for ingest has been amazing. Another couple of things they have: a lot of in-database machine learning libraries. There's actually some cool stuff on their GitHub that we've used. One thing that we make a lot of use of is the sequence and time series analytics. So for example, in our product, even though we do all this stuff autonomously, you can also go create alerts for yourself. And one of the kinds of alerts you can create says: if this kind of event happens within so much time, and then this kind of event happens but not this one, then you can be alerted, right? So you can define these kinds of sequences of events that would indicate a problem, and we use their sequence analytics for that. It gives you really good performance on these sorts of queries where you want to pull out sequences of events from a fact table, right? And time series analytics is really useful if you want to do analytics on the metrics and do gap-filling interpolation on them. It's actually really fast, and it's easy to use through SQL. So those are a couple of Vertica extensions that we use. So finally, I would just like to encourage everybody: hey, come try this out. It should be up and running in a few minutes if you're using Kubernetes. If not, it won't take you long to run an installer. You can just come to our website, pick it up, and try out autonomous monitoring. And I want to thank everybody for your time. And we can open it up for Q&A.