Greetings and welcome. My name is Gavin Cohen, and I'm the VP of Product at Zebrium. I'm here virtually with Larry Lancaster, our founder and CTO, and Aran Khanna, the co-founder and CEO of Reserved AI. To get started, I'm going to demonstrate the problem that we address. I've deployed Online Boutique, a microservices demo app, on a Kubernetes cluster on my laptop using Minikube. Now, the catch is I've deliberately broken it. Let's see what happens when I try to buy something. I'm going to buy the home barista kit. Uh-oh, the dreaded and very generic 500 error. And what I see here really doesn't help explain what happened. Now, in real-life deployments, when something breaks, there are two key steps: detecting that something broke, and finding the root cause. For detection, most companies use some kind of monitoring or APM tool. In this case, I was using a Kiali service graph, but this could have been any other monitoring dashboard. What we see here is a lot of red, confirming things are broken. But once again, it doesn't shed much light on what happened. In many environments, monitoring is integrated with an incident management tool, with rules that automatically trigger incidents when things go wrong. Here, I'm showing a PagerDuty incident that has been created and shows up in a Slack channel. But now it's time for the tough part. We've detected that there is a problem, but what was the root cause? If you're root-causing a new problem, chances are you're going to look in log files. Experienced SREs will typically start by looking for rare and bad things, and then they'll look for clusters of these rare and bad events, especially when they span hosts and services. But this isn't always easy, and it can be a long, iterative process that relies on intuition, experience, and maybe a whole lot of other things. Hopefully, at the end of this process, which could take minutes or hours, you can find out what happened and uncover the root cause. Now, imagine if this whole process could be automated. The way it would work is you simply send your logs, without any training, setup, or rules, and you'd be able to see the root cause without any hunting. So with that, I'll show you what the machine learning found when I broke the microservices demo app that I showed you earlier. From the time I broke the app, and while the problem was happening, about 100,000 log events were generated. Without any rules, our ML distilled this down to just the seven events that you see on the screen. The goal of our ML is to help explain the root cause, and our measure of success is that it finds at least one root cause indicator. In this case, you can see it very clearly at the top, the very first line, in fact. Now, as an aside, let me tell you how I broke the app: I created a little C program called OOM Test. It essentially loops, allocating one-megabyte chunks of memory until it has exhausted all the memory. In this case, I ran it on the Kubernetes master. So what it did is it consumed all the memory and starved the whole Kubernetes cluster, running under Minikube on my laptop, of memory. And then you saw all the problems that I showed you when the app actually broke earlier. The very first line that prints out when it starts is a warning message that says "OOM Test starting", and you can see that as the first line in this root cause report. The cool thing is that it was picked up simply because it was a very rare, in fact, probably never-before-seen log event.
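As a rough sketch of what an OOM-test-style program like that might look like (the actual program from the demo wasn't published, so the names and messages here are assumptions): it prints a warning, then allocates and touches one-megabyte chunks until malloc fails or the kernel's OOM killer steps in and kills the process.

```c
/* Illustrative sketch only; not the program from the demo. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE (1024 * 1024) /* 1 MB per allocation */

int main(void) {
    fprintf(stderr, "WARNING: OOM Test starting\n");

    size_t allocated_mb = 0;
    for (;;) {
        char *chunk = malloc(CHUNK_SIZE);
        if (chunk == NULL) {
            fprintf(stderr, "malloc failed after %zu MB\n", allocated_mb);
            break;
        }
        /* Touch every byte so the pages are actually committed;
         * otherwise the kernel may never feel the memory pressure. */
        memset(chunk, 0xAA, CHUNK_SIZE);
        allocated_mb++;
    }
    return 0;
}
```

The memset is the important part of a test like this: without touching the pages, Linux may never actually commit the memory, and the node wouldn't be starved.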
But on its own, that warning event would have been completely harmless, except that it happened to correlate with a whole lot of other things. So as part of the summary that you see on the screen, just a couple of lines later it picked up a kernel message where the OOM killer was invoked and actually killed off OOM Test. You can see that in the third line on the screen. So again, that was picked up completely automatically. This is actually enough, in this case, to tell us the root cause. But we also try and show you the symptoms; in other words, what happened when the problem occurred. To see that, I'm going to click the related events button. That pulls in the surrounding errors and anomalies that hopefully will explain the problem, or the symptoms that occurred. And in fact, you can see here it's doing a really good job, because it's immediately pulling in a bunch of Kubernetes events, and you can see how all the other pods were impacted. You can see, as it moves through, that there are failed probes on the different services: the ad service, the cart service, the checkout service, the currency service, and so on. And if you go a little bit further down, you'll even see it picks up where Redis is impacted and restarts. So this is brought in automatically. Remember, I haven't hunted or searched for anything here; I just clicked the related events button. So let me go back to the core events. Because I'm also collecting Prometheus metrics in this case, you can see at the bottom that the machine learning tries to correlate any anomalous metrics with what it's picked up in the logs. And so it's pulled in a couple of stats that you see on the left: node_memory_Buffers_bytes and node_memory_Cached_bytes are highly relevant in this case. You can see them going from very high values and then dropping right down, presumably as the OOM killer killed off my rogue process. So these are really useful to corroborate a root cause report that you might be writing. Now, the final thing I'll show you is the coolest of them all. We actually take this log line summary, and remember, this is distilled down from a hundred thousand events that occurred at the time, and we pass it, with the right prompts, to the GPT-3 language model. If you look at the top left-hand side, you can see there's some plain-language text that's returned. In fact, in this case it says the system was running out of memory and the OOM killer was invoked, which is kind of a perfect summary that you could even paste in as the title of your post-mortem root cause report. This is an experimental feature, and the reason it is, is because we're still tweaking the way that we use GPT-3. In general, GPT-3 is only as good as the internet as a whole and all the text that it's been trained on. Now, in this case, it completely nailed it. There are times when it produces good English sentences that may not be completely relevant to the problem that you're seeing. But in general, we're seeing a lot of value from these summaries. And the last thing I'll point out is that the sentence you see there is truly a novel sentence. You can't actually find it anywhere on the internet; it was generated based on what we gave the GPT-3 model as a prompt. Now, with that, I'm going to hand it over to Aran Khanna. He's the co-founder and CEO of Reserved AI. He's been a fantastic customer of ours for about a year now. In fact, he tried one of our early beta versions, which was the first thing he saw, and he's continued to use the product since.
So thank you very much, Aran, and I'll hand it over to you.

Thank you so much for the awesome demo. So just to give a little bit of background on Reserved AI: I'm the co-founder and CEO, and what we enable customers like Zebrium, and a lot of other large customers running on the cloud across Azure and AWS, to do is proactively forecast and manage cloud resources in a completely automated way. We enable folks running across these very complex multi-cloud deployments to do things like commitment management, cost forecasting, and tax optimization, and, uniquely, we actually buy back over-committed resources from customers, essentially making a market. But it's really at a granular level, integrating with tons and tons of different APIs: over 300 APIs on the AWS side, 200 APIs in Azure, and a ton of different APIs coming out of the Kubernetes clusters that we're monitoring for cost and attribution. Now, obviously the installation is quite simple, but the backend software is very complex. And what we really found was that, given this wealth of data and the wealth of systems running in Kubernetes built on top of it, there were things constantly changing within the underlying primitives that we're essentially pulling from on the Kubernetes side, on the Azure side, and on the AWS side. And while we had this stack running, each component was generating tons and tons of logs. And when there was an error, often not even an error on our side, but an error on the vendor side, or even on the customer side as things like IAM roles changed, we were not able to very easily go in, get the actual root cause out the other end, and forward it to our engineering team, customer success team, or sales team, what have you. What that really meant was that our critical engineering resources were getting waylaid. As a startup, we like to move fast and build things on behalf of our customers, but our engineers were getting pulled away, once a week at the very least, to go through all of these different kinds of debugging procedures to find root causes. Even worse was the fact that a lot of these root causes went unnoticed, because in many cases, like the out-of-memory case, we took the tack of just throwing more resources at the problem, which is kind of ironic for a cost optimization company. A lot of them actually went unnoticed until the volumes exploded to the point where we really had to look at them. So that was the state of the world before. And when I heard about Zebrium, honestly, I was a little bit skeptical, and I think my engineering team was too. We use machine learning as well, but we use it in a much more staid way: predictive models, expected-value calculations, and risk modeling and market making on the backend. Those are all established things that folks on Wall Street, for example, have been doing for years. This was something new. So I was very, let's say, interested, but skeptical about whether this could replace the specific DevOps knowledge that was needed before to really go in and figure out what was going on, with this wealth of data streaming in and errors sporadically showing up. So this was something we decided to kick the tires on. We started the free trial with the Zebrium folks, installed it in our Kubernetes cluster, and it was pretty quick. Actually, I was able to do it as, you know, the semi-technical CEO, which was a testament to how easy it was.
I didn't even have to pull my CTO or my DevOps folks into the conversation. And literally in the first week, AWS had an API change. If you build on the long tail of AWS APIs, you'll know what I'm talking about: they'll just change shit all the time and not tell you about it if you're not on S3 or EC2, for example. And because we're built on that long tail and we have a number of systems there, this was actually a really important thing to catch, because had it not been caught, if a customer went to a certain page, it would have caused a complete error and, essentially, a service disruption. So this was what piqued my interest and said, hey, this starts to make sense. I think it's kind of working here; it's seeing an error that we wouldn't have caught if we weren't looking at the logs. And as we dug into the system, as Larry was showing before, we actually saw that the correlations and the root causes were really pointing to the exact system, to the exact pod in this massive array of different services that was causing the underlying error. So it actually led to a faster resolution on our side. At that point we were starting to buy in a bit. That was last year, essentially, as we were scaling up, and we've been running the system for over a year now. As we continued to run with it, we saw that the things being caught were consistent. It wasn't just a one-and-done; as we were building and seeing issues from our customer side and from the vendor side, we were consistently getting these reports in our Slack channel with Zebrium. And this is an example right here, where a customer actually had issues with their account because they were messing with an IAM role. We would have been basically unaware of this entirely until we got a complaint from the customer; that would have been the forcing function. But because of Zebrium, we got the Slack alert, we saw the customer was essentially messing with the role and had this big issue, and we were able to escalate to our customer success team proactively, which is fantastic as a business owner myself. I love when we can surprise and delight customers and get ahead of issues without them having to, you know, essentially fall on the sword and come and tell us that they screwed up. So we were really delighted by the fact that Zebrium was not only helping us with the steady-state operational pieces of our cloud infrastructure management, but really helping us surprise and delight our customers with the fact that we can get ahead of a lot of these issues in this complex environment, without our team having to build very sophisticated internal monitoring tools. This was very much plug and play. And this is a more recent thing, as Larry was showing, but usually when Zebrium sends an alert, I'm just shooting it along to my engineering team and saying, hey, go look at this. But now I can actually start with these NLP summaries that are coming out to figure out for myself, hey, what's going on? Do I need to just shoot it to my CTO and have him, you know, route it to the right person?
No, often I can actually understand, because of these natural language summaries, even as the CEO of the company, what the errors are, who is responsible, who the owner of that piece of infrastructure is, and have a much more targeted loop with them. And even now our dev team is starting to look at these, much more quickly route them to the right place, and easily understand the underlying root causes that we're seeing in the stream of errors that we get from Zebrium. And this is something that, you know, I thought was absolutely science fiction before I saw it live, because as you saw in the demo, the logs, to anyone who's a layman or even a sophisticated engineer, are kind of nonsense, right? They're not really well structured. So the fact that these natural language summaries could be generated with such high fidelity, and so often, was striking. I've not seen a lot of cases where they're wrong; they're very often spot on. That was something that really was a big draw for us to lean further into this system, because they seem to be making the impossible possible here, and it really delights our engineering teams and helps us delight our customers in different ways, either by reducing downtime through faster resolutions, or by helping them debug issues of implementation and integration on their end with our systems. So kind of a 360-degree view of the Zebrium product was really important for us to get over the year, to see how it could help us, as it developed, not only move faster on an engineering basis but really on a customer success basis as well. And I think that is something that I didn't even expect when we first integrated with the product, but I was obviously delighted to see as we moved down the path of integrating it further and further into our workflows.

Thanks very much, Aran. We really appreciate having you as a customer. Continuing on, there are two main ways that you can use our machine learning. The first is when you have an incident management tool in place like PagerDuty, Opsgenie, or Slack. In this case, we have built-in integrations, so when an incident is created in those tools, Zebrium automatically augments it with root cause. The way it works is you'll probably have some kind of monitoring or APM tool in place, and when that detects something, it'll open up an incident in, let's say, PagerDuty. Now with our integration, as soon as an incident is opened or created in PagerDuty, it'll send us a signal, which is number two in that diagram, and we'll respond with a root cause report very similar to what you saw in the UI demo. That will show up inside the PagerDuty incident, and you never actually have to leave PagerDuty. Everything is automated in this case, and you'll see everything you need inside PagerDuty without any log hunting. You can also use Zebrium without an incident management tool. In this case, when something breaks, you just look at the Zebrium root cause dashboard. We're always proactively scanning for patterns that make up a root cause, so all you need to do is click on the relevant one and you'll see a root cause report that helps you troubleshoot the problem. If you don't see a relevant one, all you need to do is click the blue "scan for root cause" button and enter a time. We'll treat that as a signal that something has gone wrong with your app, and we'll perform an on-demand scan for root cause around that time.
Now, because our machine learning is constantly scanning for root cause patterns, a lot of engineering teams also use us to proactively detect problems, especially the unknown unknowns, the ones that don't have detection rules. If you use it this way, we recommend feeding the incidents that we detect into a P3 queue and reviewing them when convenient. In real life, we've been able to proactively detect a large number and variety of problems that would otherwise have gone unnoticed and caused problems. Now let me explain how our machine learning works. I mentioned earlier that an experienced SRE, when trying to figure out root cause, would look for rare and bad events, and then for clusters of these rare and bad things, especially if they occur closely in time and across different services or hosts. Well, the machine learning does exactly the same thing, but to do this, it first needs to be able to accurately categorize all log events. So the first layer of our machine learning structures log events. This is done with unsupervised machine learning and doesn't require any manual training. Once events are structured, the machine learning can categorize them by type, and then it learns the patterns for each unique event type. The structuring we do underpins everything else that goes on in our platform. The next layer of machine learning is anomaly detection. There are lots of things that go into anomaly scoring, but the two big ones are events that are rare and events that are bad, like errors or high-severity alerts, criticals, and so on. As each new event comes in, we essentially give it an anomaly score; as an example, the rarer the event and the higher the severity, the more anomalous it would be. But it's important to remember that anomalies on their own can be very noisy. So now the magic: we take the anomalies and we look for clusters of abnormally correlated anomalies across log streams. This takes away the coincidental effect of having a few anomalies, and allows us to really pinpoint something that's actually gone wrong and to pull out the correct log lines that would help to explain root cause. Once we've done that, we can then look at the metric space and look for any stats that have correlated anomalies matching the times that we see in the log lines we've just pulled together. This is really cool, because you don't have to curate or tell us which metrics to look at. You can point all your metrics at us, and we'll pull in the anomalous ones that are correlated with what we find in the logs. And as you saw in the demo earlier, it's really effective at bringing in corroborating metrics. All of this is then wrapped up into an automatic root cause report that you can see either inside your incident management tool or inside the Zebrium UI. Now, our ML is able to detect a very broad range of root causes for a very broad range of problems. This chart shows some examples of things that it has managed to find root cause for. But it's important to remember that there were no rules or pre-definitions of these types of problems in the tool whatsoever; they were picked up because they exhibited the patterns you saw on my previous slide, which our machine learning detected: those clusters of anomalies across different log streams. The other important thing here is that this is not meant to be an exhaustive list of what we can detect. It's just a bunch of examples across some common categories that we've seen happen in real-life situations in our customer base.
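To make the "rare plus bad, clustered across streams" idea above a bit more concrete, here is a toy sketch of that kind of scoring and clustering. To be clear, this is not Zebrium's actual algorithm or code; the scoring formula, thresholds, and field names are all invented for illustration.

```c
/* Toy illustration: score events by rarity and severity, then flag
 * high-scoring events that land close in time on *different* streams. */
#include <stdio.h>
#include <math.h>

typedef struct {
    double timestamp;   /* seconds */
    int    type_count;  /* how often this event type has been seen before */
    int    severity;    /* 0=info, 1=warning, 2=error, 3=critical */
    int    stream_id;   /* which host or service emitted the event */
} LogEvent;

/* Rarer types and higher severities get higher scores (invented formula). */
static double anomaly_score(const LogEvent *e) {
    double rarity = 1.0 / log(2.0 + e->type_count);
    return rarity * (1.0 + e->severity);
}

int main(void) {
    LogEvent events[] = {
        { 10.0,    1, 1, 0 },  /* rare warning on stream 0       */
        { 10.5,    2, 3, 1 },  /* rare critical on stream 1      */
        { 11.0,    5, 2, 2 },  /* uncommon error on stream 2     */
        { 500.0, 9000, 2, 0 }, /* frequent error, likely noise   */
    };
    int n = sizeof(events) / sizeof(events[0]);
    double window = 60.0, threshold = 1.0;

    /* For each anomalous event, count other anomalous events from other
     * streams within the time window; isolated anomalies are ignored. */
    for (int i = 0; i < n; i++) {
        if (anomaly_score(&events[i]) < threshold) continue;
        int correlated = 0;
        for (int j = 0; j < n; j++) {
            if (j == i || events[j].stream_id == events[i].stream_id) continue;
            if (fabs(events[j].timestamp - events[i].timestamp) <= window &&
                anomaly_score(&events[j]) >= threshold)
                correlated++;
        }
        if (correlated > 0)
            printf("event at t=%.1f (stream %d) looks like part of an incident "
                   "(%d correlated anomalies)\n",
                   events[i].timestamp, events[i].stream_id, correlated);
    }
    return 0;
}
```

In this toy run, the three rare, severe events that land within the same one-minute window on different streams are flagged together, while the frequent error at the end scores too low to matter. That mirrors the intuition from the demo: the never-before-seen "OOM Test starting" warning only became interesting once the kernel OOM-killer messages showed up around it.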
Zebrium helps automatically uncover root cause without you having to go hunting through logs. And with that, we come to the conclusion of this webinar. Thank you very much for watching. We love getting feedback, so please drop us a line with any questions, comments, or suggestions. You're also welcome to sign up for a free trial and try it for yourself with your own data. Thank you very much.