Full stack observability is all the rage today. As businesses lean into digital, customer experience becomes ever more important. Why? Well, it's obvious. Fickle consumers can switch brands in the blink of an eye or the click of a mouse. Technology companies have sprung into action, and the observability space is getting pretty crowded in an effort to simplify the process of figuring out the root cause of application performance problems without an army of PhDs and lab coats, also known as endlessly digging through logs, for example. We see decades-old software companies that have traditionally done monitoring, log analytics, or application performance management stepping up their game. These established players typically have deep feature sets and sometimes purpose-built tools that attack one particular segment of the marketplace. And now they're pivoting through M&A and some organic development, trying to fill gaps in their portfolios. And then you've got all these new entrants coming to the market claiming end-to-end visibility across the so-called modern cloud and now edge-native stacks. Meanwhile, cloud players are gaining traction and participating through a combination of native tooling and strong ecosystems to address this problem. But recent survey research from ETR confirms our thesis that no one company has it all. Here's the thing. Customers just want to figure out the root causes as quickly and efficiently as possible. It's one thing to observe the stack end to end, but the question is, who is automating the observers? And that's why we're here today.

Hello, my name is Dave Vellante, and welcome to this special CUBE presentation where we dig into root cause analysis and specifically how one company, Zebrium, is using unsupervised machine learning to detect anomalies and pinpoint root causes, and delivering it as an automated service. And in this session, we have two deep dives.
First, we're going to dig into this exciting new field of RCaaS, root cause as a service, with two of the founders and technical experts behind Zebrium. And then we bring in two technical experts from Cisco, an early Zebrium customer who ran a POC with Zebrium's service, automating and identifying root cause problems within four very well-established and well-known Cisco product lines, including WebEx client and UCS. I was pretty amazed at the results and I think you'll be impressed as well. So thanks for being here, let's get started. With me right now is Larry Lancaster, who's a founder and CTO of Zebrium, and he's joined by Rod Bagg, who's the founder and vice president of engineering at the company. Gents, welcome, thanks for coming on.

Thanks. Great to be here.

All right, Rod, talk to me. Talk to me about software downtime, what root cause means, all the buzzwords in your domain, MTTR and SLO. What do we need to know?

Yeah, I mean, it's like you said. It's extremely important to our customers and to most businesses out there to drive uptime and avoid as much downtime as possible. So when you think about it, all of these businesses, most companies nowadays, either their product is software and it's running on the web and that's how you get to it, point and click, or their business depends on internal systems to drive the business and to run it. When that is down, that is hugely impacting to them. So if you take a look way back, 20, 30 years ago, software was simple. There wasn't much to it. It was pretty monolithic, and maybe it took a couple of people to maintain it and keep it running. There wasn't really anything complicated about it. It was a single-tenant piece of software. Today's software is so complicated, often running maybe hundreds of services to actually implement what that software is doing.
So as you point out, enter the observability space and the tools that are now in use to help monitor that software and make sure that when something goes wrong, they know about it. But there's kind of an interesting stat around the observability space. When you look at observability through the lens of the cost of downtime, it's really interesting. Observability tools are about a $20 billion market. But the cost of downtime, even with that in place, is still hundreds of billions of dollars. So you're not taking much of a bite out of what the real problem is. You have to solve root cause and get to that fast. It's all great to know that something went wrong, but you've got to know why. And it's our contention here that when you take a look at the observability space, you have metrics, and that's a great tool. I mean, there's lots of great tools out there around metrics monitoring that are going to tell you when something went wrong. Very rarely is it going to tell you why. Similarly for tracing, it's going to point you to where the issue is. It's going to take you through that stack and probably pinpoint where it's happening or where something is running slow, potentially. So that's great. But again, the root cause of why it's happening is going to be buried in log files. And I can expand on that a little bit more, but when you're a software developer and you're writing your software, those log files are a wealth of information. It's just a set of breadcrumbs that are littered with facts about how the software is behaving and why it's doing what it's doing or why it went wrong. And it's that that really gets you to the root cause very fast. And that's our contention: these software systems are so complex nowadays, and the root cause is lying in those logs. So how do you get there fast? We would contend that you'd better automate that or you're just doomed to failure. And that's where we come in. Getting to that.
Thank you, Rod. You know, it's interesting. You talk about the $20 billion market. There's an analogy with security, right? We spend $80 to $100 billion a year on securing our infrastructure and yet we lose probably closer to a trillion dollars a year in breaches. And there's a similar analogy here. $20 billion could be 5x in downtime impacts or more. Okay, let's go to Larry. Tell us a little bit more about Zebrium. I'm always interested in asking a founder why you started the company. Rod touched on that a little bit. You guys have invented this concept of RCaaS. What does it mean? What problems does it solve and how does it solve the problem? Let's get into it.

Yeah, hey, thanks, Dave. So I think when you said, you know, who's automating the observer, that's a great way to think about it, because what observability really means is it's a property of a system that means you can see into it. You can observe the internal state and that makes it easier to troubleshoot, right? But the problem is if it's too complicated, you just push the bottleneck up to your eyeball. There's only so much a person can filter through manually, right? And I love the way you put that. So that's a great way to think about it: automating the observer. Now, of course it means that, you know, you reduce your MTTR, you meet your service level objectives, all that stuff, you improve customer experience. That's all true. But it's important to step back and realize we have cracked a real nut here. People have been trying to figure out how to automate this part of the troubleshooting experience, this human part of finding the root cause indicators, for a long time. And until Zebrium came along, I would argue no one's really done it right. So, you know, I think it's also important, as we step back, we can probably look forward five to 10 years and say, everyone's going to look back and say, how did we do all this manually?
You're going to see the sort of last mile of observability and troubleshooting is going to be automated everywhere. Because otherwise, you know, people are just not going to be able to scale their business. So I think one more thing that's important to point out is, you know, it's one thing to have the technology, but we've learned we need to deliver it right where people are today. You can't just expect people to dive into a new tool. So if you look at Zebrium, you'll put us on your dashboard and we don't care what kind of a dashboard it is. It could be, you know, Datadog, New Relic, Elastic, Dynatrace, Grafana, AppDynamics, ScienceLogic, we don't care. You know, they're all our friends. So we're more interested in getting to that root cause than trying to fight, you know, these incumbents and all that stuff.

Yeah, so interesting. Again, another analogy I think about, you know, you talked about automation, we're gonna look back and say we're never gonna do this again. It's like provisioning LUNs. Nobody provisions LUNs anymore. It's all automated. So, Larry, staying with you. The skeptic in me says, this sounds amazing, but, you know, it might be too good to be true. Tell us how it works.

Yeah, so that's interesting. So, Cisco came along and they were equally skeptical. So what they did was they took a couple of months and they did a very detailed study. And they got together 192 incidents across four product lines where they knew that the root cause was in the logs and they knew what that root cause was, because they had had their best engineers, you know, work on those cases and take detailed notes of the incidents that had taken place. And so they ran that data through the Zebrium software. And what they found was that in more than 95% of those incidents, Zebrium reflected the correct root cause indicators at the correct time. Like, that blew us away.
When we saw that kind of evidence, Dave, I have to tell you, everyone was just jumping up and down. It was like the Apollo Command Center, you know, when they finally touched down on the moon kind of thing. So, you know, it's a really exciting point in time to be at the company, just seeing everything finally being proven out according to this vision. I'm going to tell you one more story, which is actually one of my favorites, because we got a chance to work with Seagate Lyve Cloud. So they're, you know, a hyper-modern SaaS business, they're an S3 competitor. Zoom has their files stored on Lyve Cloud, to let you know who they are. So essentially, what happened was they were in alpha, early access, and they had an outage, and it was pretty bad. I mean, it went on for longer than a day, actually, before they were completely restored. And fortunately for them, it was early access. So no one was expecting, you know, uptime, service level objectives and so on. But they were scared because they realized that if something like this happens in production, you know, they're screwed. So what they did was they did some research, they saw Zebrium, they went into a staging environment and recreated the exact incident that they had had. And what they saw was immediately, Zebrium pops up a root cause report that tells them exactly the root cause that they took over a day to find. These are the kind of stories that let us know we're on to something transformational.

Yeah, that's great. I mean, you guys are jumping up and down. I'm sure we're gonna hear from Cisco later. I bet you they were jumping up and down too, because they didn't have to do all that heavy lifting anymore. So, Rod, Larry's just sort of implying, or actually you guys both talked about, that you're tool agnostic. So how does one actually use the service? How do I deploy it?

Yeah, so let me step back. So, when we talk about logs, right?
Like, you know, all these breadcrumbs being in logs and everything else. So they are a great wealth of information, but people hate dealing with them. I mean, they hate having to go in and figure out what log to look at. In fact, we've heard from several of our customers that, prior to using Zebrium, when they have some issue and they know there's something wrong, something on their dashboard has told them that something's wrong, maybe a metric has taken a blip or something's happened so that they know there's a problem. We've heard from them that it can take a number of hours just to get to the right set of logs. Figuring out, over these hundreds of services, where the logs are, getting to them, maybe searching in a log manager, just to get into the right context can take hours. So that's obviously the problem we solve, but we don't want them just looking at logs. I mean, we don't wanna put them back in the thing they don't like doing, because people don't do that, they don't like doing it. So we put it up on the dashboard. So if something is going wrong with your metrics and that's the indicator, or maybe it's something with tracing that you're sort of digging through now that you know something's wrong, we will be right on that same dashboard. So we're deployed as a SaaS service. You send us your logs, you click on one of our integrations, and we integrate with all these tools that Larry's talked about. And when we detect anything that is a root cause report, it will show up on your dashboard in the same timeline as those blips in your metrics. So when you see something going wrong and you know there's an issue, take a look at the portion of your dashboard that is us and we're gonna tell you why. We're gonna get you to the why of what went wrong.
No other work to be done. You can also click down and click through to us so that you end up in our portal if you wanna do some more digging around, if you need to, or whatever, maybe to get some context, what have you. But it's rare that you ever need to do that. The answer should be right there on your dashboard. And that's how we expect people to use it. We don't want them digging in logs and going through things. We want it to be right in their workflow.

Great. Thank you, Larry. So Rod, we talked about Cisco, and we're gonna hear more from them in a moment, and Seagate. I would think this is like a perfect solution for a SaaS provider. Anybody doing AIOps? Do you have some examples of those types of firms leaning into this?

Yeah, a couple of great ones. I mean, we've got many of them, but a couple that I'll touch on. We have an actual AIOps company that was looking for some complementary technology and so on. And so they decided to just put us through our paces by having one of their own SREs sign up for our service and our SaaS environment and send the logs from their system to us and just see how we did. So it turned out we ended up talking back to this SRE like a week after he had installed the product, or signed up and started sending us logs. And he was hemming and hawing, saying that he was busy like every SRE is and that he hadn't had a chance to really do much with us yet. And we're just having this conversation on the phone and he comes to tell us that, yeah, I've been busy because we had this terrible outage like five days ago. We said, okay, did you actually look on the Zebrium dashboard? And he goes, you know what? I didn't even think to do it yet. I've just been so busy and frazzled. So we have an integration with that company. He hadn't put that integration in, so it wasn't in his dashboard yet, but it was certainly on ours.
So he went there and he looks at the time range of when he had had this incident, and right at the very top of the page on our portal was the incident with the root cause. And he was flabbergasted. It literally would have saved him hours and hours and hours. They had this issue going on for over 24 hours and we had the answer right there in five minutes. I mean, it was crazy. And we get that kind of story. It's just like the Seagate one. If you use us and you have a problem, we're going to detect it. And you're going to hear from Cisco how successful we are at detecting things. I mean, it'll be there when you have a problem. In SaaS companies, one of our customers is Archera. They do cost optimization for cloud properties, for AWS, Google Cloud and so on. They use our software and they have a lot of interaction, obviously, with these cloud vendors and the APIs of those cloud vendors. So in order to figure out your costing at AWS, they're using all those APIs. So it turned out they had some issue where their services were breaking, and we had that root cause report right on the screen, again within five minutes, that was pointing to an API problem with Google. They had changed one of their APIs and Archera was not aware of it. So their stuff was breaking because of a change downstream that we had caught. And I'll just tell you one last one, because it's somewhat related to one of these cloud vendors, a big cloud vendor who had an outage a couple of months ago. And it's interesting because a lot of our customers will set up shared Slack channels with us where we're monitoring, or seeing their incidents as well as they are. So we get a little Slack representation of the incident that we detected for them, or the root cause that we detected for them, and that's in a shared community channel.
So we could see this happening. When that AWS outage happened, we could see our customers getting impacted by it, and the root cause of what was going on there in AWS that was impacting our customers was showing up in our incidents. Now, we obviously didn't have the very root cause of what was going on in AWS per se, but we were getting to the root cause of why our customers' applications were failing, and that was because of issues going on at AWS.

Interesting. I mean, I think one of your biggest challenges is going to be getting people's attention, because these SREs are so busy, their hair's on fire.

I tell you, if it gets their attention, they love it. I mean, this AIOps company, I didn't even tell you the punchline there, but they had this incident that occurred that we found, and quite literally the next week they ended up signing up as a paid customer.

That's great. And Larry, I'll give you the last word. I mean, Rod was talking about changes in APIs, and there's still a lot of scripts out there. You guys, if I understand it correctly, run both as a service in the cloud and you can run on-prem, which is important because there's a lot of sensitive information in logs that people don't want to leave their premises. But close it out here.

Yeah, I mean, that's right. You can run it on-prem just like we run it in our cloud. You can run it in your cloud or on your own infrastructure. Now, that's all true. You know, I think the one hurdle that we have left as a company is getting the word out and getting people to believe that this is actually possible and try it for themselves. If you don't believe it, do a POC, try it yourself. And, you know, people have become so jaded by the lack of real innovation in the software industry for the last 10 years that it's hard to get people to, but guys, you gotta give it a shot. I'm telling you right now, it works, and you'll hear more about that from one of our customers in a minute.
But guys, thanks so much, great story. Really appreciate you sharing.

Yeah, appreciate the time.

Okay, in a moment, we're going to hear from Cisco, who is the customer in this case example and a company that has quite an impressive suite of observability tooling. They've done a pretty compelling proof of concept with Zebrium using real data on some Cisco products that you've heard of, like WebEx, so stay tuned and learn about how you can really take advantage of this new technology called Root Cause as a Service. You're watching theCUBE, the leader in enterprise and emerging tech coverage.