Also, thank you for that wonderful introduction, since I don't have to spend a lot of time introducing myself. But yeah, let me quickly share my screen and get this train started. All right. So I hope everyone's having a fantastic day so far. Don't worry, you have not stumbled onto the wrong stream or into the wrong universe. I'm still very much going to be talking about cloud-native observability, logging, and metrics, even though there's a picture of logs here. Because I really have no idea about logs in the real world, like wooden logs, and anomaly detection on something I have zero knowledge about is a little difficult. As I was introduced earlier on, I'm a senior technical evangelist at SUSE. But before my pivot into the cloud-native ecosystem, I was a systems administrator, and dear sweet lord, how much I hated outages and being on call in general. I would be suspicious if anyone actually liked being on call or liked outages happening, because it's literally like having an axe hanging over your neck, whether you're on the support side or the development side when it happens. Jokes apart, though, my experience prior to cloud native pretty much set the tone for how I ventured into cloud native as well. I stumbled onto the chaos engineering and observability space way before I actually started dabbling with Kubernetes as a project. And speaking of Kubernetes, I am one of the docs maintainers there. So that's a bit about me, over and above what was introduced earlier on.

Coming to the agenda for today: I know I said we're not going to be talking about wooden logs, and that's just the tip of the iceberg when it comes to setting the context. We have to go a little deeper to understand what exactly we're going to talk about and why OPNI is relevant in that context. We also need to know what problem it actually solves, because there are some 170-plus open-source projects on the cloud-native landscape already, so why do you need another one? That's the second section. Then we will walk through the steps of installing OPNI on Rancher Desktop. This method has its own pitfalls, which we shall cover in that section as well; if you are experimenting beyond a POC or something, I would recommend that you install it directly on Rancher instead of on Rancher Desktop. Then we'll go through what the OPNI dashboard and the admin UI look like, and we'll see a couple of cases where things could actually have been recognized and flagged ahead of time, where these anomalies crop up, and how they are displayed on the admin UI. And last, we will delve into the project roadmap. I forgot to add the sections for resources and the thank-you, but I guess those are pretty self-explanatory. So without further ado, let's dive right in.

Now, I've been talking for around five-ish minutes, maybe less, but I still think it's very safe to say that outages really aren't that great of an experience. And they truly suck if they happen when you're on call, when you're working. If outages were an experience listed on TripAdvisor or any of those review websites, I would 100% not recommend them. But unfortunately, as an industry, they're part and parcel of the job, because, like their creators, software is also fallible.
Going back to my days as a sysadmin, I can say that you could pretty much decipher what is going wrong with a piece of software, or what is going on inside it, by looking at the logs, because they are the way your software communicates with you. We obviously have metrics and traces too, I'm not going to deny that. But reading logs is one of the best skills you can have in the software industry, because they contain such rich information that you can possibly catch problems before they actually occur; that warning message is almost always there. Unfortunately, logs aren't the easiest when it comes to readability and accessibility, and that's also the reason logs aren't preferred for debugging issues. Ask anybody: people will prefer a dashboard to a log any day, any time.

But what if we could make that experience simpler? What if we could actually have the logs sifted through for us? Because grepping is a great idea, but if you have tens of thousands of log lines, grepping is a very tedious task. Take it from me, I've done it. What if we could simplify this whole experience and enable users to read logs and capture issues before they actually become issues?
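To make that concrete, this is roughly the manual baseline we're talking about automating away: grep-style sifting of raw logs for warning signs. A minimal sketch in Python; the file name and the patterns are purely illustrative.

```python
import re
import sys

# Scan a log file for lines that look like trouble. Fine for a few
# hundred lines; tedious and error-prone at tens of thousands.
pattern = re.compile(r"\b(WARN|ERROR|FATAL|timeout|refused)\b", re.IGNORECASE)

path = sys.argv[1] if len(sys.argv) > 1 else "app.log"
with open(path) as f:
    for lineno, line in enumerate(f, start=1):
        if pattern.search(line):
            print(f"{lineno}: {line.rstrip()}")
```

The point isn't the regex; it's that somebody has to know in advance which patterns matter, which is exactly the knowledge log anomaly detection tries to learn for you.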
So that is what OPNI does, or at least aims to do; OPNI hasn't reached GA yet, and we're still building out our capabilities. It's a tool in the cloud-native landscape for multi-cluster, multi-tenant observability, built atop Kubernetes. Now, when we talk about observability in the cloud-native space, like I said before, there are a ton of tools. I'm not even going to pull up the CNCF landscape, because I do that in every presentation and I can't read the landscape on the screen myself, so I think it's pretty useless to put it here. But you get the gist: the landscape is growing at a fast pace, and you don't need just another tool to come along and provide the same capabilities.

So what are the capabilities that help OPNI stand out? The tooling you see on the landscape falls into a few subsections: some tools visualize, some store your metrics or your data, some aggregate your data, and some collect it. But so far there isn't a complete package that combines all of these things. Add to that the fact that logging is not something you really account for when you talk about all this; logging is still very much the neglected child of the observability domain, largely overlooked because of the issues I mentioned before. OPNI aims to fill this gap by being the complete package, if I may say so. It does that by creating and managing your backends, your agents, and your SLOs, and it also manages all the data associated with your logging, metrics, and tracing. The cherry on top is that it comes built in with AIOps. Now, I know AI and ML are very intimidating topics to a lot of us here. I am personally very intimidated when somebody tells me they're learning machine learning, because I assume that person is a genius; even if you just talk about the very basics of AI/ML, my inference is that you must be a genius. But with OPNI, the AIOps comes built in.

And as we shall see in the next couple of slides, you don't actually have to have knowledge of AI or ML to work with this. It comes pre-trained for the Kubernetes control plane, Rancher clusters, and Longhorn. We plan to eventually add metric anomaly detection and root cause detection as well, because log anomaly detection, we realize, is just one part of the puzzle. So there is a lot going on here, and that is why it's important to understand how it is different. First things first, it's open source. A lot of tools in the AIOps space, if you go look, are proprietary; I think we are one of the very first projects in this space that is open source. Like I said before, no knowledge of AI or ML is required of the person operating it. The models more or less train themselves: you're happy, they're happy, everyone's happy. We've designed it so that it does not require a huge volume of logs to get started, and it comes with pre-trained models for those specific distributions. The aim is to eventually incorporate ensemble variations of these and other distributions, so we can serve literally every user of the project.

Now, that all sounds great, and I know I just sounded very marketing-y, so people will be curious how this all translates into technical terms. As aforementioned, it's built on top of Kubernetes, and the two main components, as with pretty much everything that manages Kubernetes, are your upstream and your downstream clusters. Your upstream cluster runs a gateway component, and your downstream cluster runs an OPNI agent component. The gateway is one of the main components of OPNI in the upstream cluster, and in the installation as a whole, because it's very powerful owing to the number of servers it comprises. When you access OPNI via the CLI or via the dashboard, you are going through the management server endpoint. When the gateway connects to the agents, you're talking about the agent establishing a bidirectional stream at the gRPC server endpoint. There are four API servers in total, and they are responsible for a lot of OPNI's workload. From a user-workflow perspective, what OPNI essentially gives you is a CLI or a dashboard via which you can access OPNI, inject faults, and figure out where an anomaly is cropping up. The dashboard is a great way to visualize this, as we shall see in the demo.
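To give a feel for that hub-and-spoke topology, here is a toy model of it in Python: one long-lived "gateway" endpoint that downstream "agents" stream log batches to. To be clear, this is not OPNI's actual implementation (which is Go and gRPC); the framing and names here are purely illustrative.

```python
import asyncio
import json

async def gateway(reader, writer):
    # Upstream side: receive newline-delimited JSON batches from an agent.
    # In OPNI, the gateway would hand these off to the logging backend.
    while line := await reader.readline():
        batch = json.loads(line)
        print(f"gateway: {len(batch['logs'])} log lines from {batch['cluster']}")
    writer.close()

async def agent(cluster_id):
    # Downstream side: open one long-lived connection, stream batches up.
    reader, writer = await asyncio.open_connection("127.0.0.1", 8888)
    for batch in (["service started"], ["WARN etcd slow", "request ok"]):
        writer.write(json.dumps({"cluster": cluster_id, "logs": batch}).encode() + b"\n")
        await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(gateway, "127.0.0.1", 8888)
    await agent("downstream-1")
    await asyncio.sleep(0.1)  # let the gateway drain the stream
    server.close()
    await server.wait_closed()

asyncio.run(main())
```

One thing this toy model doesn't capture: the real stream between gateway and agent is bidirectional, as mentioned above.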
Now, of course, this is an oversimplification, because API servers alone don't cut it, right? If everything were just API servers it would be really cool, but somebody has to do the actual overhead work of collecting all the data, storing it, and so on. So let me zoom in on the architecture. Your upstream OPNI cluster sits on a Kubernetes cluster; it's depicted as plain Kubernetes here, but like I said before, it works best with Rancher. When you install OPNI for the very first time, it installs the admin UI and the OPNI gateway. The admin UI is used to create and manage backends and SLOs, while the gateway is used to establish communication with the downstream agents. On the backend side of things, you have the logging and monitoring backends. The monitoring backend is powered by Cortex and Grafana deployments, while OPNI leverages OpenSearch for its logging backend, which makes it easier to search, visualize, and analyze logs from the Kubernetes control plane, from Rancher, and from other places. Although it's not shown here, there's also an alerting backend, which comprises Alertmanager deployed as a fully managed StatefulSet, with the management done by OPNI.

But what about the agent? Currently, you need to install the OPNI agent separately on the downstream cluster, either via the Rancher UI or via Helm. The recommendation is to do it via the Rancher UI, which is why I said the demo with Rancher Desktop would be a walkthrough. And once you've installed the OPNI agent on the downstream cluster, you also need to ensure that the corresponding backends are enabled so that the entire setup works.

Right. But what about AIOps? I haven't come to the star of the show yet. We've been talking about the architecture, and I started this whole conversation with "oh my God, logs are so difficult to read; we need more things that help us read logs." As you shall see in the second demo, there are two ways we leverage AI/ML for log anomaly detection. One is using pre-trained models, and those pre-trained models are available for certain distributions; if you're looking to start, just like me, at level minus 50, this is a great place to start. But if you have a GPU or two and you want to dive right in at the deep end, you can do that too: you are able to self-train models on your own workload logs, if you have a GPU to spare.

Coming to the AIOps part of it again: the machine learning method used is Drain, which is one of the most popular log-parsing methods out there. It learns from incoming log messages, which also enables it to detect changes in the environment very quickly. We've adapted Drain to act as an anomaly detector for logs, and it also controls the trigger for our deep learning model; I'll show a tiny sketch of this idea right after the outage example. The deep learning method we use is NuLog, which is based on a popular paper; the resources are linked at the end, so don't worry. This one requires a GPU. It's a sequence-to-sequence model that learns semantic, contextual information from your log messages. It's extremely accurate, but it typically needs a large amount of data, or at least a steady state, to start with. With the OPNI project we wanted to make it a little simpler for people to get started, and that's why we've designed Drain to kick off the deep learning only once a steady state is achieved. So Drain is kind of like the manager in this scenario.

Now, this is a use case in the wild, from a real K3s outage that actually happened. We had our support team work on this outage, and I can definitely say the humans took a lot more time than OPNI did to figure out where the issue was and what the root cause was. Out of the roughly 45 minutes OPNI spent on it, I think 30 minutes went into training the model itself. Also, I'm not sure if you can see it, but there's a spike here which shows there was an issue at around 10:29 a.m. The graph is not that clear, I realize, but at around 10:29 a.m. the issue actually cropped up in the anomaly insights and in the logs, while the reported issue came in at 10:31. So the logs actually forewarned that there was going to be an issue. Had this been deployed, had it been known through these insights that this was coming, and had you set the appropriate notifications to be sent out, the issue could have been, I won't say avoided, because that would be the extreme ideal case, but at least the turnaround time would have been better. Honestly speaking, with better sophistication those numbers would have seen a drastic decrease, and obviously less manual overhead would have been incurred in the entire process.
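Here is the tiny sketch of the Drain idea I promised, using the open-source drain3 Python library, which implements the Drain algorithm. The core intuition: mine log templates online, and treat a line that opens a brand-new template cluster as a candidate anomaly. This is just the intuition, not OPNI's actual integration, and the example log lines are made up.

```python
# pip install drain3
from drain3 import TemplateMiner

miner = TemplateMiner()

logs = [
    "connected to etcd at 10.0.0.1:2379",
    "connected to etcd at 10.0.0.2:2379",
    "connected to etcd at 10.0.0.3:2379",
    "leader election lost, stepping down",  # a pattern never seen before
]

for line in logs:
    result = miner.add_log_message(line)
    # "cluster_created" means no existing template matched this line.
    if result["change_type"] == "cluster_created":
        print(f"new pattern: {result['template_mined']}")
```

Note that during warm-up every pattern is new, which is exactly why OPNI waits for a steady state before treating novelty as signal and before kicking off the deep learning model.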
Now we shall look at installing OPNI on Rancher Desktop. The pitfall of this method is that Grafana doesn't work as well here, but I can share the steps just to give you a feel for how it works. First up, we need cert-manager installed; before which, of course, the very first step is to install Rancher Desktop itself. I shouldn't have assumed that, but that's my folly: you need Rancher Desktop installed first. Once that's done, you install cert-manager, then you customize your installation. There's a sample values.yaml file available in our docs at opni.io; you customize it by adding the hostname, selecting the authentication provider, and saving. Then you add the OPNI Helm repository. This is all fairly simple. Once that's done, you install the OPNI CRD chart, and then you install OPNI with the values.yaml you customized earlier. I'll recap these steps in command form right after the demo. So that installs OPNI on Rancher Desktop. Like I said, it's just an installation to start off with; I wouldn't recommend it even as a POC to demonstrate the value proposition, because OPNI works best with Rancher itself. So that's one thing.

And yes, the demo. I'm just going to stop sharing here, and I'll quickly show you the insights; I realize I'm running out of time, so just a second. So this is what your OPNI dashboard looks like, and you can go directly to the visualizations. As you can see, these are the various anomalies reported, broken down by Kubernetes component, and you can also visualize the control plane logs and the Rancher logs breakdowns. I'll show it to you here. If you look at this, it shows you all of that on the dashboard side of things. You also have the Discover feature, where you can see the individual log entries. Typically you would not really look at these log entries to figure out what could go wrong, but they are extremely valuable when an actual outage occurs, and it would be even better if it never came to that, if you could read logs better and actually see what is being written in them ahead of time. So here you can see all of this visualized graphically, and if you need help setting anything up, please do reach out. I think I have very little time left to show you the rest of the stuff here.
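Before moving on, here's that recap of the installation walkthrough in command form. Treat it as a sketch: the cert-manager version, the charts repository URL, and the opni namespace are assumptions on my part, so check the docs at opni.io for the exact values.

```bash
# 0. Install Rancher Desktop first (https://rancherdesktop.io).

# 1. Install cert-manager (version is an example; pick a current release).
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# 2. Add the OPNI Helm repository (URL is a placeholder; see the docs).
helm repo add opni <opni-charts-repo-url>
helm repo update

# 3. Install the OPNI CRD chart, then OPNI itself with your customized
#    values.yaml (hostname and authentication provider set).
helm install opni-crd opni/opni-crd -n opni --create-namespace
helm install opni opni/opni -n opni -f values.yaml
```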
But coming back to the project roadmap, let me pull that up again, just a second. Right, so the roadmap is to have a managed OpenTelemetry collector for logs, metrics, and traces. Some of that work is already done, but most of it will be completed by the time KubeCon is in full swing, I guess. We also plan to implement OpenSearch as a vector store for AI/ML applications, plus a chatbot powered by large language models. All of the items I've just discussed are visible to everyone on the project board, so please feel free to check it out. And these are the resources I mentioned, those papers and everything else. I've also included a link to the docs and to the Slack channel, so if you have any issues setting things up, or any other questions, please feel free to join the Rancher Slack and reach out to us. We'd be happy to help you. But that's it for me, and thank you so much to the entire KCD organizing team for inviting me here. Thank you.