Good afternoon. Thank you for coming to this session. My name is Larry Krovalo. I'm filling in for Diane Mueller and another speaker, but I'm just going to introduce Vinothini Raju, who is the winner of the Women in B2B Tech Award last year. She's going to present a fantastic slide deck and a demo on Gen AI in operations. With that, Vinothini will introduce herself, the slides, and the concept, and I'm sure you're going to leave here with a lot of good information on how to use Gen AI in operations. Over to you.

Thank you so much, Larry, for a great introduction. Hello everyone, very good afternoon. Looks like we have a full room here, so excited to see you all, and I'll try my best to live up to expectations. So what are we going to talk about today? Gen AI, wow, that's a happening thing now, and we are all in the Kubernetes space, so I'm trying to put these two technologies together. I'm going to be talking about using Gen AI to bring precision and efficiency to Kubernetes operations. That's a pretty loaded topic, so I have created a short agenda to give you a heads-up on what I'll be covering over the next 20 minutes. I'll do a short introduction to the Kubernetes landscape and Gen AI, talk about some of the challenges and the solutions available today, propose an AI governance framework for IT operations, walk you through a troubleshooting use case with a simple Kubernetes issue, and then show you a complete solution demo of how this whole thing fits together, followed by a quick Q&A.

Just a quick word about who we are. We are a low-code platform company. We are building an IDE for Kubernetes, making things simpler for Kubernetes users. We are headquartered in Bangalore, but I live in Canada, and we have a distributed team.
We started off as a DevOps consulting firm offering CI/CD automation for retail customers across India and the US, and in parallel we contributed to and maintained a project called Configurator, which helps you version-control ConfigMaps and Secrets on Kubernetes. Since 2017 we have been a platform company, and we are on different marketplaces thanks to all our partners, with about 1,600-plus downloads across those marketplace distributions.

Let's dive in: AI for Kubernetes, what's the necessity, and how can AI help? This is a slightly outdated survey conducted by CNCF in 2022, but the problem still remains. Training is one of the strongest inhibitors to Kubernetes adoption, and even once we are on Kubernetes, troubleshooting and maintenance are really complex. Upskilling is difficult. At the same time, we are seeing a huge explosion of AI solutions, and we can readily see the value they bring to the table. So it's a great opportunity for us to put these two things together and find the value immediately.

We did a short survey last year across the globe, with a lot of platform engineering and DevOps folks participating, and we got about 48 responses on where AI can contribute to solving Kubernetes issues. We can clearly make out that validating configurations, troubleshooting failures, monitoring and detecting anomalies, and detecting security issues are the four main areas where AI can add value. But then, what is preventing us from doing that? Before we get into that, I really want to know: how many of you are evaluating AI in your operations? Okay. And how many of you are facing challenges in your organizations due to compliance issues? Okay. So, is Gen AI really suitable for IT operations? That's a big question. I came across a very interesting quote from Santosh Vempala, a computer science professor from Georgia Tech.
A language model is just a probabilistic model of the world. So it's not deterministic, which means it hallucinates; it can give you false positives. It's also not real-time, so we may not be able to get all the real-time information that is available out there. And we cannot make decisions based on what the AI responds with, because there is a lack of accountability; we need constant supervision of what happens with the AI and how we consume it.

The first part of the problem, I think, we can address through things we can outright control: for instance, the prompt engineering, the models that we choose or build, and the temperature, which decides how creative the AI can become. Over these attributes we have better control. We can also implement something called retrieval-augmented generation, or RAG, and use SERP APIs which can search Google and give you real-time data; by creating a pipeline with these technologies, we can bring in some real-time and domain-specific information as well. I'll walk you through this RAG setup in a while, but this problem can be solved to an extent.

Most of the time we talk about hallucination and getting real-time data, but we often miss the last problem, which is very critical for operations: the lack of accountability. We need strong AI governance if we are to consume AI within our IT operations. We cannot just let AI make the decisions for us.

Let's go over each of these problems, and I'm going to give you a very simple example. I could have chosen Kubernetes issues, but I'll do that a little later; here I want to show pictorially what it signifies. For instance, I used ChatGPT to create a flow chart and it gave me some random image. Then I changed the prompt to create a line diagram, and it showed something different, but still not up to expectations.
In the last one, I changed the prompt again with a little more context about what I really expect from the AI, and this time it produced a meaningful outcome. Compare this with Llama 2 7B: across all these attempts it was not able to generate anything meaningful. But with a fine-tuned Llama 2, along with a proper template, we were able to get a better response. What this signifies is that we need a good prompt and a fine-tuned model that is very domain-specific.

So we have seen the model and the prompt; now let's move on to governance. I'm proposing a governance framework for AI that gives us a certain level of comfort in consuming it in our operations. First, in the Kubernetes landscape we have so many third-party tools, and we have to collect context across these tools in order to put forth a meaningful prompt, so there is a cognitive overload in generating that context. Second, before we share anything with the AI, we have to redact the sensitive information. Third, and I haven't come across many solutions that do this, the user should get to review and approve anything that goes to the AI. This is a very important step, so we have better control over what we actually send to the AI. Fourth, once we get the response from the AI, we have to tag those responses to say they were generated by AI and need to be used with discretion. The final step is to maintain a complete history of all the inputs and responses, so we can do an audit at a later point in time if we ever have to look at things.

To summarize what we have seen: the prompt, the models, and the workflow, or governance. Now I'm going to walk through a simple use case of how we go about troubleshooting an issue in Kubernetes and how an AI solution can help.
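The governance steps just outlined (redact, review and approve, tag AI output, keep an audit history) can be sketched as a small wrapper around any chat model. This is a minimal toy sketch, not the product's implementation; the names (`GovernedSession`, `redact`) and the secret-matching regex are assumptions for illustration only.

```python
import re
from datetime import datetime, timezone

# Toy redaction rule; a real deployment would use far richer patterns.
SENSITIVE = re.compile(r"(?i)(password|token|secret)(\s*[:=]\s*)(\S+)")

def redact(text: str) -> str:
    """Step 2: mask sensitive values before anything leaves your environment."""
    return SENSITIVE.sub(r"\1\2[REDACTED]", text)

class GovernedSession:
    """Steps 3 to 5: require human approval, tag AI output, keep an audit trail."""

    def __init__(self, ask_model):
        self.ask_model = ask_model   # any callable mapping prompt -> response
        self.audit_log = []          # step 5: complete history of inputs/outputs

    def send(self, context: str, approved: bool) -> str:
        safe = redact(context)
        if not approved:             # step 3: a human must review and approve
            raise PermissionError("context not approved by a human reviewer")
        response = self.ask_model(safe)
        # Step 4: tag the response so downstream readers use it with discretion.
        tagged = "[AI-GENERATED: use with discretion]\n" + response
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "sent": safe,
            "received": tagged,
        })
        return tagged

# Usage with a stand-in model; note the secret never reaches the model.
session = GovernedSession(ask_model=lambda prompt: "check the volume mount")
reply = session.send("pod: java-app\npassword: hunter2", approved=True)
```

The point of the wrapper is that approval and redaction are enforced in code, not left to the user's memory, and every exchange lands in the audit log.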
Here I have a simple resource: a Java application deployed as a pod. It has a command to run app.jar, which is not actually present in the image, and there is a volume mount that is supposed to mount that jar file into the container. Since the jar file is not there, the container doesn't start; it is unable to access the jar file. I'm going to propose a pipeline architecture that lets us solve this troubleshooting issue. I built a dataset from different Kubernetes sources, scraped the data, and fed it into Pinecone. Then I'm using a Llama 2 7B base model, with RetrievalQA and a prompt template built specifically for troubleshooting. Whatever we get out of this RAG pipeline, I chain the prompt and send it to the presentation layer, which is mostly for the UX. I have not included the governance workflow here; I'll show it as part of the complete solution later. But first I want to show you the difference between using a plain base model and the RAG pipeline.

This is what plain Llama 2 responded with. It came up with suggestions: check that the container has enough memory, which is nothing to do with the problem; verify that the Java command is correctly configured, again not related; then check the classpath, which yes, could actually contribute to the problem; and then check for network issues and things like that. Most of it is incorrect because it's not a fine-tuned model. But fine-tuning or building a model of my own takes a lot of effort. RAG simplifies that: you can just collect the sources, build a dataset, and have this working really quickly. This is what we got from Llama 2 7B along with the RAG pipeline, and all three scenarios fit very well with the problem at hand. The first one says missing or incorrect file path. The second is an incorrect jar format, so it could be corrupt.
The third is insufficient permissions or access-control issues, because we are doing a volume mount. Now, this is quite promising, and this is exactly what we should count on.

Next, let's see how a low-code IDE can work with AI to form a complete troubleshooting solution, and what advantages low code can bring. For instance, you can have templated workflows. You can have ready-made context integrations with third-party tools, so the context is readily available for the AI. You can have an efficient prompt template. The benefit we get out of that is quick issue resolution. We can build our knowledge base into the context, which I'm going to show you in a while, and you can also have automation around support ticketing and other things. You see here on the left-hand side I have a catalog of different context sources: the resource specification, events, logs, metrics, and so on. We can feed those into the chatbot, and whatever outcome we get, we can capture in the form of documents. These runbooks are essentially context-aware documentation that sits side by side with your Kubernetes resource, so it's very handy and you can pull in that information really quickly. You can also direct the outcome to a Jira ticket and have some automation around that.

Okay, so I'm going to show you a demo of what I have described so far. I have a recorded demo, as I couldn't show you a live one. Just give me a moment. I'm sorry if the resolution is not good enough, but I think it should be fine. I want to first walk you through the IDE itself, because that forms the foundation for bringing in the context. This is an IDE where you can filter resources, have a live view of all the resources, and open consoles. You can have the logs, and you can have multiple logs open.
You can edit the YAMLs and maintain them. You can also have a schema form, which is a low-code editor with side-by-side documentation for modifying your resources.

So let's talk about the AI here. I have two AI assistants registered: ChatGPT and a Llama LangChain integration. And I have a code repository with runbooks in it, a runbook hub. It's basically nothing but a repo with a .gp YAML file containing the specification of different selectors and the documentation that goes with them. Any resource that matches a selector gets tagged with that document. You can build this from the solution UI, or you can write the YAML file from scratch. I have two runbooks within this hub; we'll take a look at them in the UI later. I also have one Jira account.

Now you can see that the runbook hub has two runbooks in it and it's attached to this cluster. The pod-pending runbook gets attached to anything that carries a particular label. This documentation is generated by AI, so we have a disclaimer at the bottom: whenever content is modified or generated by AI, we add a disclaimer saying it was generated by AI, so you should use it with caution. So we have tagged this. Let's go back to the cluster and see how the resources get tagged. I can view the resources and filter them based on the runbooks. I have one image-failure container, which gets tagged here, and you have side-by-side documentation of what could go wrong. You can use all the context available on the left-hand side to pull in the information and enhance this even more.

Okay, let's look at another use case, the Java issue that I talked about. I have the application deployed in this cluster. Let's filter that resource and see how we go about troubleshooting. I filter based on the pod and fetch the pod called java-failing, or something like that.
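The selector-based tagging just described can be pictured as a simple label match: every runbook carries a selector, and any resource whose labels satisfy that selector gets the runbook's document attached. This is a toy sketch; the field names (`selector`, `doc`) are illustrative assumptions, not the product's actual .gp YAML schema, which isn't shown in the talk.

```python
# Toy runbook tagging: attach runbook docs to resources by label selectors.

def matches(selector: dict, labels: dict) -> bool:
    """A resource matches when every selector key/value appears in its labels."""
    return all(labels.get(k) == v for k, v in selector.items())

def tag_resources(runbooks: list, resources: list) -> dict:
    """Return a map of resource name -> list of runbook docs attached to it."""
    tags = {}
    for res in resources:
        docs = [rb["doc"] for rb in runbooks
                if matches(rb["selector"], res.get("labels", {}))]
        if docs:
            tags[res["name"]] = docs
    return tags

# Illustrative data echoing the demo: a failing Java pod and an unrelated pod.
runbooks = [
    {"selector": {"app": "java-app"}, "doc": "jar-troubleshooting.md"},
    {"selector": {"state": "pod-pending"}, "doc": "pod-pending.md"},
]
resources = [
    {"name": "java-failing", "labels": {"app": "java-app"}},
    {"name": "web", "labels": {"app": "nginx"}},
]
```

Running `tag_resources(runbooks, resources)` attaches only the jar troubleshooting doc, and only to the java-failing pod, which is the automatic-tagging behavior the demo shows.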
Okay, let's open the chat UI; the redaction is on by default, so anything we add into the context gets redacted. You have a catalog of all the logs and metrics you can pull in. I'm going to add the resource specification, and you can see it's all redacted. You can review it and approve it. I approve it, then I add the container logs as well and approve those too. Then we can add events and metrics. I'll show you how the metrics look when you add them: I'm adding an event and a Prometheus metric. I know it's not a related issue as such, but it's just to show that it's possible to bring metrics in. You get a graphical view of all the metrics.

Then we can troubleshoot. I'm using ChatGPT here, and we should get troubleshooting tips from it; we have an inbuilt template for the prompt. This is what we see: it gives you a consolidated list of things that could go wrong, along with the expected versus the actual values, so it's able to clearly pinpoint the issue. The expected state is that the jar is there; it's not present, and you get a list of commands. It also gives you a flow chart of how the whole set of steps can be executed. So this is with ChatGPT: it did not get confused even though I added a lot of noise to the context.

You can also see the list of options possible from this chat. You can create a Jira ticket right from here based on the generated content, and you can also create runbooks from here. It created a Jira ticket with all the context attached to it, so it's pretty much automated. And you can also create a runbook from here. You'd see some random name there.
That's because I confused ChatGPT with the Prometheus alerts; otherwise it would have given me a much more meaningful title. You also see a history of all the chats we had, so we can use it for auditability later on.

Let's move on to the Llama integration we have and see how it performs here. I have the same java-failing pod and I'm going to use the Llama chat. I'm going to add just the specification and the logs, so I add the specification, I add the log, and then we troubleshoot. It's the same example I showed, but with a better representation: the three causes it identified, along with the workflow. It could fail for file-path reasons, jar format, or volume mounts, and it gives you a neat representation of how you can go about solving the problem. That's the end of the demo, and I hope you liked it. Let's get back. Oh, yeah. Thank you.

Okay, so I'm going to summarize what we have spoken about so far. We have seen the AI challenges, and we have seen how the prompt, the model, and governance can make a huge difference. I showed you an example of a hybrid of fine-tuning and RAG, and how low code and AI can go hand in hand in troubleshooting issues. I'm going to stop here for Q&A, but before that, we have a SaaS version, and I'm just announcing the SaaS release here. If you are interested, you can go subscribe. At a very basic level there is no payment or anything; you can just go try it online. If you go to the website, there is a start-free-trial option; you can sign up and you will be able to use chat for one cluster of your own. If you're looking for anything on-premise, this version is not yet available on our marketplaces, which we'll be releasing very soon, but we do have an early access program.
You can get in touch with us and we'll give you the lite or community edition, with some enterprise capabilities too, but again it's going to be on ChatGPT. If you want an enterprise edition, let us know and we can work together on the Llama version; and if you have a RAG pipeline in-house, we can also integrate with that. You can find me on Twitter under my name, Vinothini Raju, and I have kept my DMs open, so if you want to contact me, you can DM me. If you liked this, please use this tag and tweet about the solution; that's going to be good encouragement for us. If any of you are watching the live video and couldn't make it to the event in person, please do share your feedback on the YouTube video; you can send in your comments and we will take a look at those as well. All right, I'm done with the presentation, so we'll open up for Q&A. Are there any questions? Yeah, go ahead.

Thank you very much for the wonderful presentation. The model, when it generates or corrects something, does it actually improve, and does it store the improved model somewhere?

Okay, so I have not done any fine-tuning of the model; I have built a RAG pipeline. Let me go back to that slide. Here I'm just scraping the data; I have a dataset on Hugging Face, and then I use the base model integrated with the vector DB. But we can do fine-tuning, which again we are working on. You can still have a fine-tuned version, either on Hugging Face or locally as well.

And the low code it generates, can you actually view that and improve upon it?

You mean the response?

Not the response; the whole UI behind the scenes is generating some code.

Okay, the templates, you mean. Right now, we are providing a template ourselves.
But in the chat, it's not limited to just troubleshooting or optimization; there is also a free form, which means you can send your own content to it. At this moment we don't have a template option where you can create your own template and use it for your chats; that's something we will have in a few releases. Yeah, thank you.

Next question, at the back. Thanks a lot for the presentation. In your example, you showed the AI solving a problem that is, let's say, quite general, because it refers to a missing file. If someone wants to use the model for their own purpose, for their own particular application, is there any chance to inject a custom model, or custom enrichment into the pipeline, to identify problems in a custom application as well?

Yeah, that's a great question. In fact, we looked at some of the existing models and did not come across any model that is fine-tuned for this, and even when we tried, the answers were not really accurate. We are looking into fine-tuning; we are not building a model of our own, but we are looking to fine-tune one. However, if you have your own models, you can bring them in; in our enterprise version you can have your own models and we can do the integration with that.

Thank you for the presentation. Who do you imagine is the target audience for such a tool? Is it support engineers, or developers, or users, or the operators themselves?

Yeah, I think the primary audience is IT operations. But then, as they say, DevOps is for everyone; with the low-code IDE even developers can use it, but our primary audience here is the IT and SRE folks.

Thank you for the presentation. During your demo, you needed to type an approve command several times.
Can you clarify what the purpose of that was? Is it for some security or compliance reason?

Yes. As I said, we don't want AI to decide things for us; we need better control over how we interact with the AI. Unless we review and approve the content, the solution doesn't want to assume that just because we have redacted it, we are okay to send it across to the AI. We don't want to automate that; we want to make sure users are in control of what they really send to the AI, especially because it's happening in your environment. We want to make sure we are not sending anything sensitive.

Hi, thank you so much for this talk; I learned a lot. One thing I missed in the presentation, which you could perhaps cover again, is when the tagging of resources happens. Is it in a pipeline, does it happen automatically, or is it a human tag? I saw the tagging and thought it was amazing; I wish I could do it in an automated fashion, because my humans don't want to do it.

Okay, so we do automatic tagging, because developers can forget and it can be an oversight, so we want to make sure it's tagged automatically. In the case of a chat, we already know the content is coming from the AI; in the case of runbooks, as soon as somebody uses the AI enhancement, we automatically tag it.

Hello, thank you for your presentation. I would like to ask if it's possible to bring this to the next level: for example, to have some templates and maybe some custom models defined, and then, based on those, as soon as we run the pipeline, to create some alerts. Because if you are talking about thousands of clusters or whatever, nobody is going to manually check the UI.
Is it possible to integrate the outcome with some observability tools?

That's a great question. We have the API, so it is certainly possible. The reason we have an approval process is to make sure nothing goes out of our control. From a technical point of view it is certainly possible, but whether we want to do it depends on the organization. If they are fine with it, we can review what kind of content we want to send; if you're 100% sure there is nothing sensitive and everything gets redacted, we can still do that.

Hello, right there. Thank you for the presentation. I have a question about the performance of your product when you're dealing with exotic or custom resources, private custom resources. Does your tool handle this well, or do we need to retrain the model using our private resources?

That's again a great question. We played around with 7B and 13B, and we felt 7B was better, but it was not yielding the results, so we had to change the temperatures, and we brought in the data sources, and then we realized we had to add a lot more to it. And that's where, to answer the earlier question as well: however much we train, even if it is accurate to an extent, we are not letting the AI make the decisions in operations. You can always look at this as guidance, as an assistant that is helping you. Another thing we realized is that when we give a noisy input, like a huge prompt (especially a system resource, with all the status, managed fields, and so many attributes), and we dump all of that to the AI, it does not perform well. So we had to fine-tune the prompts themselves.
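The point about noisy inputs can be pictured as a small pre-processing step: before a resource spec goes into the prompt, server-populated noise such as `status` and `managedFields` is stripped out. A minimal sketch follows; exactly which fields the product strips is not stated in the talk, so the field lists here are assumptions.

```python
# Toy prompt trimming: drop server-populated noise from a Kubernetes
# resource before it goes into the prompt. The field choices below are
# illustrative assumptions, not the product's actual rules.
NOISY_TOP_LEVEL = {"status"}
NOISY_METADATA = {"managedFields", "resourceVersion", "uid", "generation"}

def trim_for_prompt(resource: dict) -> dict:
    """Return a copy of the resource with noisy fields removed."""
    trimmed = {k: v for k, v in resource.items() if k not in NOISY_TOP_LEVEL}
    if isinstance(trimmed.get("metadata"), dict):
        trimmed["metadata"] = {k: v for k, v in trimmed["metadata"].items()
                               if k not in NOISY_METADATA}
    return trimmed

# Illustrative pod echoing the demo's failing Java application.
pod = {
    "kind": "Pod",
    "metadata": {"name": "java-failing", "uid": "abc-123", "managedFields": []},
    "spec": {"containers": [
        {"name": "app", "command": ["java", "-jar", "app.jar"]},
    ]},
    "status": {"phase": "Pending"},
}
```

After trimming, only the fields a troubleshooter would actually reason about (kind, name, spec) remain in the prompt, which is the kind of prompt fine-tuning described above.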
From our observation, if you have an internal database of all this knowledge and you can create a fine-tuned model from it, that will also improve the performance of the RAG pipeline. We have our own base version, which we have tested for system resources, but if you have a pipeline, we can do the integration.

Thanks for the presentation and the slides, and the demo, of course. You were basically feeding the model all these data points and approving them. But how do you know which data points to feed it? How would the SRE or platform engineer know that something is wrong, where it's wrong, and what to feed the model to actually get to the problem at hand?

Yeah. So, we have a few data points to start with, but we can always ask the AI model what additional information it is looking for; it can give back suggestions, and you can pick those from the catalog.

Yes, but how do you know what's wrong? How are you actually alerted that something is wrong, so that you start feeding that information to the AI? I'm missing the starting part of the journey that actually brings the value to the engineer.

The starting point is where you start to see the errors happening; for instance, an observability tool. We are not an observability tool ourselves; we are an IDE. But you can integrate with an observability tool. We are not going to automate a decision out of that, but we can always integrate with one. So the starting point is wherever you observe the problem; from there, you can ask more questions and add more context. Yeah. Thank you.

Thank you, Vinothini, that's a great demo. How do you create the Jira tickets right now? Are you using autonomous agents or anything like that?
Or do you have plans to use them?

We don't have agents; it's a simple API integration. Yeah, thanks.

So I think we have run out of time. If there is one more question, we can take it; otherwise, please give a hand to Vinothini. A great demo and a great presentation. Thank you.