All right, I think it's about time, so let's get started. I'll conduct this talk in English since we have some foreign guests here. My name is Qian, and today I'm going to talk about how we use large language models to help us simplify our cluster management. A little introduction about me: I started my career at Google as an SRE, and in 2019 I moved to Ant Group to become a tech lead and manager for the infrastructure SRE team. I focus mainly on Kubernetes and other cloud-native solutions at Ant Group. As SREs we wear different hats in our day-to-day life: we write automation programs to reduce toil in our operations, and sometimes we do firefighting to handle production issues. In the past few years we've seen huge growth in our cluster count as well as our pod count, but our SRE headcount has stayed fixed. That means we need to seek out every bit of help we can get to facilitate our day-to-day work, and this year language models are a hot topic, so we are trying to tackle this issue with language models. Before we get into the details, I want to quickly clarify what this talk is not. It is not a GPT-101 course: I'm not going to touch on pre-training or algorithms like supervised fine-tuning (SFT). I know LangChain is popular these days, but this is not a LangChain tutorial, and this project has nothing to do with the k8sgpt project on GitHub. In SRE land we deal a lot with failures and incident stories, so please think of this talk as a postmortem review, or a progress report on our exploration of language models, rather than any kind of best practice. There is no best practice at this point in time, I think. I will conduct this talk in four sections. 
I will quickly go through our motivation for using language models, then give a quick definition of our goals and requirements, and then comes the fun part: the different experiments and attempts we made to solve the issue. Hopefully you can all get some ideas and takeaways from that. All right, so remember, the topic today is cluster management, and here is our motivation. Imagine a typical day: you have just sat down at your desk with a cup of coffee when your pager goes off. It says multiple clusters are on fire and some nodes keep getting OOM-killed, and this time you have no choice but to jump into a war room to solve the issue, because multiple business leaders are already yelling at you that their business is being impacted, and your big boss is watching over your shoulder, asking you for the root cause, the mitigation steps, and so on. You are desperate to find the correlation between these nodes: are they all a specific CPU model, are they on a specific kernel version, or was a particular type of application running on top of them? You have a million assumptions and you would like to verify them quickly. But I highly doubt that, under such high pressure and in such an emergency, you can type all those kubectl commands correctly in a single pass. That's the first motivation. It took us about an hour to figure out that in this particular case, the problem came from a DaemonSet change that had a performance issue on a certain CPU model. In the postmortem review, one action item we agreed on was to shorten the time it takes to pinpoint or locate a problem. At that time, multiple tech leads joined the war room, and they contributed very thoughtful and valuable information and assumptions. 
But they were not from the Kubernetes team, so they could not directly touch the cluster management system, even though their ideas were valid. So we need a solution that lets them do quick, self-service verification or mitigation by themselves. Also, during the mitigation stage of this incident, we had to ask users to do self-service operations such as scaling up their deployments, rebooting nodes, or draining traffic. These actions are not frequently used in their day-to-day life, so they cannot remember all those commands correctly or quickly pull out the knowledge they need. We categorize this type of issue as operational efficiency within the bigger picture of cluster management. As you can see, it may just be the tip of the iceberg, and we face many different challenges. Each topic listed here really deserves a separate discussion, but let's stay focused today. So, on to the definition part. We concluded that what we need is a solution that can quickly translate human intention, expressed in human language, into machine API calls, or anything that can interact with machines directly. In the operational world, everything is a command to a specific system, an API call, or the tuning of some kernel parameter, and all of that is structured data. But human intentions can be expressed in a thousand different ways. So what can fill in this question mark? Before language models became a hot topic, we wrote automation programs and web front ends: a website with a bunch of clicks and buttons, where you type in some parameters so you can, for example, reboot your nodes or scale up your deployment. 
In that scenario, we were really writing a user interface, and that user interface does the translation between the unstructured data, our intention, and the APIs that the cluster management system actually needs. Since language models naturally take human language as input, we thought: okay, let's try using a language model as the new user interface. But using one in SRE land is kind of dangerous. We have to be very careful about the input to and the output generated by the language model. We listed four criteria to verify whether a model is valid and can be used in production. First is accuracy, and our standard here is very high, above 99%. Remember, you are actually operating the cluster: if you want to reboot node A but the language model generates an answer to reboot node B, that is totally unacceptable. Second is latency: we want this to happen really fast, otherwise you may miss the window to mitigate the issue. If you have used ChatGPT or similar online chat models, you know the latency can be a little high, which is not acceptable in our scenario. Third is security: we are not going to share our internal data with any public model; otherwise our internal security would be very vulnerable. So we have to use our own model, or at least host it ourselves. Fourth, all our internal APIs and operations evolve really fast, and the language model has to keep up with the pace of that API evolution. 
That's the fourth criterion. Okay, enough abstraction on this type of problem; now comes the fun part. Next I will share some experiments we did with the language model. The first attempt was API invocation. By the time we had this idea, LangChain and AutoGPT were attracting everyone's attention, and everyone was talking about building an agent that learns to use tools to solve issues in a particular domain. So: let's make an agent that can solve issues for Kubernetes. We quickly drew the following diagram to express the workflow: when an SRE expresses their intention in natural language, it goes through a planner model, a language model that plans the actions, and the actions are executed by different executors, where each executor corresponds to a deterministic API call or function that can interact with the machines directly. So far so good. And by the time we had this idea, we already had a head start: we had already deployed a chatbot, which we call the ops bot, in production to help our on-callers solve their on-call issues. For example, if an alert fired and you got notified and wanted to do a simple restart, you would type into the chatbot something like `restart --component apiserver --cluster foo`, and the bot would do the restart for you. But although we had run this bot in production for at least a year, it was not frequently used. People didn't like it because they could not remember all those commands correctly, and they could not type them in a single pass. So here comes the language model. 
You naturally think: okay, if I can just say "help me do a simple restart", the model can understand, extract the named entities, say the component entity called apiserver and the cluster entity, fill in the API call, and make the call. The workflow is very straightforward, but we quickly hit a bottleneck: we cannot use the open models, because open models do not have our internal knowledge. They don't know how we do a machine restart at Ant Group, or how our APIs are shaped; we would have to teach them. But for security reasons we cannot use OpenAI or ChatGPT. Luckily, at that point we had already started our internal language model development, so we had some internal models we could fine-tune and evaluate. The next question was how to generate the training data. The training data is a bunch of question-and-answer pairs: I can write some templates in a prompt, say "restart a node", with different ways to express that single action, paired with a fixed action format for the model to fill in, and then replace the node with different entities in our domain. So here is our first experiment. We picked about 10 APIs from our ops bot, such as scale up a deployment, reboot a single node, or retrieve the pods in a specific namespace, and generated about 40,000 question-answer pairs as training data for fine-tuning. We then took five different model candidates, threw the data in, and hoped to see magic happen. And it did happen. We picked three models; at that point I don't think CodeLlama or LLaMA 2 was available, so we used GPT-Neo and GPT-J as benchmarks. 
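The template expansion described above can be sketched in a few lines. This is a minimal illustration, not our actual pipeline: the paraphrase templates, entity lists, and the JSON action format are hypothetical stand-ins for the internal templates that produced the ~40,000 pairs.

```python
import itertools
import json

# Hypothetical paraphrase templates for one intent; the real set covered ~10 APIs.
RESTART_TEMPLATES = [
    "restart the {component} in cluster {cluster}",
    "help me do a simple restart of {component} on {cluster}",
    "please bounce {component} for cluster {cluster}",
]

# Illustrative entity values; in practice these come from the domain inventory.
COMPONENTS = ["apiserver", "scheduler", "controller-manager"]
CLUSTERS = ["foo", "bar", "prod-eu1"]

def generate_pairs():
    """Expand every template against every entity combination into
    (question, structured-action) fine-tuning pairs."""
    pairs = []
    for tpl, comp, cluster in itertools.product(RESTART_TEMPLATES, COMPONENTS, CLUSTERS):
        question = tpl.format(component=comp, cluster=cluster)
        # The fixed action format the model must learn to emit.
        answer = json.dumps({"action": "restart", "component": comp, "cluster": cluster})
        pairs.append({"question": question, "answer": answer})
    return pairs

pairs = generate_pairs()
```

Three templates times three components times three clusters already yields 27 pairs; scaling the entity lists and intents up is how a small team can reach tens of thousands of examples in a couple of weeks.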
We fine-tuned our internal models as well. As you can see, it did help us extract named entities such as IPs, pods, and namespaces, and it could classify the intention with very good accuracy. So we quickly made a prototype and replaced our chatbot with this language model. However, when we asked for user feedback, it was not that good. The model actually failed in multiple respects. One thing it failed at was comprehending the different constraints users set when querying Kubernetes resources. I'll give you a few examples. A user may ask, "help me retrieve the pods in Pending status that have been there for four hours", or they want all nodes on a specific kernel version, or they have all kinds of questions you cannot enumerate in your training data. I don't know if any of you can write a single kubectl command to get all that information; at least I can't. So someone on my team said: okay, maybe we just train a model to translate human intention into kubectl commands, plus some shell script, some grep, some awk, to get the result. But this idea was quickly beaten down, for several reasons. First, shell scripts are less structured and too flexible and too powerful: they can generate malicious actions and be dangerous if you blindly trust the output. For example, what if it generates `kubectl delete pods --all-namespaces`? You would be doomed. And it's hard to validate: there are a million ways to write a shell script that produces the same result, so how do you evaluate it? At that point, we got another idea from a sibling team: our DBA team had a project called DB-GPT. 
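To illustrate why validating generated shell is so hard, here is a minimal sketch of the kind of guard you would need before executing model output. The verb allowlist is a hypothetical example, not a safeguard we actually shipped, and even this only scratches the surface: it can reject an obviously destructive command, but it cannot tell whether a "safe" query actually answers the user's question.

```python
import shlex

# Hypothetical read-only allowlist; anything mutating is rejected outright.
SAFE_VERBS = {"get", "describe", "top", "logs"}

def is_safe_kubectl(command: str) -> bool:
    """Reject any generated command whose kubectl verb could mutate state.
    This catches `kubectl delete pods --all-namespaces`, but judging
    correctness of free-form shell (pipes, grep, awk) is far harder."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "kubectl":
        return False
    # Skip leading flags (e.g. --context=prod) to find the verb.
    verb = next((t for t in tokens[1:] if not t.startswith("-")), None)
    return verb in SAFE_VERBS

assert is_safe_kubectl("kubectl get pods -n kube-system")
assert not is_safe_kubectl("kubectl delete pods --all-namespaces")
```

This is exactly the flexibility problem: the allowlist has to anticipate every dangerous shape, while SQL, as discussed next, constrains the output space by construction.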
The idea is to translate human intention, expressed in text, into SQL. So we thought: what if we can query Kubernetes resources using SQL? SQL is more controllable and auditable, and there is a lot of text-to-SQL data we can use for fine-tuning or model training. So we built a prototype: instead of letting the language model query the cluster directly, we add a cache layer. The cache layer is a controller that follows the list-watch pattern, caching all the pods, nodes, and other resources you want, and we do some tricks to convert them into a format that a SQL engine can understand and query. From that we get table schemas, and together with the user's question and intention, we send the schemas to the language model, hoping it generates the SQL that retrieves the data. Because of the time limit, I won't get into how we do the SQL conversion today; we can do that offline or another time. So we had this idea and this prototype, but users were still not buying it, because there is still too much internal knowledge represented by the labels and annotations on those pods. Everyone, especially the authors of custom controllers and operators, loves to patch labels and annotations onto pods and nodes to make them special and give them different meanings. So it seemed we were stuck. What could we do? How could we improve this? Let's go back to basics. We said: let's take a step back and test the capability of the language model. We ran two simple tests. 
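The cache-layer idea above can be pictured roughly as follows: a list-watch stream feeds resource objects into a SQL table that the generated queries run against. This is a sketch under stated assumptions; the column layout, the JSON flattening of labels, and the use of in-memory SQLite are illustrative choices, since the talk deliberately skipped the real conversion tricks.

```python
import json
import sqlite3

# In-memory SQLite stands in for the cache layer's SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pods (
        name TEXT, namespace TEXT, phase TEXT,
        node TEXT, labels TEXT  -- label map flattened to JSON text
    )
""")

def on_watch_event(event_type: str, pod: dict):
    """Mirror ADDED/MODIFIED/DELETED events from a list-watch into the table."""
    meta = pod["metadata"]
    # Upserts and deletes both start by clearing any stale row.
    conn.execute("DELETE FROM pods WHERE name=? AND namespace=?",
                 (meta["name"], meta["namespace"]))
    if event_type == "DELETED":
        return
    conn.execute("INSERT INTO pods VALUES (?,?,?,?,?)",
                 (meta["name"], meta["namespace"], pod["status"]["phase"],
                  pod["spec"].get("nodeName", ""),
                  json.dumps(meta.get("labels", {}))))

# A pod object shaped like one arriving from the watch stream.
on_watch_event("ADDED", {
    "metadata": {"name": "web-1", "namespace": "default", "labels": {"app": "web"}},
    "spec": {"nodeName": "node-a"},
    "status": {"phase": "Pending"},
})

rows = conn.execute("SELECT name FROM pods WHERE phase='Pending'").fetchall()
```

The point of the cache is twofold: the generated SQL never touches the API server directly, and the table schema gives the model a small, fixed vocabulary to target instead of the full Kubernetes API surface.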
The first test: given a YAML snippet, can the language model extract the target value based on a label's name? This is a very simple task, just a key-value lookup. So we quickly generated some synthetic data: YAML snippets plus a bunch of questions asking for one label value, two label values, or a mix of them, fine-tuned our internal model, and checked the result. Different people can read these tables and diagrams differently, but our insight was that the model's accuracy is determined by the coverage of the training samples. What does that mean? If the model sees about 60% of the labels in the training data, then for the remaining 40%, labels that never appeared in the training set, it can still recognize them and successfully extract the value with very high accuracy. That means the model is not just memorizing everything; it has the ability to learn a pattern. That gave us confidence. But in reality, nobody is going to ask the model this kind of question; you can just do it yourself with a simple grep. So what if we ask whether the language model can extract the target value based on a label's meaning instead of its name? Here is an example: say there is a custom label key called xyz, and the background knowledge is that this label means the pod can survive a short-term system blip. So when the user asks whether the pod will survive a kubelet hot update, which is a short-term system blip, the model should be able to answer yes. So we ran another experiment on this scenario. 
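The synthetic key-value lookup test described above could be generated along these lines. The label pool, question phrasing, and the one-question-per-snippet shape are invented for illustration; the real experiment also held roughly 40% of labels out of training to measure generalization.

```python
import random

# Hypothetical label vocabulary; in the real experiment a portion of labels
# was held out of training to test whether the model learns the pattern
# rather than memorizing keys.
LABEL_POOL = ["app", "team", "zone", "tier", "release", "owner"]

def make_sample(rng: random.Random):
    """Emit one (yaml_snippet, question, answer) triple for fine-tuning."""
    labels = {k: f"value-{rng.randint(0, 9)}" for k in rng.sample(LABEL_POOL, 3)}
    snippet = "metadata:\n  labels:\n" + "".join(
        f"    {k}: {v}\n" for k, v in labels.items())
    target = rng.choice(sorted(labels))
    question = f"What is the value of the label '{target}' on this pod?"
    return snippet, question, labels[target]

rng = random.Random(0)
snippet, question, answer = make_sample(rng)
```

Because every triple is constructed, the gold answer is known exactly, which makes scoring the fine-tuned model a straightforward string comparison.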
Due to the time limit I won't go into much detail, but by this time the LangChain framework was already popular, so we tried using it to combine the background knowledge: background knowledge plus the YAML data plus the user question, filled into templates to generate a bunch of training data for our internal model. We also know everyone says there is a new job called prompt engineer, so we tried different prompting techniques, for example chain of thought, and they do actually help improve the results. As you can see, regardless of model size, whether 1.3 billion or 6 billion parameters, with CoT plus memory and background knowledge the model answers the question better, especially for labels and keys it never saw in the training data. Okay, so putting things together: here is our final solution for using a language model to query Kubernetes resources under the constraints users set. What do we have? A prompt engine that applies different prompting strategies: chain of thought, self-consistency, anything you can name. The engine interacts with our internal knowledge database, which fetches the background knowledge corresponding to the user's input. Together with the pods' SQL table schema, we send all of this to the language model and get back the SQL itself. On the caching and Kubernetes side, we also added a federation layer, so that we can query not only a single cluster but across multiple clusters. That's our final design. All right, a quick summary: what have we achieved so far? In our experiments, whether API invocation or kubectl-style get commands, I think the language model has the potential to be a good SRE copilot. 
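The prompt engine's job can be pictured as assembling three pieces, background knowledge, table schema, and the user question, before calling the model. A minimal sketch follows; the dictionary-lookup retrieval, the xyz example from earlier, and the exact prompt wording are all illustrative assumptions, not our production prompt.

```python
# Hypothetical knowledge base mapping label keys to their internal meaning
# (the xyz example from the talk).
KNOWLEDGE_BASE = {
    "xyz": "a pod with this label can survive a short-term system blip",
}

POD_SCHEMA = "CREATE TABLE pods (name TEXT, namespace TEXT, phase TEXT, labels TEXT)"

def fetch_background(question: str) -> list:
    """Naive retrieval: return knowledge entries whose key appears in the
    question. The real system uses a dedicated internal knowledge database."""
    return [f"label '{k}' means {v}" for k, v in KNOWLEDGE_BASE.items()
            if k in question]

def build_prompt(question: str) -> str:
    facts = "\n".join(fetch_background(question)) or "(none)"
    return (
        f"Table schema:\n{POD_SCHEMA}\n\n"
        f"Background knowledge:\n{facts}\n\n"
        f"Question: {question}\n"
        # Chain-of-thought instruction: reason first, then emit SQL only.
        "Think step by step about which columns and labels are needed, "
        "then output a single read-only SQL query."
    )

prompt = build_prompt("Which pods labeled xyz are in the default namespace?")
```

Note how the token budget problem mentioned later falls directly out of this design: every extra knowledge entry or prompting technique lengthens this assembled prompt.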
It has already demonstrated its capability for named entity recognition as well as intent classification, and for these very specific tasks I don't think you need a 100-billion-parameter model; a smaller model may already be enough. But remember, our task is to operate the cluster, and at this point I don't think a language model can operate a cluster by itself. It still needs careful education, and humans still need to review the output. It can be a good copilot, helping you generate a command quickly, or write a bunch of SQL or configuration code. But let's not be that creative: in our scenario we want operational efficiency and productivity, and that is a very deterministic problem. There is no silver bullet for the questions listed here from our journey. Hallucination is the top enemy, especially in our scenario. And long context is one of the challenges we are going to tackle next: we estimated how many tokens we actually send to the language model for a single query, and currently it's about 4,000 tokens each time. Most of the models commonly available today take only about 4,000 tokens at a time, so if we want to add more background knowledge or apply more prompting techniques, we need a longer context, and that's the issue we are going to tackle next. Everyone also talks about multi-modality, using text, pictures, and so on, but in the SRE world I don't see any model today that can handle time-series data like logs, metrics, or traces well; they all have an issue similar to the long-context problem. I think everyone can think about this, because there is still a long way to go. 
Finally, what lessons can we take away from today? If you want to use a language model in your own scenario or application, OpenAI is a very good starting point to verify your idea; it is still a very good baseline to test against in your own scenario. But once you determine this is the task you want to solve and you want to bring in your own model, please focus on collecting your user data. Data is the key to this problem. As we all say, the quality of your data determines the upper bound of your model's performance; the algorithm is the way to help you approach that bound, but the data is more important. In our story I skipped a lot about our engineering effort, but in the past six months we devoted about 80% of our time to engineering work, because training the model or doing supervised fine-tuning is fairly straightforward as long as you can prepare the data. And please don't think of AI as magic: in the SRE world, sometimes a few if-else statements can already do the job, and in that case you don't need AI. With that, I will conclude my talk today. We plan to open-source our model and the kube query engine, but that's still in progress, so you can follow this official account to keep up with our updates. I created the account last year; my intention was to write some stories about SRE work life and SRE best practices. There's not much in the account yet, but we'll get to it. So thank you all for coming today. I know it's about time, and it's almost the holidays, so happy holidays. I can take questions in either Mandarin or English. 
So the question is about how much time we spent on training to get a model. It's mostly about getting the data, which is an engineering effort: we write the code and use templates to bootstrap all the data, and that's about two weeks of work. The actual model training, because the model size is small, takes less than a week to get your first model for quick verification. [Audience question, partly inaudible: ChatGPT is also continuously improving as many people ask it questions. How does your model compare with ChatGPT, which others already have and use?] The question is about comparing with ChatGPT. ChatGPT itself is also evolving: people ask it questions and it gets trained on them. In our scenario, our model is more focused and dedicated to solving a specific domain. I think that's why everyone these days is talking about vertical models for different domains and areas. I don't think you have to compare them; use whatever fits you. [Audience follow-up, partly inaudible: for common things, ChatGPT has already trained a very good model; for company- or project-specific things, each has its own training. How do you integrate the two? For the common parts, is retraining really better?] I think so, yeah. 
So we should leverage the power of open-source models like CodeLlama for the common, base language capabilities. But the domain-specific languages and knowledge, say what Ant Group does with its API calls, must be different from what your company does. For that part, you either use in-context learning to provide the information to the model, or you fine-tune on top of CodeLlama or other available models. You don't have to do it from scratch. [Audience question, partly inaudible: for simple commands, we can already write them quite accurately ourselves. But you said at the beginning that the biggest problem is troubleshooting, combining tracing, metrics, and logs. Do you think the model can do more there? After I ask a question, can it troubleshoot better than we can? Because we troubleshoot step by step, and ChatGPT doesn't really follow that kind of process.] Okay, so the final question is: for troubleshooting, how can ChatGPT or a language model help us? I haven't seen a lot of work in this area, but my gut feeling is that it can help, though we have to teach the model first. By teaching, I mean we have to prepare lots of data. Think about a seven-year-old child learning something new: the teacher asks them to recite and memorize, repeating it again and again. That's what we do with a language model during the training phase. So yes, we are definitely trying that, but I think it will still take time. Thank you so much. 
[Audience question: My understanding is that this system and tool is used for figuring out ops issues, especially during incidents. Have you considered or done anything on the reliability side: making sure the thing that serves the model doesn't itself have issues, or any kind of failover plan?] Okay, so the question is about using the language model for reliability reviews, say during an architecture review, to spot anything? [Audience clarifies: The model is used during operations, when there is an issue and you're trying to fix it or run commands, so it's already part of the tools that operators and SREs use. The tool has to have some availability guarantees. Is there any consideration about the model-serving system or the toolchain, any failover plan in case the tool or the model is just not available?] Oh, so the question is how to guarantee the availability of the model itself. I think that's an MLOps topic; we can discuss it later. [Audience question: Quick question. How do you ensure the generated SQL is correct? Did you send it back to the model to modify it?] We haven't tried that, but we have two approaches. One is human evaluation: you have, say, a hundred questions, humans write their own SQL, and you compare it with the generated SQL. And I think perfect accuracy is actually not that important. Why? I didn't show my demo today because it wasn't ready yet, but let me picture it for you: the user asks the question, and the SQL is generated. 
The generated SQL shows up together with the query result. If the data is not what the human wants, they can modify the SQL and run the query again. The main purpose of the tool is not to get this 100% correct, but to help you quickly get the SQL there so you can make small tweaks. Yeah, thank you. [Audience question: Thank you for sharing your journey applying LLMs in this scenario; it's a very interesting case. My question is: when you open-source this model, will you release the entire model? We probably have the same issue: we operate in a private domain and do not want to use ChatGPT, since the major problem is that they will have our data. So will you share the model so we can try it ourselves? These are certainly very early days, and I really applaud what you did; I think there are a lot of applications, including networking, where many operations can be solved by this, and LLMs have a place in reducing human effort.] The question is about our open-source plan. We are thinking about it. First, we are running experiments on open-source models like LLaMA 2 and CodeLlama. Open-sourcing the model itself is one way; another way, as the Hugging Face team mentioned in this morning's keynote, is collecting all the data: if we can contribute the data, then different people can use it in different ways. That's another option. But everything is still being planned, and you are welcome to talk with us later. [Audience question: Hi. Say I have a reader role on a Kubernetes cluster; if I use the natural language interface, do we inherit the same role-based access permissions?] So your question is about the permissions for operating the system, right? [Audience: Yeah, correct.] 
Yes. Currently, as I mentioned, touching the system itself is dangerous, so we have several guards. First, as you can see, we only tackle the querying part, so this is a read-only model. And second, you don't want to list out all your resources directly, since that would consume lots of memory and bandwidth on the API server, so we added the cache layer. Those are the two safeguards we have. Thank you. All right, so that's it. Thank you all.