Hi, everyone. Welcome to this talk about DevOps for Ansible Lightspeed with IBM Watson Code Assistant. My name is James Wong, and I'm with Red Hat. Before we start, I'd like to introduce myself a little bit. That's my Twitter handle. I'm fascinated by two things. First of all, things of scale: systems that process huge amounts of data or requests with sustained performance — those things fascinate me. The second thing is that I love automating things. I hate seeing people do tedious tasks and make mistakes because they have to do that kind of boring stuff. And from another aspect, that is also about scale: I love to scale up teams. Once you get rid of all the tedious, boring, error-prone work, the team can move much faster.

Now, okay — so that means I need to raise my voice, all right? Is it better? All right, cool.

I need to know a little bit about you, too, so that hopefully I can adjust my presentation to the audience here. How many of you are software developers — you write applications? Okay, good. Any automation engineers? All right, cool. I see some hands raised both times; those people are DevOps people, I think. Anyone who writes Ansible playbooks? Also good. Anyone who writes Terraform, Puppet, Chef, things like that? Yes, okay, cool. Any machine learning engineers here? Okay. Any data scientists who train models? All right, cool. Nice.

This is the agenda. I'll share a little bit about what Lightspeed is, and then we'll go through the high-level architecture. Then we'll talk about the platform that runs the service, and the pipeline we built to carry changes from PR all the way to production. Lastly, we'll talk about lessons learned and some highlights, and hopefully we'll have time for Q&A. Make sense? Yeah, thanks.

Disclaimers: we'll focus on the engineering perspective here, and only on the stack that is serving the forthcoming free technical preview. So anything I say about future architecture changes, product updates, or feature launches — do not trust me.

What is Ansible Lightspeed with IBM Watson Code Assistant? It is two things: one is Ansible Lightspeed, and the other is IBM Watson Code Assistant. Combined, it's an AI service, and it does two things. One is to help users — all our automation engineers — build automation code with consistency and with quality. The second thing is not obvious but actually very important: we do content matching, meaning we try to credit those who contributed the code this model has been trained on.

What I'm going to do is a quick demo to show you these two functions of Ansible Lightspeed. What you'll need is VS Code with the Ansible plugin, a GitHub account, and you also need to sign up for the closed beta. Has anyone here signed up for the closed beta? Oh, awesome — then you probably already know what I'm about to do.

All right, let's go to the demo. This is a live demo, and I hope everyone will be able to see the screen. Okay, that's fine, that's fine — I can stand. So here is the Ansible plugin. After you set up the plugin, you go to the settings, type in "lightspeed", and it will filter out these three options for you. Enable these two and also type in the API endpoint here. Then you go here, you'll see this Ansible icon, and you click on it.
You have a connect button. Go through the normal login process — it's a GitHub login — and authorize it. Then you go back to your VS Code and open it up. Now you can see here that it shows me as logged in as James. You can take a look here; you can also sign out from here if you're done with it or you don't want to keep it. Once that's done, you can actually start using Ansible Lightspeed code recommendations.

I pre-baked an empty template file here. If you check the status bar, you'll see that Ansible is detected for this YAML file, and there is a Lightspeed icon here. You'll see this icon spinning when it's doing the inferencing. So here is a simple task, let's say: I want to create an AWS S3 bucket called foo. Hit Enter. Sorry, the font? Oh, yeah, sure. Better? Don't ask for too much. Okay — see, it's already up here; you distracted me, sorry. You see the content right here, right? The recommendation comes back, and it's up to the user to accept it or not. If you want to accept it, hit the Tab key, and then it's in.

For the next one, I vary the wording a little bit and name the bucket bar instead. I want to show you that this is a language model that can interpret the request even though you vary the language a little bit. Again, see the spinning Lightspeed icon — oh, it's very fast, I can't even move my cursor to it. You accept it, and pay attention to the lower window here: this is the content matching output. It shows you that the model recommendation is based on the inputs from these three entries here.

Now, if I come to another example — mount the volume to media shared — pay attention here, because this is going to change after I accept it. Okay, let me hit the Tab key. Oh, did I miss that? Sorry, let's do that one more time. Okay, I accept it, and you see this gets refreshed, so it changed. If you open the entries up, they give you the URLs where this content comes from. The network is a little bit slow, but you get the idea.

So, yep, that's the demo of Lightspeed itself. I hope you find it useful.
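To give a concrete idea of what a returned suggestion looks like, here is a rough sketch of the kind of task the service can propose for the S3 bucket prompt above. The exact module, option names, and values in a real recommendation come back from the model, so treat this purely as an illustration.

```yaml
# Sketch only: one plausible suggestion for the prompt
# "Create an AWS S3 bucket called foo".
- name: Create an AWS S3 bucket called foo
  amazon.aws.s3_bucket:
    name: foo          # bucket name taken from the task description
    state: present     # ensure the bucket exists
```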
Now let's go through the high-level architecture. I'll walk you through the two workflows we just demoed. The first one is code recommendations. Starting from the left-hand side: the user is in VS Code and types an Ansible task description — that's step zero. You know why it's step zero? Because I forgot to put this step in the diagram, and I hated to shift all the other numbers, so I numbered it zero. You enter the task description and hit Enter, and the plugin sends a request to the API endpoint. This API endpoint first talks to the Redis cache and to the Postgres database to do some housekeeping work — for example, user validation and rate limiting, because this is going to be a free service and we want to make sure everyone gets a fair share of the platform, among other things. After the housekeeping work is done, it forwards the request to an inference service, which runs the IBM Watson runtime serving the models. From there it returns the recommendations, and the API does some more post-processing work. One part of that is anonymizing the input and output, which we collect into a data analysis service. We use those to analyze the inputs and outputs, and potentially to further retrain the models for better recommendations. Then we return the recommendation back to VS Code, which is what you saw showing up there, and it's up to the user to decide whether to accept it.

A little bit more on this API endpoint for code recommendations. One of the post-processing steps we do here is running a service called Ansible Risk Insight. Basically, it's a set of rules that we run against the recommendation, because we want to make sure the code is consistent and of a quality that Ansible can stand behind. I think this is something that stands out, because we have this huge Ansible community, and we have the knowledge of how an Ansible playbook or task should be written to manage risk and guarantee good quality.

The next workflow is content matching. After you get back the recommendation, if you hit Tab and accept the result, the VS Code plugin talks to another API endpoint, which we named attributions. It does the content matching: it encodes the input and output and uses that encoding to query an Elasticsearch index we built, and that returns what you saw on the screen — the matched content, attributing, or trying to credit, those who contributed that code.

Okay, so those are the two workflows. You've seen that we have API services, we have model serving behind the scenes, we have a database and Elasticsearch, and we have some out-of-band services that help us with event processing and analysis.

Now, coming to the platform. This is the platform that runs the service. Down below is OpenShift, and it's a managed service running on AWS — we call it ROSA. On top of that, we have Red Hat OpenShift AI; the upstream of this product is Open Data Hub. Anyone using Open Data Hub? All right, cool. Among other things, the component we are mainly using in this stack is KServe, and it runs the IBM Watson runtime — the platform allows us to plug in different runtimes, whichever matches your usage. The other component here is Red Hat Advanced Cluster Security. The first thing we want to make sure of is that we are running on a secure platform, and this service scans the cluster for vulnerabilities and sends us alerts if any are discovered. On top of all that, we run Ansible Lightspeed.

Why did we pick these components? Take a guess, anyone. Because what? Because Red Hat? Yes, right? Okay, that can be part of it, but any others? Okay — it really depends on your team. Our team is small, and we want to keep the team lean and focused on building the service. So we look for system components that are managed and supported. We believe in letting the people who are good at those things do them the best way they can, and we focus on the things we know better. The second criterion is that we want something that's hybrid-cloud ready. Today we offer this as a cloud service and we own everything, but later on we may decide to have an on-prem offering — some customers may want to run the whole stack in their own data center. So we picked components that fit these two criteria. Make sense? No? Yes?
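As a rough illustration of how a model gets registered on that KServe serving layer, here is a minimal InferenceService sketch. The resource name, model format, runtime name, and storage location below are made-up placeholders, not the actual Lightspeed configuration.

```yaml
# Sketch only: a KServe InferenceService pointing at a custom ServingRuntime.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: wisdom-model                 # hypothetical name
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch                # placeholder model format
      runtime: watson-runtime        # hypothetical ServingRuntime registered on the cluster
      storageUri: s3://example-bucket/models/wisdom   # placeholder model location
```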
Okay, so now for the platform, I want to talk a little bit about the CI/CD side of it. There are three major components. The first one is Ansible, and we mainly use Ansible for infrastructure creation — for example, creating the cluster, the VPC, the network, the RDS instance, the Redis cache. We use Ansible to provision those. The second component is GitHub Actions, and that is mainly for CI: the PR checks, unit testing, static analysis, things like that. The last one, which is a very key component in our whole stack, is Red Hat OpenShift GitOps; the upstream of it is Argo CD. We love using Argo CD — it's one of the components that bridges everything from PR all the way to deployment into production.

Now I'd like to talk a little bit about the pipeline. The pipeline carries a change from a PR all the way to production. We try to shift as much of the testing to the left as possible. At the PR stage, we want to do as much full testing as early as possible, so at the PR stage we want a full-stack deployment that is as close to production as possible. We also want to automate the process of setting that up, so that at the PR stage either the developers or the QE engineers can already start testing the PR in its full-stack integrity. Then by the time someone clicks to merge the PR, we are highly confident that once it's merged into the main branch, the main branch is still production ready. That's our goal.

Here is the beginning of the pipeline, the PR check. Each PR goes through what you would normally do: unit tests, code coverage, static analysis. If it passes, it builds the image and pushes it to Quay. Once that's done, it triggers a deployment to a dev cluster, and that deployment contains the whole stack of the architecture I just showed you: the Lightspeed service, Postgres, Redis, and the Watson runtime behind it. Then we can carry out tests — you can do ad hoc testing, run your automated tests against it, or use this stack to develop your test suite as well.

That brings us to the whole pipeline from end to end. On the left-hand side is the PR check step we just talked about in a little more detail. Once the developers decide, hey, let's merge it, it's merged into main, and that builds a release candidate image and pushes it to Quay. When that's done, it triggers the deployment to the staging cluster. This staging cluster is almost a mirror of the production cluster. Once it's deployed, we have two sets of tests. One is the post-deployment testing — the regular test suite hitting the full stack. The second one is performance testing; that testing aims to make sure that whatever changed won't change the performance profile of the stack too much, or surprisingly. Once these are all passing, we deploy to the production environment. Again, in the production cluster we have a production test suite hitting it, and we also have performance testing hitting it.
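Since Argo CD is the piece that bridges Git to each of these clusters, here is a minimal sketch of what an Argo CD Application for one environment could look like. The repository URL, path, and namespaces are hypothetical placeholders, not the real deployment repo.

```yaml
# Sketch only: one Argo CD Application per environment, synced from Git.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: lightspeed-service
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/lightspeed-deploy.git  # placeholder repo
    targetRevision: main
    path: overlays/staging            # e.g. a Kustomize overlay per environment
  destination:
    server: https://kubernetes.default.svc
    namespace: lightspeed
  syncPolicy:
    automated:
      prune: true                     # remove resources deleted from Git
      selfHeal: true                  # revert manual drift on the cluster
```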
Some of these steps are not automated yet — for example, the performance testing is still a manual step at this moment — but we are trying to automate as much as possible moving forward.

Now, a few of the highlights and lessons we've learned. The first one is: to automate or not to automate. The team was established at the beginning of this year, and we really had just a few months to productize this. We had a demo last October, I think, at the Red Hat summit, but at that moment it was really just a demo. Over the last six months, or less than six months, we tried to productize it. As a purist, you want to automate everything — like I said, I love to automate everything. You want to keep everything as code: infrastructure, testing, your processes, your deployments, everything in code. That's lovely, that's great. But purists also have to think about something else: we have to make it to market. Time to market is a pressure, and a lot of the time that value trumps the ideal system you want to create. You want a minimum viable product, an MVP, created as soon as possible, and then you want to get it out there, let people start testing it, and collect feedback, so you can adjust and pivot.

So for us it isn't really about whether to automate or not; it's about when to do what. We set our priorities. The first things we focused our engineering effort on automating were the daily, frequent operations, because that work can drag the team down, spending precious effort on things done daily or frequently. These are things like the testing framework, the PR checks, the deployment mechanisms — those we wanted to automate as soon as possible. The second item on the priority list is security: for example, secret management, secret rotation, vulnerability monitoring. All of those are also at the top of the priority list. Don't take me wrong — we didn't do all of this in the first or second sprint, no. We made progress every sprint, creating features and adding these automations along the way.

There are other tasks we decided to postpone, for example automating the infrastructure creation. These things are not done frequently — creating the OpenShift cluster, setting up the VPC or RDS — we don't do them every day or every sprint. We decided to carry them out with manual steps at the beginning and document them. Once we have the documented runbook — you saw we have multiple clusters — we let different team members go through the manual steps to validate that they work. These documents are useful because later on, when you actually want to do the automation, those manual playbooks are exactly what you will target for automating.

What we are aiming for is getting to a state of ephemeral production environments: we want to be able to click a button, create a cluster, deploy all the components and the service, and then throw away the old cluster once we're done with it. Or we want to exercise DR, things like that. Or we want to carry out a blue-green deployment at the level of the whole system architecture.
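As an illustration of what automating that documented runbook could eventually look like, here is a rough Ansible sketch using the amazon.aws collection. The resource names, CIDRs, and instance settings are placeholders for illustration, not our actual infrastructure code.

```yaml
# Sketch only: turning the manual infrastructure runbook into a playbook.
- name: Provision Lightspeed infrastructure (sketch)
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Create the VPC
      amazon.aws.ec2_vpc_net:
        name: lightspeed-vpc             # placeholder name
        cidr_block: 10.0.0.0/16          # placeholder CIDR
        region: us-east-1
        state: present
      register: vpc

    - name: Create a subnet in the VPC
      amazon.aws.ec2_vpc_subnet:
        vpc_id: "{{ vpc.vpc.id }}"
        cidr: 10.0.1.0/24
        region: us-east-1
        state: present

    - name: Create the Postgres RDS instance
      amazon.aws.rds_instance:
        db_instance_identifier: lightspeed-db   # placeholder identifier
        engine: postgres
        db_instance_class: db.t3.medium
        allocated_storage: 20
        username: "{{ db_user }}"                # supplied from a vaulted variable
        password: "{{ db_password }}"
        region: us-east-1
        state: present
```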
Now, the second lesson I'd like to talk about is the machine learning service. Model testing is quite different from regular application development. I have helped teams move their whole pipeline to continuous deployment — where, when an engineer clicks merge on the PR, it goes all the way through staging deployment and testing, and if the tests are all good, it is automatically deployed to production and serves end users. I've done that before. But machine learning is quite different. The testing is fuzzy; it's not easy to call it black and white — "oh, this test came back positive, that's good, let's move to production right away." A lot of the time we found that a new model version can be one step forward on certain aspects and two steps backward on some other aspects. So we have to make a judgment call: is it worth moving this to production or not? It's not easy to decide with a test suite — or at least not yet. Let's put it this way: maybe it's our lack of understanding of the nature of the problem, and of a solid logic for making that decision.

The approach we adopted is to have multiple models deployed simultaneously, and then we use feature flags and blue-green and canary deployment models for that. At the beginning, we had only one Argo CD application that contained every component you saw in the architecture. As we moved along the development cycle we figured this out and broke it up into multiple applications, so the individual components can evolve independently and also scale independently.

Also, for a machine learning service, one thing we learned is that we have to clearly identify which workloads are GPU-bound and which are CPU-bound. Once you identify that, you can effectively pin each workload to the corresponding compute node type — GPU nodes are expensive, CPU nodes are much cheaper — and you want to make sure you're running on them effectively. You also want to understand the runtime a bit more: GPU batch processing and time slicing, how effective and efficient they really are for you, and whether they match the response-time profile you're looking for. There's a small sketch of what that pinning can look like a bit further down, just before the Q&A.

The last one is observability. It's very important. Some people mistake observability for something you only need when you are monitoring a production environment. It's not just for that: during the development and testing cycle, you want good observability into your system, because it helps you identify bottlenecks and code quality issues. The tools listed here are what we are using: for metrics, Prometheus and Grafana, and Dynatrace is coming — we're going to be adopting Dynatrace. For tracing, we use Jaeger; we actually just recently enabled Jaeger tracing, and it was done by an intern — that's pretty cool. For logs, we use CloudWatch for now. So that's observability.
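Here is that sketch: a minimal example of pinning a GPU-bound serving workload onto GPU nodes with a node selector, a toleration, and a GPU resource limit. The deployment name, image, node label, and taint key are all hypothetical placeholders; use whatever your cluster's GPU machine pool actually sets.

```yaml
# Sketch only: keep a GPU-bound workload off the cheaper CPU nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server                         # hypothetical workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      nodeSelector:
        node-role.example.com/gpu: "true"    # placeholder GPU node label
      tolerations:
        - key: nvidia.com/gpu                # placeholder taint on GPU nodes
          operator: Exists
          effect: NoSchedule
      containers:
        - name: server
          image: example.com/model-server:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1              # request one GPU for inference
```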
Now, any questions?

Sorry, I can't hear exactly — do we retrain the models? When an end user modifies the playbook, do we train the model on that? As of now — okay, let me make it clear — the model training belongs to the IBM Watson Code Assistant team, so I don't have every detail of how they carry that out, but my understanding is that they carry out the training on Galaxy content. So when you modify things yourself, it isn't directly or in real time trained on that. But we do collect some of those inputs, like I said, batch process them, and they can go into another round of training. Does that make sense to you? Okay, cool. Thanks.

Any other questions? How much time do we have now? Oh, one more question? Yes — what's the plan for the future? Okay, you're asking me to say something you shouldn't trust me on. The next step is to open this up. Right now it's a closed beta — you have to sign up. Forthcoming is a technology preview, so everyone with a GitHub ID can go in, sign up, and try it. After that, hopefully we'll have a commercial offering that will pay the bills, right? But all of this is just me talking. And, like I said, potentially some customers will say, I want this on-prem, I want to train on my own data. For example, we have Automation Hub, which is the private version of Galaxy for corporations, and they may want to train the model on their own datasets — their own code and playbooks. That could be forthcoming. Does that help you? Okay, yeah.

Sorry, just one more. At this moment, the recommendation is at the task level. Hopefully we'll get to the point where we can recommend at the whole-playbook level, and maybe later on whole modules or collections, something like that. Our product manager is dreaming that one day you just tell the system what you want and it does it for you — you don't even need to touch the playbook, it will just run it for you. That would be awesome, right?

Okay, yes, sorry. The question is: when we are typing the task description, would more detailed input help the recommendation engine perform better? I think yes, but not necessarily always, because, like I said, models are still kind of fuzzy in a way. Also, when you type the task, we do not just submit the task description, we submit context as well — for example, the beginning of the playbook and the other tasks, we submit them too. So the model judges from the context and then gives you the recommendation. That's why, if you try it with a different context, you might get back different recommendations. Okay, thanks.

Sorry — I think you were first. The question is: do we have statistics about how satisfied users are with the recommendations, and how do we compare with Copilot? First of all, I do not have the statistics, or at least I don't have them handy. We do have a gauge of sentiment: we actually let you send back a survey, not just feedback from the code completion. We do try to assess that and use it, hopefully, as input for the model training team, the data scientists. But I don't have that yet. And versus Copilot, it's quite different, because, like I said, we also try to make sure the code is up to the quality and consistency of Ansible best practices, and we try to attribute to whoever contributed that content, which Copilot doesn't do, as far as I know. Okay.

That's another one. The question is partly a comparison to Copilot — I think I covered that one — and the other part is that enterprise corporations have their own roles, so is there any chance to train on them? I believe so. I can tell you that's one of the directions we are trying to go.
You will be able to train on your own content. But as to the details of when and how that can be done — like I said, do not trust me. We do want to go that way, because you want to customize it for your own environment.

So that's all the time we have. If you have questions, you can reach me on Twitter or talk to me afterwards. I originally had a demo of standing up a PR environment using Argo CD, but I'm out of time — maybe next time. All right. Thank you.