So, we really want to thank everybody for the time and effort it takes to build an operator, and congratulate those of you who are about to get yours into Operator Hub. If anybody out there in the audience is thinking about building an operator and needs some help, please see me during the break and I will make sure you get that help. Without further ado, Sherard.

All right. We have microphones, we're all good to go. Everyone has a chair, except me. So, thank you, everyone. We'll just pass the microphones around, yes, we'll do that. I think this is going to be a really fun discussion. We'll talk about operators and the process that everyone has gone through with operators. With that said, let's kick things off with a round of introductions. You're familiar with me, and some of the members here you've heard from as well. Maybe just give us a couple of sentences about who you are, what your products do, and where they fit in this whole operator story. Sunny, you want to start?

Thank you for the invitation. I'm Sunny, co-founder and president of ProphetStor. Briefly, our solution provides visibility, using machine learning to learn about the user's application workload and then come up with optimizations for resources like CPU, memory, or even GPUs supporting your workload.

Great. Gopal?

My name is Gopal Krishnan. I'm the VP of Engineering and Customer Support at CognitiveScale, and we make our customers, primarily in financial services and healthcare, successful with AI with our product Cortex, which allows them to build solutions in a predictable way and to make sure those solutions are trusted. Earlier we heard about ethics; we embed a lot of concepts to make sure the systems are trusted and can go through some level of compliance. So that's our product offering. No, I'm getting too much. Too much. Let's try this one, shall we?
Hey, I'm Ryan. I'm a developer at Seldon. We make it easy for data scientists to deploy machine learning models to Kubernetes and serve predictions as HTTP requests. We provide out-of-the-box features for observability, like monitoring, tracing, and metrics, but also features like making it easy to define rollout strategies with traffic splitting and to define inference graphs, so you can have reusable steps within the graph and do more complex kinds of deployments like multi-armed bandits. Yeah, that's me.

Thank you. I think I have my microphone. Hi, I'm Azadeh Khojandi. I'm a senior software engineer at Microsoft, and I'm involved with the Azure Databricks operator that you saw in the demo previously.

Hi, I'm Kamil Bajda-Pawlikowski. I'm CTO and co-founder of Starburst Data. Starburst is the commercial arm of an open source project, Presto, which does distributed SQL analytics across many different data sources.

Hey, everyone. My name is Pramod. I'm a senior product manager in our data center business unit at NVIDIA. We make GPUs. I product-manage CUDA, which is our GPU computing platform, and we're building both hardware and software that enables AI, ML, and data analytics workloads, and many different types of use cases that will be accelerated using GPUs.

Great. Thanks, everyone. We talked a lot about Kubernetes, and we talked a lot about operators. Kubernetes is taking the world by storm, and operators, in my opinion, are very close behind it. Each of you has a different product: some of you may have started off in the Kubernetes or container space, some of you may have traditional applications. At what stage did you start to think about operators and Kubernetes as something your company really needed to invest in? Maybe, Kamil, we'll start with you and Starburst.
Yeah, so we were obviously tracking Kubernetes for some time, but about a year ago, I would say, we realized we needed to scale Presto deployments to more and more environments. Traditionally, that was on-prem and maybe on AWS, but now you have to go to Azure, Google Cloud, different environments. Kubernetes is a great framework for running Presto, because it's a complex system, potentially running tens, hundreds, maybe up to a thousand machines. Managing all of that is really, really hard. So bringing uniformity across different environments and different deployment mechanisms is really important. And we chose operators as the framework to implement our Kubernetes integration.

Great, great. Gopal, maybe you can shed some light on that.

It's been a long journey for our company, which has been in Austin for five or six years now. Early on in the genesis of the company, we were trying to leverage container technologies as well as the orchestration mechanisms. Several times during the journey we evaluated Kubernetes, and it kind of fell short for one reason or another; the timing was not right. Last fall is when we made a concerted effort to embrace Kubernetes, and we actually worked with Microsoft to render our product on Azure. That was a very positive experience for us. And then we quickly learned that the investment we made in Kubernetes was rewarding us by allowing us to go to the next set of cloud platforms, like EKS, Red Hat OpenShift, and GKE. So we are able to evolve and rapidly satisfy our customers' needs to go on-prem, on their own cloud, or on our cloud. It was a journey, and we found it a very rewarding one at this point.

Do you find that building operators is something your customers were asking for? Does it make things more accessible to your customers?

Okay, let's talk about operators. We heard about operators during the journey itself.
And right now, we have experimented with operators, and our customers are moving a little more slowly than we want them to. That's for various reasons: they are not there yet with their own internal organization and selection of technologies to move forward. So we are kind of held back. Being a small company, we can't just be experimental about everything, so we are driven by our customers' priorities, and we are very closely monitoring and pushing them toward operators. We find operators a very welcome addition to Kubernetes because of the promises they make: maintenance of software, upgrades, and moving toward at least phase two, if not all the way to phase five, at this point.

Sunny, what about you with ProphetStor?

Since our solution is built to help customers do cost and resource optimization, it was quite natural for us to use the Operator Framework for our solution. Once a customer asks how to install a solution to help them do cost optimization, we tell them it's an operator, actually a fully certified Level 5 operator on OpenShift, and then there are no further questions asked.

And Pramod, with the work you're doing at NVIDIA, in terms of what customers are looking for with operators, how does that fall in line with what you're doing?

So we started working on an operator for GPU deployment with Red Hat a couple of months ago. We've been building with Red Hat a GPU operator that basically helps us provision GPUs. GPUs are special resources within Kubernetes, and bringing up a GPU can be fairly complex, because we need to install user-mode drivers and kernel modules, and then we need to install device plugins to have these resources exposed to the Kubernetes control plane. So setting up Kubernetes and deploying GPUs is fairly complex.
And even once you deploy it, GPUs are throughput machines, which means you need to constantly feed the GPU with data so that it's operating at full throughput capacity. So the GPU operator that we've been building with Red Hat essentially automates the management and deployment of all of these GPU components, the device plugin and the monitoring stacks, automatically. It's been pretty cool technology for us, because it lets users deploy these GPU worker nodes very easily. We've also been looking into using operators to do other things now. Specifically, MPI jobs can be fairly complex, because you need to have all of these worker nodes participating together. We're also looking at using GPUDirect technology, which is our technology that allows GPUs to communicate with NICs and storage directly without having to go through the CPU. So operators will also play a big part in doing things like RDMA and other technologies like that. It's pretty exciting for us when we look at operators.

Great, great. It sounds like it does a lot of the automation of what would otherwise take an engineer a lot of time to do.

Absolutely, yeah.

Great, fantastic. So, Oz, can you maybe tell us a little bit? You talked about the data science experience. Where do you see operators easing the adoption of AI and data science workloads?

Adoption of AI, okay. So basically for us, because we work with customers, each customer has different needs and wants. In terms of adoption of AI, if I understood your question correctly, how they use it for their workloads: we started last December with one of the biggest supermarket chains in Australia, which wanted to have their training pipelines on Kubernetes. So we can see the trend from our customers: they like to have operators and to use Kubernetes in their AI space. We can see a trend in it.
And Ryan, do you run across a lot of customers looking to do ML and AI with Seldon, and does the Seldon operator do anything for them in that space?

Yeah, certainly. We chose to use an operator because it allowed us to condense the serving specification in a very neat and expressive way. And it's important that we can express that in a way that the data scientist can understand and control, so that the data scientist is empowered to put together that serving specification and run it without having to wait for the ops team to figure out how to get something up and running.

Great. And in terms of developing operators, what's the difference between, let's say, building out an application that would be deployed using Helm charts or Ansible versus having that deployment controlled by the operator itself? Maybe Kamil or Oz, if you want to talk about the differences there.

Yeah. In our case, I guess we didn't do Ansible or Helm charts before. However, we developed a custom framework to deploy Presto on a big cluster, and that was a lot of scripting, a lot of Python, running things in parallel. You have to maintain that code and deal with all the failure cases — what if some machine doesn't upgrade properly? — and it's just really, really difficult to do all of this manually. When we decided to move to Kubernetes, most of those things were provided by the framework. You still have to instrument the framework and orchestrate all of this, but you don't have to deal with individual machines in a cluster coming up or down, or being misconfigured, or all those different problems. So that was a vast simplification of how much code we have to develop and provide to our customers versus what's guaranteed by the framework. You can ensure high availability. You can ensure things like autoscaling with the horizontal pod autoscaler.
You can just focus on instrumenting things so that your system operates very well, rather than solving basic problems at the infrastructure level.

The way I see it, Helm is for packaging the desired state, and operators are responsible for making it happen. So I'd say they are two separate concerns: Helm is for packaging, and operators are for making it happen, or for monitoring and making sure that the desired state is always there.

Okay, and Ryan, maybe your experience from the developer side?

Yeah, so I guess in our particular case, because we have the option to create a complex graph of steps — where you have pre-processing, then you hit the model prediction, and you might have some sort of post-processing step, or there might be more complex graphs than that — to capture that, it made sense to do it in a custom resource and then have the operator fulfill it. I think it would have been quite a burden to put it onto the users to have to figure out how to create the right resources using a Helm chart or something. It's the logic that we want to encapsulate in the operator.

Great.

So one of the things we are finding out with large customers is, however much we want them to just absorb changes and put them into production right away, that's not happening. There's always change management, windows for maintenance — all of them come into play, and we operate with a very low tolerance for anything going wrong. And in the AI/ML world specifically, you've got to change the algorithm probably more frequently, as it learns and updates your model. So you may be deploying new versions of the model every couple of weeks, and sometimes maybe every week, depending on the sensitivity of the problem.
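The split the panel keeps returning to — a packaged spec declares the desired state, and the operator's control loop continually reconciles the running system toward it — can be sketched in plain Go. This is a simplified illustration only: there is no controller-runtime machinery here, and the `State` type and action strings are hypothetical, not any panelist's actual implementation.

```go
package main

import "fmt"

// State is what a chart or custom resource spec declares (desired)
// and what the cluster currently runs (actual).
type State struct {
	Replicas int
	Image    string
}

// reconcile compares desired vs actual and returns the actions a
// control loop would take to converge them. In a real operator this
// logic runs inside a Reconcile() callback every time the custom
// resource or one of its owned objects changes.
func reconcile(desired, actual State) []string {
	var actions []string
	if actual.Image != desired.Image {
		actions = append(actions,
			fmt.Sprintf("roll image %s -> %s", actual.Image, desired.Image))
	}
	if actual.Replicas < desired.Replicas {
		actions = append(actions,
			fmt.Sprintf("scale up by %d", desired.Replicas-actual.Replicas))
	} else if actual.Replicas > desired.Replicas {
		actions = append(actions,
			fmt.Sprintf("scale down by %d", actual.Replicas-desired.Replicas))
	}
	return actions
}

func main() {
	desired := State{Replicas: 3, Image: "presto:330"}
	actual := State{Replicas: 1, Image: "presto:329"}
	for _, a := range reconcile(desired, actual) {
		fmt.Println(a)
	}
}
```

Because the loop is level-triggered rather than a one-shot script, a weekly model update is applied the same way as the initial install: edit the spec, and the operator converges the cluster toward it.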
When you try to do that without the formalism and simplification of something like the Operator Framework, you do a lot of dotting the i's and crossing the t's in order to get through the change management windows, and this, we believe, will minimize the risk.

That reminds me — I just recently went to a doctor's office, and they're still on Windows XP. You get this change management situation where there's not enough value to move off of it, but in the meantime your company has to support older versions. Maybe, Sunny, that's something you want to dive into.

Totally, I agree. It makes changes and updates, particularly for machine learning solutions, much easier.

Yeah, especially with the rapid iteration of these changes — there are things coming out all the time. And I guess another added benefit is that if you do have a platform like OpenShift and the customer has many different environments, they can upgrade all of those environments maybe in a single pass. Maybe, Sunny, if you want to go into what that means for your customers as well.

Yes, because with machine learning models you have to keep updating, sometimes as frequently as weekly. Once you have these operators, it's much easier for customers to understand what it means to get an update, right? Customers don't want a lot of updates too frequently, and operators make it so much easier.

So, we've talked a great deal about the benefits of operators. Maybe someone wants to volunteer and talk about the challenges you've faced building an operator. I imagine, just like with any new technology, there are going to be some ebbs and flows, ups and downs. Does anyone want to take a first stab at some of the complexities they've run into?

Yeah, I can go next. You go first. Well, actually, I was going to pick up on a point that you said earlier about the variety of tools that are out there.
It's a space that's moving so quickly. Actually, not so long back, Seldon's operator was written in Java, and we realized the Java support really wasn't keeping up with the changes, and we had to switch to the Kubebuilder stack.

I could talk about challenges for about an hour — it's a different talk — but at a high level, one challenge is the build-and-test pipeline. Imagine that our operator basically creates resources on Azure. Now we are open source, and you want to run integration tests on the PRs. How do you make sure it's secure, so that someone doesn't just provision a ten-node cluster through the build pipeline? Because our build pipeline is part of our GitHub repo. So one challenge is around that. Another challenge is the versioning of the CRDs and the specs that you are saving between different versions of the operator. If you change the model that you're saving — I mean the data model, not the machine learning model — how do you keep the versioning between two versions? That's another challenge.

Okay. And Ryan, you mentioned a lot of different frameworks, a lot of different ways of doing it. What has the experience been like for those of you using the Operator Framework or the Operator SDK?

We used Kubebuilder, but we looked into the Operator Framework; it was quite similar. We have a team of SMEs that we chat with before we pick, and then we choose the framework based on the skill set of our customers and our team. We picked Kubebuilder because we found it easier to use. But functionality-wise, it's exactly the same.
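The CRD-versioning challenge Azadeh raises — a stored spec whose schema changes between operator releases — is usually handled by converting between API versions. A minimal sketch in Go, with entirely hypothetical type and field names, just to show the shape of the problem:

```go
package main

import "fmt"

// Hypothetical v1alpha1 spec: a single cluster-size field.
type ClusterSpecV1Alpha1 struct {
	Size int
}

// Hypothetical v1beta1 spec: the field was split and renamed — exactly
// the kind of change that forces a conversion between stored versions.
type ClusterSpecV1Beta1 struct {
	MinReplicas int
	MaxReplicas int
}

// convertAlphaToBeta upgrades an object stored under the old schema to
// the new one. A fixed cluster size maps to an equal min and max.
func convertAlphaToBeta(old ClusterSpecV1Alpha1) ClusterSpecV1Beta1 {
	return ClusterSpecV1Beta1{
		MinReplicas: old.Size,
		MaxReplicas: old.Size,
	}
}

func main() {
	fmt.Printf("%+v\n", convertAlphaToBeta(ClusterSpecV1Alpha1{Size: 5}))
}
```

In a real operator, conversion logic like this typically lives in a conversion webhook, so the API server can serve both versions of the CRD while persisting objects under a single stored version.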
And then we even realized, when you look into the docs, that we can easily use Kubebuilder and the CRDs to push into Operator Hub, so there is no difference in terms of the generated result.

That's great. And the great thing is, there are tools out there. We also have resources in the OpenShift community to help with getting operators built and rolled into Operator Hub, using the Operator SDK and the frameworks. That's great.

What about those of you — I think some of the panel members have certified operators — does anyone want to speak about the process? First off, why did you want to go the route of having a certified operator, or are you currently going through the process? And let's talk a little bit about what certification means in this case for your customers.

I'm happy to say something about that. I think for us, it's just really cool that people who are using OpenShift can just click the button and they get the operator installed. It's a very obvious win for us to make installation that easy. But it also gives a lot of confidence to the customers running on OpenShift, because they know this is a supported operator that they can work with and that has been through that process. Regarding the process of getting certified, the support we got from Red Hat was amazing. Trevor from CHI in particular was amazing. I'll let him know.

Kamil?

Yeah, I would second that. I think the best part of the experience was the help from the Red Hat team in terms of testing, troubleshooting any issues, and providing infrastructure to certify our Kubernetes integration on OpenShift. That was a great part. The process is obviously fairly new, so it has some rough edges, but with that extra help it was a really, really smooth experience. And yeah, I definitely have some notes to share to improve the process for future generations of certification.
But I think it's really good, and we are actually successfully there on Operator Hub right now. And that's really important for some of our customers, obviously, right? If you go to a bank or a big enterprise, the fact that it's certified on OpenShift — the platform they're actually using for Kubernetes — is an immediate win.

Sunny?

I just want to reiterate what the other members said: first of all, the confidence it gives customers, and the certification process was made so much easier because of the support from Red Hat. We did it in a very short time period — of course, my team did it, but they tell me that Red Hat has been a great partner.

Great, great.

I have a different perspective: we haven't been certified yet. So I'm encouraged to hear that Red Hat plays a very supportive role in the certification process. For us, having a certified operator is very important as we scale with Red Hat into enterprises, because we have a lot of components that are managed by the operator. Having a certified operator will give customers real reassurance that there's support behind it. So I'm looking forward to the certification process.

And for those not familiar with the certification process and what it means: the great thing about it is that it gets all of the components of the operator onto the same type of operating system. In this case, we use UBI, which is Red Hat's way of saying this is a certified operator that we'll support. But you also get support from the partners as well — you have that dual level of support, both from Red Hat and from the partners.

All right, so let's talk a little more about operators and AI. The phase-two operator was briefly mentioned before; if you're not familiar, you can go to Operator Hub and see more information about the different phases.
But really, one of the high-level ideas here is thinking about how AI and automation can be taken to the next level with operators themselves. We talk about that as a phase-five, autopilot operator. So maybe someone wants to discuss whether they have plans to achieve that autopilot status and maybe bake some machine learning into their operator — give some thoughts on what that might look like.

Well, I suppose we actually do something similar. We have a way to enable sort of data-science functionality in the custom resource. You can, for example, turn on an outlier-detector component, and that will then run alongside your model. And then if a particular data point is marked as an outlier, you know that you might get a lower-quality prediction for those particular records. We're looking to do more in this kind of space as well, like automatic detection, maybe for concept drift. But I don't know whether the autopilot would be more along the lines of having AI embedded in the operator, in the actual state reconciliation itself, if that's what you had in mind.

That's actually it. Yeah — using AI in the sense of managing and deploying your application. So maybe, Gopal, if you have thoughts there.

Yeah. There are a couple of dimensions to this. One is, I think, what your question points at: in the operator framework, you are doing a lot of data collection, a lot of performance telemetry, and so on. You can definitely use that — wherever there's data, there's always an opportunity to learn and make it easier for people to operationalize some of these decisions. At CognitiveScale, we have just released a product called Certifai — with "ai" at the end instead of "y."
What it does is allow models to be checked at various points in their life cycle to get a score. We call it the ATX score, which is a combination of explainability, robustness, and bias, combined with data- and compliance-related scores. Our intention is to migrate toward chaining that to the operator framework, so that when models change — either at the time of change management, while they're in the process of change management, or while they're functioning — we have a way to measure their dimensions, report the outliers, and be able to initiate remediation activities. That's what we are hoping. We'll get there as soon as we can.

Great. So I think we have a few minutes left here, right? Let's just go down the line, starting with you, Sunny: what's next? What's next for your company in terms of operators and what you're working on?

Well, I will give a bit more detail in my presentation this afternoon. I want to say that what's next is to analyze more and more data — metrics, performance data — all collected from various kinds of sources on Kubernetes or even lower-layer platforms.

Great.

Yeah, for us, it's more about aligning with our customers, large banks and healthcare companies, and making sure we are going hand in glove with Red Hat, Kubernetes, and OpenShift, and doing whatever they ask us to do to be credible as well as operable in their environments. So that's our next goal.

Yeah, so we've recently added — and are in the process of building out a bit further — the user interface around Seldon, so that you can see your deployments more clearly, get visibility in one place of all the metrics and the behavior of your deployments, and make it even easier to initiate new ones with wizards to generate the custom resources.
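The outlier-detector component Ryan described earlier — a check that runs alongside the model and flags data points far from the training distribution — can, at its very simplest, be reduced to a z-score test. A toy sketch in Go; the threshold and statistics here are illustrative assumptions, not Seldon's actual detector:

```go
package main

import (
	"fmt"
	"math"
)

// isOutlier flags a data point whose z-score against the training
// distribution exceeds a threshold. Predictions for flagged points
// can then be marked as potentially lower quality.
func isOutlier(x, mean, stddev, threshold float64) bool {
	if stddev == 0 {
		return false // degenerate distribution: nothing to flag against
	}
	return math.Abs(x-mean)/stddev > threshold
}

func main() {
	// Feature with training mean 10 and stddev 2; flag beyond 3 sigma.
	fmt.Println(isOutlier(25, 10, 2, 3)) // far outside the distribution
	fmt.Println(isOutlier(11, 10, 2, 3)) // a typical point
}
```

Production detectors (and drift detectors) model the joint distribution of all features rather than thresholding one at a time, but the operator-level wiring is the same: a sidecar component scores each request alongside the model.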
Another thing that we're working on more closely in the operator space, actually, is a collaboration with the Kubeflow project to do model serving — to serve HTTP predictions, but in a serverless way based on Knative, so you can scale to zero and make more use of the underlying infrastructure resources.

In terms of the Azure Databricks operator, we are on a journey to get to Operator Hub and get our certification. And, spoiler alert, there will be some other announcements related to operators at KubeCon from the Microsoft side. I can tell a little; I can't tell more.

We won't tell.

Yeah, but watch this space. We'll be there anyway. There will be some announcements.

So I think after the successful release of the initial version of our operator, we want to expand and take more advantage of the native Presto features that we are now developing in conjunction with the operator framework's capabilities. For example, in a Presto environment you have multiple nodes running, and sometimes you have spikes in load, sometimes the load is lower, and there are many different ways to improve the performance of the cluster. Sometimes you have to add more nodes for some time, and that may be based on CPU, which we support today; but sometimes you simply need more memory, right? Your workload just requires more memory, otherwise it won't finish successfully in a short time. So we want to add that capability and tap into the various metrics we can control. Another thing is security configuration: especially if you're dealing with multiple different data sources, it's often a tricky thing to control, and simplifying its configuration and change management through the operator mechanism will be the next thing for us to do.

From NVIDIA's perspective, there are two things I can think of. One is that we operate a fairly large Kubernetes cluster internally to serve the company's compute needs.
There, we offer our users an MPI operator to manage their MPI batch workloads for training. And increasingly, we are trying to use Kubernetes to do multi-node deep learning training, because as the models get extremely complex, you have billions of parameters to train. So we are looking to use Kubernetes for multi-node training, and I'm pretty sure there's going to be more operator work there as we try to get all these nodes cooperating to do deep learning training. Then, on the other side of the spectrum, we're also working on edge use cases. There are going to be a lot of tiny GPU-accelerated devices doing inferencing at the edge — for video analytics, smart cities, and so on — where those applications are already fairly complex, and we're building operators to be able to deploy those applications at the edge as well. So I think there's a big spectrum of use cases that we'll be working on fairly soon.

Great. Fantastic. Although each of you has a vastly different product, the nice thing is there's a single thread of making things easier and enabling customers. And when you think about OpenShift, its foundation, and why it's so popular, that's at the core of it. I think operators are following right along and making things much easier for customers. So I want to thank you all for participating. Hopefully that was really good information, and if you have questions about the operators, feel free to ask them.