Welcome, everyone. Welcome, KubeCon. This is a wonderful crowd, the conference is buzzing today, it's been pretty amazing, and it's great to be back. I am Michael, and this is my colleague, Varsha. We're here to talk about breaking the rules of operator development. We've both been contributors, maintainers, fans, and advocates of different pieces of the Operator Framework at different times, over quite a long period now. We've talked in the past about best practices and many of the so-called rules of how to effectively write operators. Well, today we're breaking the rules. We're turning some of this upside down for a very specific use case: the far edge, and in particular, AI at the far edge. It doesn't have to be AI at the far edge; much of this will apply to any kind of smaller-footprint, remote-device scenario, but increasingly it'll probably be hard to find edge workloads that don't have some AI involved. So with that, we're going to go through a bunch of ideas. We hope we can inspire you with some new patterns and some new thoughts. Some of it might be crazy and not apply to your scenarios at all. Some of it might apply to scenarios in the future. Some of it might set off a light bulb and you'll think, wow, that's useful. In any case, we hope you enjoy it, and Varsha is going to take it away and introduce the whole topic.

Yes. Hey, everyone. Nice to see you all here. So the topic for today is, as Michael said, breaking the rules of operator development, and most of these rules we ourselves have probably advocated at previous KubeCons. So, yeah. Before breaking the rules, let's understand our use case better. For the sake of this presentation, we are going to deal just with the far edge, because the complications at the far edge are different from what we usually encounter in other scenarios. So what's far edge?
The whole concept of edge computing is bringing compute resources closer to where data is collected. And the definition of edge isn't a single fixed line; it differs based on proximity to the data source. In this presentation we'll mostly deal with the far edge, which means the device that is collecting and computing is very close to the source from which the data is collected. This could be IoT sensors, or edge devices deployed in challenging environments where human intervention is not easily possible.

Now, why Kubernetes operators at the far edge? The compute and resource constraints are already very heavy, so why do we even need operators on those devices? We'll go through a list of reasons why Kubernetes operators are useful at the far edge, and then we'll follow up with practices for how to utilize the whole operator pattern better and make the best use of operators in general.

The first thing is the resource constraint problem, and why we need Kubernetes operators in that case. The solution Kubernetes operators provide is that they automate resource management. With storage sharding and GPU slicing coming in, it becomes increasingly important to be extremely aware of the resources used by the model itself and by the operator. So operators help us split resources and manage resource constraints. The second reason is scalability. Deploying a single AI model or a single pattern on multiple edge devices is pretty common, and operators help us do that by automating the repetitive tasks. Then there's deployment complexity: even if the whole deployment is easy, the task is mundane, so automating it makes a lot of sense.
Now, ensuring consistent, automated deployment across multiple edge devices is complicated, and that can be done easily by Kubernetes operators. There is a bit more of a concern here, which we'll talk about in detail: the relationship between the model version and the operator version itself. This brings into question the relationship between an operator and its operand, and we'll delve into that in the upcoming slides.

The next thing is high availability. Imagine you have two edge devices placed in close proximity to each other, and you would like to maintain high availability and reduce single points of failure. Reaching consensus with an even number of devices is always an issue. The operator can be a single point that manages the availability of a particular application or model on both of these devices, by sitting externally and managing individual single-node clusters on both of them.

Networking and communication: a very simple example of this is managing VPNs and controlling who can access an edge device, and this is a very important application because security is a major concern when you deploy devices at the edge.

The next is lifecycle management. Having lifecycle managers like, say, OLM or even Helm is usually very difficult at the edge. If you want to manage the lifecycle of an application itself, operators provide a better way, and it's very important to have the application management and the management of the individual models built into a single operator binary; we'll again go deeper into this in the upcoming slides.

Security, as we spoke about earlier, but this one mostly relates to using Kubernetes RBAC and ABAC controls to ensure which service account can access what part of a cluster on an edge device. So, coming to operators at the far edge: we have seen the challenges, and we have seen why operators make sense.
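The RBAC point above might look something like the following minimal Role and RoleBinding, scoping the operator's service account to only what it reconciles on the device. The names (`edge-operator`, `edge-apps`) and the resource list are hypothetical; a real operator project would typically generate these manifests from kubebuilder RBAC markers.

```yaml
# Hypothetical example: restrict the operator's service account
# to only the resources it actually reconciles on the edge device.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: edge-operator          # hypothetical name
  namespace: edge-apps         # hypothetical namespace
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: edge-operator
  namespace: edge-apps
subjects:
  - kind: ServiceAccount
    name: edge-operator
    namespace: edge-apps
roleRef:
  kind: Role
  name: edge-operator
  apiGroup: rbac.authorization.k8s.io
```

Using a namespaced Role rather than a ClusterRole keeps the blast radius small on a shared edge device.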
Now let's go into the steps we can take to mitigate these challenges in the best possible way. The first issue was resource utilization: how can we develop operators that reduce resource utilization? One suggestion is to optimize the operator design using the single responsibility principle, and this is a very important topic, because by single responsibility principle I mean having individual responsibility at the controller level rather than at the operator level, which means a single operator can have multiple controllers and each controller can manage an individual task.

Next is lightweight base images: use images that are lightweight and do not contain unnecessary packages that aren't useful for the application or for the functioning of the operator itself. A very simple example is using UBI Minimal instead of UBI itself.

The third is setting appropriate resource limits on operator deployments. Kubernetes gives us limits and requests, which let us set the minimum and maximum CPU and memory consumption in our deployment.

The next important issue these devices face is the compute footprint; it needs to be as small as possible. Some of the ways to do that: first, reduce dependencies, so it's better to use libraries or packages with few external dependencies to keep the whole application or container lightweight. The next one is keeping projects lean and enabling dependency pruning, and this is something Go gives us, I think Go 1.11 is when Go modules arrived, so pruning is easier; but if you are developing an operator in any other language, it's on you to ensure that only the right dependencies are added and the unnecessary ones are pruned. The third one is avoiding webhooks.
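The limits-and-requests point can be sketched directly in the operator's Deployment. The numbers and names here are illustrative only, not a recommendation; tune them for your actual device.

```yaml
# Illustrative resource bounds for a far-edge operator pod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: edge-operator          # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels: { app: edge-operator }
  template:
    metadata:
      labels: { app: edge-operator }
    spec:
      containers:
        - name: manager
          image: example.com/edge-operator:v0.1.0   # hypothetical image
          resources:
            requests:
              cpu: 50m        # scheduler guarantee (minimum)
              memory: 64Mi
            limits:
              cpu: 200m       # hard ceiling on a constrained device
              memory: 128Mi
```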
It's better to avoid webhooks because they increase the number of HTTP calls happening, whether within the cluster or to the API server itself. One of the options instead of webhooks is CEL validation: the Common Expression Language, which is being developed intensively by SIG API Machinery, is one of the ways you can validate a particular CRD spec in a better way.

The next one is selective caching, and with recent releases of controller-runtime it has become much easier to figure out what we need to cache and to cache only what's necessary. Quickly going through the options controller-runtime gives us: you can use a label selector or a field selector to identify the objects you want to cache, or you can cache a whole GroupKind. For example, there's an option to specify which pods we want to cache based on a label selector, but you can also cache the whole set of nodes, or all the pods belonging to a particular namespace, or even cache a particular namespace entirely, whatever objects are present in it.

Moving on, the next one is choosing multi-cluster architecture very, very wisely. In certain cases it is very useful to have a single manager and multiple spoke clusters, so that the single manager can deploy multiple other operators, but this brings in a lot of other issues in terms of network dependency, the complexity of handling the whole multi-cluster scenario, and synchronization challenges between the specific spoke clusters in general.
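The CEL option mentioned above can be sketched as a validation rule embedded directly in a CRD schema (`x-kubernetes-validations`), so the API server rejects bad specs with no webhook pod and no extra HTTP hop on the device. The field names in this fragment are hypothetical.

```yaml
# Hypothetical CRD schema fragment: CEL rules run inside the
# API server, so no webhook deployment is needed on the device.
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      properties:
        replicas:
          type: integer
        maxReplicas:
          type: integer
      x-kubernetes-validations:
        - rule: "self.replicas <= self.maxReplicas"
          message: "replicas must not exceed maxReplicas"
```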
With edge devices, since they are deployed in challenging conditions, it is very difficult to ensure there is always consistent network connectivity, so make sure you plan for infrequent upgrades. This is going to be the case when you would like to upgrade a model, but the upgrade need not happen immediately.

The next one is using a remote edge repository, which brings a lot of benefits. One is that an edge repository decouples the lifecycle of model development from the operator itself. The other benefit is that the remote edge repository can hold the stages of model development at different instants in time, and then you can put a model into production whenever you feel it is ready.

Thank you, Varsha. So now we are going to rethink the relationship between the operator and the operand. If you are not familiar with that terminology, it's a shorthand way of describing things: the operand is the application, the workload, the thing you are really trying to run in production to solve a business problem; the operator is the extension of you. It is modeled after a human operator, automating that person's job to reconcile and solve problems dynamically and in response to events in a cluster, right?
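One way to picture that decoupled model lifecycle is a custom resource that pins the model by an immutable digest, released on its own schedule, independent of any operator or operand release. This CR, its group, and all of its fields are entirely hypothetical, just a sketch of the shape.

```yaml
# Hypothetical custom resource: the model is referenced by digest
# and can roll forward or back without touching the operator release.
apiVersion: edge.example.com/v1alpha1
kind: InferenceModel
metadata:
  name: anomaly-detector
spec:
  # Pulled from the edge-local registry when connectivity allows.
  modelImage: registry.edge.local/models/anomaly-detector@sha256:<digest>
  stage: production        # e.g. staging | production
```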
So we have some long-standing advice and thoughts about this relationship between the two; let's look at some different ideas. Here is a common best practice. We are starting with a set of, I've got like four pods we might call these things here. This is an inference kind of scenario: we have a data collector, we have a model server, we have some kind of business application, and it probably stores some state, maybe there is a database, maybe there is a message broker, you can imagine those kinds of things. Now, in reality, each one of these might be its own whole collection of services, but for this exercise let's just consider them as one entity.

Our typical best practice is that we are going to have an operator for each one. For example, the database: let's say we are using Postgres; your Postgres and your Postgres operator might be reusable in other contexts with completely different scenarios. In fact, maybe you even got that operator and database from some third-party vendor. The downside here is, again, we are at the edge, the far edge, maybe on a constrained footprint, and we have just doubled the number of pods, doubled the number of processes that are running. The operators mostly sit around not doing a whole lot; they are waiting for events in the cluster, but for most of their life everything is good, there isn't an upgrade rolling out right at this moment, so they are just sitting around. Why do we want all four of them, or more, running all the time? Can we make this more efficient?

Well, here is this bold, crazy idea: we can actually combine operators. It's more work, it's not for every scenario, but it will definitely give you a lighter footprint. Some benefits here: we have a shared cache. Now, I'm sure many of you are familiar with the concept that it's very important to use a caching client in your operator design. We work very hard to protect the API server from unnecessary requests, so your operator will have its own
cache of all the resources that it has looked at before, hopefully. We try to be very selective and follow the strategies Varsha was describing, only watching the narrow range of resources you really care about, the ones actually relevant to your operator or controller. But sometimes you end up with other things in your cache you really didn't want. We often see config maps: all of a sudden you're watching every config map in the cluster by accident, and now that's all in memory. Multiply that times three or four or five different operators, and that's a lot of extra stuff being stored, copy after copy after copy. So having a shared cache, by putting all of these controllers into one process, can be a significant benefit.

Likewise, you get shared connections for all of your list-watches. Now you have one operator and one manager, and you can share these resources: instead of each individual controller having its own list-watch against the API server, you potentially have just one per type. You also reduce process overhead; obviously you're eliminating all those extra Go runtimes, just normal process overhead in general, and there are fewer pods to worry about, so there's some efficiency.

But obviously there are downsides, and I'm sure you have some in mind. You get a bigger image out of this, a bigger container image. It's harder to update now, because if you want to fix a bug in any one of these controllers, you're rebuilding this whole image and shipping the entire new image, with all of your controllers and operators in one, out to the edge, as opposed to being able to ship maybe just a narrower feature set in just one operator. So that's something you need to balance. And then obviously there's software complexity; it's going to take some extra software engineering to combine these things. You're bringing in dependencies that might conflict with each other
depending on who created these operators, if you're bringing in a third-party operator, or even just teams that didn't really collaborate with each other during the development process. So you might have some things to reconcile there, so to speak, but this might be an idea that's worth considering.

Now here's another default practice, speaking of operator relationships. Here we have three clusters, and again we're talking about the far edge, where it's very common to have single-node Kubernetes. That's a whole other discussion, maybe, but it's quite common: the independence of these systems is very important, and maintaining reliable, highly available multi-node control planes, even at the far edge, can be kind of complex, so we see a lot of single-node. Each cluster gets its own copy of an operator; that's a very natural thing. An operator is extending the Kubernetes API with custom resource definitions, right? So naturally, wherever you have a control plane, you're putting your controllers there to add and implement these new APIs. But like we talked about a few minutes ago, these operators are pretty lazy; they sit around doing mostly nothing most of the time. So why do we need a copy of every single one in every single cluster, sitting around mostly not doing much?

It's perfectly reasonable to consider another option: do we really need it in every cluster? We can start to move toward a multi-cluster operator world. We've talked about this some before, and there is some support for this pattern in controller-runtime (controller-runtime, by the way, if you're not familiar, is a library that's extremely popular for implementing operators and controllers). So there is some support for having a single controller that interacts with many different clusters, but it's still a world that's largely unexplored and is ripe for exploration, maturity, and pushing forward. If that's something you're interested in, I would encourage you to get involved with the operator
framework community and see what you can do. But this is an obvious option for consolidation at the edge. You might even consider a dedicated point of management; that's a common pattern we see as well. It depends on your scenario, but maybe you've got one cluster dedicated to local management tooling: you run your container image registry there, maybe you run some operators there, other local tooling that might be useful for your dedicated, purpose-built device kinds of clusters.

All right, let's deep dive into operands themselves. Here's an expanded scenario of inference at the edge. Starting on the left, we've got real-world data coming in: maybe you've got sensors, maybe it's cameras, maybe it's measuring the temperature outside, some kind of real-world data that forms your input. Now there's some set of services doing pre-processing on that, maybe normalizing the data, maybe converting it to a different format; you can imagine what might be a wise idea there, before sending that data into some system that's doing the actual inference. There's a wide variety of options here. Some of it depends on whose hardware you are running and what tool set you have chosen to implement your model training and model delivery, so we'll leave that separate; just imagine there's a lot of variability in this inference box. Potentially you've got a model, which is a required piece of data, and then there's some application that's pulling the output and solving some real business problem with that inference, right?

Okay, where are the operands in this diagram? Right there: these are the things that you and I can operate. They're functional pieces, they're doing something, they're acting within this system. What about the model? The model is data. It's very tempting to include the model in the container image with whatever system is doing the inference; we commonly do that sort of thing, we put in all the dependencies. Like, this is the dream
of the container image, right? The shipping-container metaphor: we put everything in there that is required, config and data, and we can move it wherever we want. But here, this data in particular, this model, can get really big, really, really big, and it has its own lifecycle. It might get updated, it might get refined, a completely different lifecycle than the software involved in this system. So what I'm arguing and recommending to you is that thinking about the model not as part of the operand, but as a separate piece to manage independently, very deliberately, is going to save you some headache.

So how are we going to put this into practice? This is how we've typically recommended delivering and releasing operators into the real world. We have some operator release artifact that's got the usual things: a name, so this is our hot new operator; a version; in particular the CRDs, how you actually run the operator, yada yada. And importantly, we have a reference to an operator container image, an immutable, content-addressable image reference, and then as many operand image references as are necessary. That is now an artifact that we can ship. We can sign it, we know for sure which version is in production, and we can validate that. If you want to update the operand, you ship a new version of this whole thing; if you want to update just the operator, same deal, a whole new artifact.

One of the benefits this gives us, which is crucial and why this has been such a recommended best practice, is that it couples, and here's a case where we don't want to decouple, these two life cycles. Your operator knows that it has a very small range of operand versions that it might need to manage. You don't have to write software that can automate any version of that operand that's ever existed, or, importantly, any version that might exist in the future. We know these two
versions are going together, so we know exactly what to expect when that operator is interacting with its operand. Now, what are we going to do with this model? Are we going to stick a model image reference in here? Because we definitely are seeing a lot of use of container images, just a raw container image as a vehicle to move the data that is the model around, delivered out to the edge; it's a very natural thing to do. So are we going to put a model image reference here? Well, you can probably guess: no, we're not going to do that. We want to lifecycle that thing, again, independently, separate from the release process of this operator-and-operand combination. So what are we going to do instead? Put it in our CRD. This is where we're breaking the rule. This is what we've largely argued against for an operand: like, here's my Postgres operator, and in your CRD you can insert any image reference to any Postgres image you want; we've mostly recommended against that sort of thing. But here, again, this is data, it's not a thing that is running, and again we've got the edge scenario, so we're optimizing for a very different world than we've typically thought about with operator design. If we put that image reference in our CRD, this enables us to update it independently, separately from everything else.

Okay, the lifecycle of the model; let's dive into this a little deeper. We've got an operator that is generally operating this whole system, right? How are we going to get the model into this system? Well, one pattern we see a lot of, and this is on Kubernetes or not, maybe not a majority, but an awful lot of this kind of inference is happening in container images, and of course an awful lot of that on Kubernetes, is using an init container that has the model image in it, so that when this inference system starts, the init container copies the model onto some shared storage, maybe ephemeral shared storage. That is a very easy
technique, and it's not that surprising to inject that into this inference system. Another option: back to the idea that we have some kind of edge-local container image registry. Our operator can, either directly or indirectly, just grab that image and put the model into, maybe, object storage, or maybe a model registry. This is not a container registry; this is a model registry, maybe specific to the framework you're using. There aren't really great standards yet, but maybe they'll emerge soon. In terms of those model registries, having your operator interact more directly, taking the data out of that container image and putting it where it needs to go in your local system, is a very reasonable approach.

Okay, monitoring and reporting. These components we've been looking at for a few minutes have metrics, very conveniently. Depending again on which stacks you're using and which technology you choose, there are different metrics, but the kinds of things you might see here account for the fact that the real world changes. So let's look at just the left side of this system for a moment and start with the real-world data. Data changes over time. When you're measuring the real world, think about weather patterns changing: maybe your model was trained with some expectations that are now different. Maybe you're tracking vehicle traffic on the road, or human traffic in a plaza for security or something, and then there's a pandemic and all of a sudden those patterns are extremely different. You can imagine a lot of different reasons why real-world data changes, and so the effectiveness of your model, by different metrics, might decrease over time, and it might need some retraining. These metrics are a great way to track that, and your operator running on this device is a helpful and handy way to locally, and again independently and autonomously, keep track of how this system is doing and whether there is any action we need to take to
improve things. So let's say we've got Prometheus here, judiciously choosing which metrics we want to pick out and track, using our operator, or there's actually other tooling that your operator might consider orchestrating to monitor these metrics, and then our operator is here to take action. What action might you take? Of course, send an alert: okay, the effectiveness of this model has decreased to an extent that somebody needs to take action. Maybe we can do something a bit more automated. You might need to stop something if this is a safety issue; maybe this is a machine with human-machine interaction, and the model is no longer able to accurately keep track of where that human is, for example, or where my hand is relative to some machine in a factory; it needs to stop that machine, potentially. Maybe you are going to roll back a model update: say we just updated the model and all of a sudden the effectiveness drops dramatically; we're going to roll back that change until somebody can come up with something better. Or, if we're getting real fancy, we can maybe kick off a retraining pipeline and do retraining right there, based on the data we are actually seeing. Very circumstantial, it depends; that's like the theme of AI these days, it depends. But these are all the kinds of things you might think about being within the sphere of influence of your operator.

Okay, what are some takeaways? We want you to use operators to make your systems independent and autonomous. We're talking about the far edge; that operator is an extension and a representative of you. It is your opportunity to automate the things that you, or an SRE team, might do in the middle of the night when some problem happens, so that you don't actually have to get paged. And of course, with the device edge we're talking about tens, maybe hundreds of thousands of devices, at a scale at which you really can't afford to be addressing them directly anyway. We want you to minimize the impact of your running operators, so
Varsha talked about a lot of things related to efficiency and a smaller footprint; those are all great strategies. Admittedly, some of those aren't breaking the rules; those are re-emphasizing the original rules, although some of it is designing differently, being very disciplined about what we include. Then, optimize lifecycle workflows for the challenges of the far edge. You need to get this big data blob that is your model out to some remote location, with maybe little bandwidth, maybe a flaky network connection, that kind of thing. Think carefully about how you're going to get updates; maybe you're pre-staging them. Your operator is the best place to be orchestrating that workflow. And then, double down on the vision that these operators automate the domain-specific tasks. We see a lot of operators that start out with great intentions and great aspirations but really turn into an installer and an upgrader and not a whole lot else. This is your opportunity to use this pattern to the fullest: automate your job, automate those domain-specific tasks.

So those are some ways to break the rules, and now you can ask us questions or suggest some additional rules that need to be broken. What do you got? Not all at once. We could break the rule and not take questions. Well, while you think about that, here you can give us feedback; we would love to hear your feedback, especially if it's positive feedback we could share with our managers. It's been great to have you here. We will be up here for a little bit; I think there's a 30-minute coffee break now, so we'll hang out, and then we'll probably be around the Red Hat booth during the event tonight that's happening there. Thank you very much.