Thank you. Hello, and welcome everyone to EuroPython. Good morning, good afternoon, good evening, from whichever time zone you're joining. This talk, as evident from the title, is about managing complex applications on Kubernetes while staying in the Python ecosystem. Kubernetes is mostly written in Golang and has its most active community in Golang, but I want to introduce some frameworks, tools, and software patterns that let us code all of this automation in Python, staying in the Python ecosystem. For this talk, a basic understanding of Kubernetes concepts like pods, deployments, and services will help you get the most out of it. A little bit about myself: I'm Gautam, a software engineer at Grofers, India, one of the largest online grocery shopping platforms in the country. We run a fleet of more than 20 microservices on a Kubernetes cluster; at the extreme end we go up to a million daily active users on those microservices, which are written in Python with Flask and Django. I completed my bachelor's in software engineering from Delhi Technological University, graduating back in 2018. I did GSoC with LibreOffice, and I love open source: in the past I have contributed to Mozilla Firefox for Android, the OpenMRS medical record system, FOSSASIA's Open Event project, and others. As mentioned in the introduction, this is going to be my first talk at any conference, so please bear with me. Okay, I've divided this talk into four phases. In phase one, we introduce and discuss some problem scenarios that come from running applications on Kubernetes: problems with configuration management and with setting up a database cluster. Then we introduce the focus problem for this talk, which is running a Celery cluster in production; since Celery is a very popular distributed task queue system written in Python, I chose it for this talk. In phase two, we generalize the learnings: all the manual steps we do, and what the different pain points are of managing stateful applications in general on Kubernetes. We then discuss the goals for the Celery automation we're going to build, solving each of the manual steps incrementally, and the extension capabilities in Kubernetes that will help us achieve that automation. In phase three, we build that solution incrementally, and at each step we automate the manual steps we discussed and see them in action. We'll see the custom Celery resource, and then the operator (operator is something we'll discuss later in the talk) reacting to events. We'll also see automatic scaling of Celery workers up and down based on queue depth, which is not provided out of the box by Kubernetes. And in phase four, we conclude: we'll see what the world is doing with operators, the existing frameworks, SDKs, and other use cases, and then we'll proceed to Q&A. Okay, let's start with the first problem. I'm going to discuss three real-world scenarios as problems, or rather, let's call them opportunities, where we can automate stuff.
So this is a very common problem with configuration management on Kubernetes. Kubernetes provides the ConfigMap and Secret objects to manage your configuration on a cluster, and there is a very common problem with them: whenever you change a value in a ConfigMap, you need to go and restart the corresponding deployment for the modified value to take effect. This is a burning issue; you can see it from the number of reactions on the issue in the Kubernetes open-source project. One way of solving it would be to imagine a watcher pod managing those ConfigMap and Deployment objects, so that as soon as you change the ConfigMap values, it automatically restarts the relevant deployment. That's one opportunity with potential for automation. Coming to a slightly more complex example: setting up a database cluster, say Postgres or MongoDB. Running a database is actually easy. You just write a deployment spec, the declarative spec; you define a persistent volume; then you claim that volume in the deployment. Running it is as simple as that. However, managing that cluster over time is difficult. You need to set up connection pooling; you need to manage resizes or upgrades in case space runs out or you're running an older version; and you need to take care of reconfiguration, which requires operational expertise. To work with Postgres, I need to know its internals, its configuration templating, and so on. There are more such problems, like backups and recovery, which need an infrastructure operator to do all these things manually. Now coming to something very popular in the Python ecosystem called Celery, which is also going to be the focus of this talk. Starting with the basics: what is Celery? Celery is a popular distributed task queue system. I work for an e-commerce company, so the typical use cases we use Celery for are asynchronous workloads: sending emails and SMSes, or anything that happens after an order has been placed, such as triggering cashbacks, promotions, or rewards to users based on order status changes. That's what we use Celery for. If you look at the bottom left, you see a very basic Flask-Celery application architecture: there is a Flask application that pushes messages to a broker, which could be Redis, and then Celery workers pick the tasks from that broker, process them, and send the result back to the result backend, or wherever you have configured it. This is what a very simple Flask-Celery application looks like: you define a Flask application, you define the broker URL (a Redis master in this case), you define the result backend, and there is a very simple task that just adds two numbers and returns the result. That part happens asynchronously later on. And this is the command you run to start a Celery worker: you provide the path to your Celery application, you pass the worker argument, and then there are tons of configuration options (concurrency, logging level, and so on) provided by Celery. So this example is a very basic Flask-Celery application that you can set up locally and try out; a sketch of it follows.
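Here is a minimal sketch of the kind of app described above. The module and service names (app.py, redis-master) are illustrative assumptions, not the exact demo code:

```python
# app.py: a minimal Flask + Celery example, as described above.
from celery import Celery
from flask import Flask, jsonify, request

flask_app = Flask(__name__)

# Hypothetical broker/backend URLs pointing at a Redis master service.
celery_app = Celery(
    "example",
    broker="redis://redis-master:6379/0",
    backend="redis://redis-master:6379/0",
)

@celery_app.task
def add(x, y):
    # Executed asynchronously by a Celery worker.
    return x + y

@flask_app.route("/add")
def enqueue_add():
    # Push a task to the broker and return immediately.
    result = add.delay(int(request.args["x"]), int(request.args["y"]))
    return jsonify({"task_id": result.id})
```

The worker would then be started with something like `celery -A app.celery_app worker --loglevel=INFO --concurrency=2`.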
Now, when you have to deploy this Flask-Celery example in production, on Kubernetes specifically, you need a worker deployment YAML that looks somewhat like this: kind Deployment; how many replicas you want, that is, how many workers; then the containers, a container named celery with the image that is going to be pulled and the command that runs inside it; and then the different arguments like queue names, log level, and concurrency, plus the resource constraints you can specify. That's one manual step: writing a worker deployment YAML. Then, when you're running in production, you need to set up monitoring as well. You need to make sure your Celery workers are running all the time, your broker is healthy, and your messages are actually being processed. The de facto standard for monitoring Celery is Flower ("flower" or "flow-er", however you like to pronounce it). So you also need to write a Flower deployment spec, and you need to expose that deployment as a service so that people outside the cluster can access it and see whether the cluster is working fine. Then you also have to manage autoscaling. In production you never know when there is going to be a high or low workload, so you need some kind of autoscaling in place. You might want to scale the workers on resource constraints, say if CPU or memory usage has increased beyond a certain limit. Or, very specific to Celery, if the number of messages in your Redis queue keeps increasing, you can scale the number of workers to maintain an average number of messages to be processed by each worker. But that is not really supported by Kubernetes directly. Summarizing these problems of running a Celery cluster in production, there is a block diagram on the right. There is the worker deployment we discussed, which manages the Celery worker pods; there is the Flower deployment, which manages the Flower pods; there's the Flower service, which sends request traffic to the Flower pods and shows results back to the user; and then there's the simple Flask-Celery example we saw: the Flask application pushes messages to the broker, and the Celery worker pods pick the messages and keep processing them. That's a typical Celery cluster in production. Now, the problems when you manage this cluster in production: it's not easy to get a new setup right (we saw all the manual steps it takes), and there is no way to set up multiple clusters in a consistent way. When you have more than 100 engineers across different teams using Celery for different use cases, everyone configures it their own way. There are a lot of possibilities for misconfiguration, because Celery and Flower both provide tons of configuration options; you might misconfigure concurrency or logging level or anything else that can go wrong in production. And later on, it also creates problems with infrastructure auditing: with multiple clusters, nobody knows how many resources a given cluster is using or whether it actually needs them. All these things are problems when it comes to running Celery in production. To make the first manual step concrete, that worker deployment YAML looks roughly like the sketch below.
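This is a rough sketch with assumed names (image, module path, queue), not the exact spec from the demo:

```yaml
# Worker Deployment: illustrative values throughout.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-worker
spec:
  replicas: 2                      # number of worker pods
  selector:
    matchLabels:
      app: celery-worker
  template:
    metadata:
      labels:
        app: celery-worker
    spec:
      containers:
      - name: celery
        image: example/flask-celery:latest      # hypothetical image
        command: ["celery"]
        args: ["-A", "app.celery_app", "worker",
               "--queues=default", "--loglevel=INFO", "--concurrency=2"]
        resources:
          limits:
            cpu: "500m"
            memory: 256Mi
```

The Flower deployment and its Service are written the same way, which is exactly the repetition the rest of this talk automates away.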
Generalizing these learnings, these problems, or opportunities: we can say that managing stateless applications on Kubernetes is easy, but stateful applications like databases, caching systems, and messaging systems need specific domain logic for how they are to be set up in production, and scaled, upgraded, or recovered when a disaster happens, for a typical business use case. Kubernetes is designed for automation, and it is possible to extend its behavior to manage all these complex applications while staying in the Python ecosystem. There is also one more problem to solve: bridging the gap between application engineers and the infrastructure operators who actually manage these services. Next, the goals for this problem. As mentioned here, deploying and managing stateful software like Celery should be made easy for everyone. Kubernetes has seen wide adoption because of its declarative way of specifying configuration. So what if I could specify my Celery deployment something like this: there's a kind: Celery; there's a common spec where I provide my app name, the path to my Celery app, and the image to run; then a worker spec with the number of workers (I've limited it to a few of the simple configuration options that Celery provides); and similarly a flower spec, resource constraints, and all the things you can configure. A sketch follows just after this. If I, as an application developer, could specify a YAML like this and do nothing more than kubectl apply -f my-spec.yaml, Kubernetes should be able to set up the worker deployments, their monitoring, and their scaling automatically, in the best way possible. That is the goal for this talk, and it's what we are going to achieve in the end: Kubernetes should understand the spec and take actions accordingly on the different events that happen. Now, as I said, there is this kind: Celery specified here, but Kubernetes does not know out of the box what Celery is, or Postgres, or any other database. It knows what a Deployment is, what a Pod is, what a Service is. It is, however, possible to extend that behavior using a concept called CRDs, Custom Resource Definitions. I can define my own resource in Kubernetes and extend the Kubernetes APIs to understand a custom resource named Celery. A CRD also lets you provide a structured schema (the worker spec and flower spec we saw) for that custom object, and that helps standardize the specification across the Kubernetes cluster for multiple Celery applications. So in the block diagram: there are the native Kubernetes objects, the worker deployment, flower deployment, and service we saw earlier in the talk; then there is the CRD, which I'm going to define; and then the custom resource. That Celery resource will carry some sort of status, and it is going to pass through some logic that we're going to discuss next.
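For reference, a sketch of that desired declarative spec. The group, version, and field names here are my illustrative assumptions, not the operator's final schema:

```yaml
apiVersion: celeryproject.org/v1alpha1   # hypothetical group/version
kind: Celery
metadata:
  name: celery-example
spec:
  common:
    appName: demo-app
    celeryApp: app.celery_app            # path to the Celery application
    image: example/flask-celery:latest   # hypothetical image
  workerSpec:
    numOfWorkers: 2
    logLevel: INFO
    concurrency: 2
  flowerSpec:
    replicas: 1
  scaleTargets:
    queueName: default
    minReplicas: 1
    maxReplicas: 5
    avgQueueLength: 20                   # target messages per worker
```

One kubectl apply -f my-spec.yaml on a file like this and, once the pieces from the rest of the talk are in place, the deployments, monitoring, and scaling should follow.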
But Kubernetes should be able to understand all of this; that's the whole aim. So how will this Custom Resource Definition for Celery look? Somewhat like this: I define a CustomResourceDefinition, I give the metadata, the kind Celery and its short names, and I specify an OpenAPI v3 schema for this object. Let me show you this in full. Just a second. Okay, yeah, so this is what my Custom Resource Definition for Celery looks like. It's a proof-of-concept version right now, not a fully fledged production one, to keep the talk simple. This is the spec I'm expecting: there's the common spec with the common configuration parameters I can pass in; then the worker spec, which accepts properties like number of workers, log level, and concurrency; then the flower spec; and towards the end we have the autoscaling targets as well. With the help of this, the Kubernetes cluster will be able to understand my custom Celery resource: the kind: Celery with the common parameters, worker spec, flower spec, and the scaling targets I've specified. Creating this Custom Resource Definition is as simple as kubectl apply -f deploy/crd.yaml. If you then get the CRDs, you'll see that the Celery CRD is created. And when I create my custom resource, which is deploy/cr.yaml, I can also get the Celery applications currently running on my cluster. Now, if you watch closely, right now nothing else will happen: Kubernetes is just able to recognize that some Celery resource has come in, accept it, and store it. That's it. A trimmed sketch of that CRD follows.
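This is a trimmed, illustrative reconstruction of such a CRD. The group, names, and exact schema fields are assumptions; the real deploy/crd.yaml is more complete:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: celeryapplications.celeryproject.org   # hypothetical group
spec:
  group: celeryproject.org
  scope: Namespaced
  names:
    kind: Celery
    plural: celeryapplications
    singular: celery
    shortNames: ["cel"]
  versions:
  - name: v1alpha1
    served: true
    storage: true
    schema:
      openAPIV3Schema:                 # the structured schema mentioned above
        type: object
        properties:
          spec:
            type: object
            properties:
              common:
                type: object
                x-kubernetes-preserve-unknown-fields: true
              workerSpec:
                type: object
                properties:
                  numOfWorkers: {type: integer}
                  logLevel: {type: string}
                  concurrency: {type: integer}
              flowerSpec:
                type: object
                x-kubernetes-preserve-unknown-fields: true
              scaleTargets:
                type: object
                x-kubernetes-preserve-unknown-fields: true
          status:
            type: object
            x-kubernetes-preserve-unknown-fields: true
```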
Now, coming to the thing that is going to react, that is going to make all this automation happen: controllers. Controllers in Kubernetes are at the core of its self-healing capabilities, and they continuously execute control loops for the API objects they are watching. On the right side there is a very simple example, the ReplicaSet controller. You specify that you need a number of pods equal to three; the ReplicaSet controller then constantly runs a control loop to make sure that number is always there in the system. It continuously checks the observed state, decides whether the number of pods is more or less than three, creates more pods or deletes extra pods accordingly, and eventually makes sure your desired state is reached. Kubernetes also provides the flexibility to write custom controllers to manage your custom resources; so for the custom Celery resource I created, I can write a custom controller that watches it and takes appropriate actions. Coming to this reconciliation loop: Kubernetes works on the concept of level-triggered rather than edge-triggered behavior, which some of you might have studied in electronics. In level-triggered behavior, when a signal goes from zero to one, a loop continuously executes at that level until the signal comes back down to zero. In the edge-triggered concept, by contrast, your code executes only at the moment the state changes from zero to one; level-triggered continuously keeps converging while the signal is held. Okay, coming to the next part: when you combine CRDs with the custom controllers you define, you get this thing called the operator pattern. The operator pattern is a software design pattern which lets you manage complex applications on Kubernetes; it takes care of creation, scaling, upgrades, recovery, and more. Later in this talk we are going to actually code that custom controller. Operators are simply software that extends the native Kubernetes abilities to reliably manage these complex applications. The concept was introduced by CoreOS, which has since been acquired by Red Hat. You can simply call an operator a Kubernetes-native app: similar to how Android exposes APIs on which you build Android apps, Kubernetes exposes APIs to build apps for itself, and the operator pattern is one of the design patterns you can follow to build a Kubernetes-native app. An important distinction: all operators are controllers, but not every controller is an operator. A controller, as we saw, can be something generic that just runs a reconciliation loop, like the ReplicaSet or Deployment controllers; operators are custom controllers that have operational knowledge baked into their code. Now, on implementation: operators can be written in any language or runtime that can interact with the Kubernetes API. This talk specifically encourages writing operators and supporting frameworks in the Python ecosystem. Right now Golang is the popular choice, because the whole of Kubernetes is written in Golang, but this talk is about what you can achieve with Kubernetes while staying in Python. And there are a lot of existing operators out there: there's a Prometheus operator, an etcd operator, a MongoDB operator. It's as simple as installing these operators, and they take care of managing the whole cluster for you. Right, coming to the controller part: now we are going to implement our custom controller for Celery. Let's start with creation. Whenever I create my new Celery resource, there should be something that reacts and brings up the worker deployment, the Flower deployment, and the Flower service; all the manual steps we did for running Flask-Celery in production. I have used a popular framework called KOPF, the Kubernetes Operator Pythonic Framework, which is open-sourced by Zalando SE, a Germany-based e-commerce company. The general idea of the framework is that it takes care of interacting with the Kubernetes API automatically and exposes handlers that you code. You just need the domain expertise to code in Python, and everything else is taken care of; you don't need to know the Kubernetes internals to write a controller. So here is a simple watch that I'm doing on my Celery resource; a trimmed sketch follows.
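This is a hedged sketch of what such a creation handler can look like with KOPF. The group/version, field names, and the single inlined child are my assumptions; the real operator builds the Flower deployment and service the same way:

```python
# operator.py: run with `kopf run operator.py`.
import kopf
from kubernetes import client, config

@kopf.on.startup()
def configure(**kwargs):
    # Assumes in-cluster credentials; use config.load_kube_config() locally.
    config.load_incluster_config()

@kopf.on.create("celeryproject.org", "v1alpha1", "celeryapplications")
def create_fn(spec, name, namespace, logger, **kwargs):
    # 1. Validate the incoming spec (minimal check for illustration).
    if "common" not in spec or "image" not in spec["common"]:
        raise kopf.PermanentError("spec.common.image is required")

    # 2. Build the worker Deployment from the custom resource's spec.
    worker = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{name}-worker"},
        "spec": {
            "replicas": spec.get("workerSpec", {}).get("numOfWorkers", 1),
            "selector": {"matchLabels": {"app": f"{name}-worker"}},
            "template": {
                "metadata": {"labels": {"app": f"{name}-worker"}},
                "spec": {"containers": [{
                    "name": "celery",
                    "image": spec["common"]["image"],
                    "command": ["celery"],
                    "args": ["-A", spec["common"]["celeryApp"], "worker"],
                }]},
            },
        },
    }
    kopf.adopt(worker)  # owner reference: children are cleaned up with the CR
    client.AppsV1Api().create_namespaced_deployment(namespace, worker)
    logger.info("Created worker deployment for %s", name)

    # Flower Deployment + Service creation is analogous (elided here).
    # 3. The returned dict lands on the resource under status.create_fn.
    return {"children": [f"{name}-worker"]}
```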
This handler is fired when I create my Celery resource. Number one, it validates the incoming spec, checking whether what you specified is valid or not. Then it instantiates the Kubernetes API clients, and it deploys the workers, the Flower deployment, and the service. These are simple utility functions that just hit the Kubernetes API with a YAML spec, and they return the children they created, which go into the response of this create function. So if you look at the block diagram, this creation handler is watching the custom resource, and it sends all the children it created back into the resource status. Now, I've talked enough; let's see something in action. I've made a demo video. First of all, we see the CRD and CR creation: I talked about how you create a CRD, and this is the crd.yaml we see now. The Kubernetes cluster is now able to recognize Celery resources, and I'm going to create my custom Celery resource as well. (I recorded a video; I wanted to do this live, but my system is kind of low on RAM when doing screen sharing and all those things.) Okay, I did that, I created the custom resource. Now I'm going to deploy my handler, my operator. Just a second. As you see on the right, I have created a watch on the pods; as soon as the operator comes in, it identifies that I created a custom Celery resource, executes all those steps, and creates the deployment for the Celery workers, the Flower deployment and service, and all those things automatically. Now we check whether our cluster is in a healthy state by looking at Flower. This is the service that was created for Flower. Okay, yeah, so this is the Flower UI: I have two Celery workers currently online, I have not started pushing in tasks yet, and I can simply monitor the application like this. So that was the creation handler in action. And similarly, I can check the status stored by the creation handler on the custom object I created. Sorry, that's the problem with a recorded video: it doesn't follow along with what you're saying. So here: the creation function stored all the children it created, the deployments, how many replicas, the services, and the configuration for that service. Okay, moving back to the slides. What we saw was the creation handler. But we might want to edit our cluster configuration while it's running in production, so you need some update capabilities as well. For updates there is this update function. I take the diff, so I know which parts of the spec were updated: if I modified the common spec, I need to update all the deployments of the Celery cluster; if I modified just the worker spec, I only need to modify the worker deployments; and similarly for Flower. The update handler again returns its result to the resource status, which we saw. A sketch of such a diff-driven handler follows.
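Again a hedged sketch, reusing the hypothetical names from the creation handler; the real handler patches the Flower children too:

```python
import kopf
from kubernetes import client

@kopf.on.update("celeryproject.org", "v1alpha1", "celeryapplications")
def update_fn(spec, diff, name, namespace, logger, **kwargs):
    # Each diff entry is (op, field_path, old, new), e.g.
    # ('change', ('spec', 'workerSpec', 'numOfWorkers'), 2, 4).
    changed = {d[1][:2] for d in diff if len(d[1]) >= 2}

    if ("spec", "common") in changed or ("spec", "workerSpec") in changed:
        # Patch only what the worker deployment derives from the spec.
        client.AppsV1Api().patch_namespaced_deployment(
            name=f"{name}-worker",
            namespace=namespace,
            body={"spec": {"replicas": spec["workerSpec"]["numOfWorkers"]}},
        )
        logger.info("Patched worker deployment for %s", name)

    # if ("spec", "flowerSpec") in changed: patch Flower likewise (elided).
    return {"updated": [list(path) for path in changed]}
```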
Let's see the same in action. I'm going to edit my Celery resource: I'll leave the common spec as it is, set the Flower replicas to two, and change the workers back; let's say I want the concurrency back at one, the log level at info, and the number of workers to be four. You can see the current state of the cluster, all the pods running in it; and as soon as I edited the resource, my update handler was invoked and it created more workers for me. It will eventually reach that state of four workers; the deployment strategy is set to rolling update, so it rolls the pods out at the maximum rate possible. Okay, perfect. Now moving on to the autoscaling part, which is the coolest part of having this operator. How do you handle autoscaling your workers based on queue length? There is a timer handler that runs every x seconds; it hits the Flower service to learn the current status and the current queue length in the broker, and it publishes that back to the resource status. So it's as simple as taking the message queue length from the Flower service and publishing it to the resource status. Then, for the scaling itself, there is a handler watching the published queue length; as soon as that value changes, it is triggered. It takes in the current number of replicas and the scaling targets (the min and max replicas), and through a simple algorithm it makes sure the updated number of replicas keeps the average queue length per worker at the target value, and it patches the worker deployment to make that happen. So the block diagram is simple: it watches the queue and triggers scaling of the worker deployment based on whatever queue length it is getting. Now, the autoscale handler demo. If you look at the bottom, there's a message-queue-length timer being invoked every 10 seconds; right now I'm not pushing anything into the Redis broker. I've just created a Flask example that is going to bombard our Redis queue with a number of tasks, and now the changes start to happen. As soon as the status changed, that is, the number of messages increased, the autoscale handler was triggered, and it increased the number of workers to handle the increased load. As you can see, it updated the number of replicas to five, which was our limit, as we are currently bombarding the task queue. And you can see all the workers are online and processing continuously. Now I'm going to delete my Flask application to see whether downscaling happens as the number of messages goes down. Let's wait a bit so the Celery messages being processed go down in number. The changes have started happening: as soon as the application pushing tasks went down, the operator terminates all the extra pods and takes care of managing this whole cluster automatically. So you can see this is really cool stuff, right? I just needed to deploy that operator YAML once and specify my custom resource as a declarative spec, and that is it: it takes care of the setup, the updates, and scaling the resource automatically. A sketch of the two autoscaling handlers follows.
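A hedged sketch of that timer-plus-watcher pair. The Flower endpoint, field names, and scaling formula are assumptions for illustration; KOPF stores a handler's return value under status.&lt;handler-name&gt;, which is what the field handler watches:

```python
import math

import kopf
import requests
from kubernetes import client

@kopf.on.timer("celeryproject.org", "v1alpha1", "celeryapplications", interval=10)
def queue_length(name, namespace, **kwargs):
    # Ask Flower for the broker's queue depths (endpoint path assumed).
    resp = requests.get(f"http://{name}-flower.{namespace}:5555/api/queues/length")
    total = sum(q.get("messages", 0) for q in resp.json()["active_queues"])
    return total  # stored automatically at status.queue_length

@kopf.on.field("celeryproject.org", "v1alpha1", "celeryapplications",
               field="status.queue_length")
def autoscale(spec, new, name, namespace, logger, **kwargs):
    target = spec["scaleTargets"]
    # Keep the average queue length per worker near the target value,
    # clamped between the configured min and max replicas.
    wanted = math.ceil((new or 0) / target["avgQueueLength"])
    replicas = max(target["minReplicas"], min(target["maxReplicas"], wanted))
    client.AppsV1Api().patch_namespaced_deployment(
        name=f"{name}-worker", namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )
    logger.info("Scaled %s workers to %d", name, replicas)
```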
So, merging all the diagrams we saw during the talk: we started here with a very basic Flask-Celery example, where the Redis master is there and the queue is consumed by Celery worker pods. Then there's the creation handler we saw, which creates all these nodes; the update handler, which updates all these deployments; and altogether this is a flowchart of the whole operator I built for this talk. It's at a proof-of-concept stage and still has some way to go for production, but it's something interesting I wanted to share with the community. Okay, we're moving towards the end of the talk. What did we cover? We saw the problems and opportunities from an example application on Kubernetes: the manual steps we need to take to launch Celery on a production Kubernetes cluster. Then we saw the goals, what I would want to have as an application developer or an infrastructure engineer. We saw the extension capabilities: controllers, the operator pattern, and CRDs. And we saw the creation, update, and autoscaling implementations in action. Okay, next steps for this project. I started it because I wanted to learn about operators in general. It's live and open source on my GitHub, and there is still some way to go to make it production-ready. If you're running Celery and Kubernetes in production, you're more than welcome to tell me what we can improve in this operator; I'm going to be committing a certain number of hours weekly based on the feedback I get from this conference and others. A north-star aim for this operator could be to include it with the Celery 5 release milestone; I'm yet to discuss that with the Celery maintainers, and there is an ongoing discussion on the Celery enhancement proposals repo around the same topic, since they wish to have a Helm chart or an operator for Celery. So this is going to be exciting. Okay, so what are different people doing with operators? It's a relatively new concept, introduced back in 2016. There is a repository called awesome-operators where you can see all the operators people have built, in Golang, Python, and other languages. There is a registry of operators as well, OperatorHub.io, with operators for many applications that are famous and used in production clusters: MongoDB, Consul, and so on. The idea I wanted to share is that when you're running a fleet of 100 to 200 microservices, at the scale of companies like Pinterest or Instagram, there could be an operator that lets you set up a new business service and injects the standard pieces (containers, volumes, logging, monitoring) and creates the final dashboard automatically for you; these are rather manual tasks in which an infrastructure engineer is usually involved. And there are different frameworks and resources to build operators for Kubernetes, as I discussed: KOPF, open-sourced by Zalando SE, which is a Python framework; and the Operator SDK, which is in Golang. Golang has multiple resources, since the pattern was originally launched for Golang.
And then there is Metacontroller, which is a Kubernetes add-on that makes it easier to write custom controllers in any language. As I said before, my aim for this talk was to introduce the operator pattern to the Python community, and to show what we can do to make this ecosystem more mature for building Kubernetes-native apps. All right, I have spoken enough, so it's time for Q&A. Okay, here's the first question: do you have any front end to see the tasks processed across the scaling activity? To see the task processing: no, I don't have one. I haven't worked with the Minikube dashboard, but I'm sure it must provide something like this. I don't have it right now beyond the demo I showed you, but I'm going to try it out, and we can take this discussion to the breakout room. Okay, an anonymous attendee would like you to share your slides and the example code on GitHub. Yeah, I think I have shared the link to the repository; I'm going to post it in my breakout room, so you're more than welcome to take a look at the code. It's not really production-ready, so please bear with that; there's still some way to go, and we're working actively to improve it. Okay, here's the third question: are there any problems with failed tasks during downscaling? No; the downscaling is automatic, and the deployment as a whole manages it. The Celery workers only pick tasks when they are ready, and when they are asked to stop, they stop picking tasks. In this case I think it was fine: we saw in the Flower monitoring that none of the tasks actually failed and none of the workers were killed while processing a task. Celery takes care of that, so a worker isn't killed in the middle of processing a message. Actually, I have a question myself: could you maybe tell us the difference between having a Helm chart and having a custom operator for the same use case? Why would you go for a Helm chart, and why would you have an operator? Yeah, it's a great question that I was expecting. I would say these two concepts are complementary. The operator pattern is a software design pattern about encoding the operational expertise you have, while a Helm chart is more like Homebrew for macOS: a package manager for Kubernetes that helps you install all these things. If you push it to its limits, a Helm chart was originally designed just for package management on Kubernetes, but it has the capability to deploy an operator; you can actually install an operator by using a Helm chart. So these concepts go hand in hand; it's not a versus, not a showdown where you must pick operators over Helm charts. Yeah, thank you. So, thank you for the talk; I think you just demystified Kubernetes and operators for a lot of people, including me. I think we should have a big round of applause for you.