Okay, we're going to get started. I'd like to welcome everyone to today's CNCF webinar, Production GitOps in Practice. I'm Jerry Fallon and I will be moderating today's webinar. We would like to welcome our presenter today, Rick Spencer, head of platform at InfluxData. Just a few housekeeping items before we get started. During the webinar, you're not able to talk as an attendee. There is a Q&A box at the bottom of your screen, so please feel free to drop your questions in there and we'll get to as many as we can at the end. This is an official webinar of the CNCF and as such is subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that is in violation of the Code of Conduct, and please be respectful of your fellow participants and presenters. Please also note that the recording and slides will be posted later today to the CNCF webinar page at cncf.io/webinars. And with that, I will hand it over to Rick for today's presentation.

Thank you so much. I hope everyone can hear me okay; I guess I'll get some feedback if you can't. I'm really happy to be here, and I really appreciate the opportunity to talk about the way we use GitOps in our Kubernetes practice at InfluxData.

First, to provide some context, I thought it would be useful to talk a little bit about InfluxData. At its heart, InfluxDB is a time series database, although it has grown into a platform on top of that where people write custom monitoring applications, custom IoT applications, custom finance applications. Anything where the time dimension is the most important dimension of your data, InfluxDB is a really good place to develop that application. So it's a special kind of database with a platform on top of it that makes it easy to write applications for it.

I always put this slide here to remind me to talk about open source. Valuing open source is one of our company's core values. We do have an open source version of InfluxDB, and we maintain a lot of open source projects. For the purposes of this talk, that matters because, while many people use InfluxDB to monitor their Kubernetes installations or the applications running there, we do not see ourselves as a Kubernetes tooling company. What we really don't want to end up with is maintaining a set of bespoke Kubernetes deployment tools. So we try to find what is best supported in the existing open source community, pick what matches our use case best, and if we do need to write custom code, we prefer to contribute it upstream to the projects we're using. I will touch on some of the open source projects we use as we go.

Okay, so the main product I focus on in my role is called InfluxDB Cloud. Informally we call it Cloud 2, because there was a Cloud 1 product. InfluxDB Cloud is a Kubernetes application, and since it has a database at its heart, it's very stateful. So I'll talk a bit about what Cloud 2 looks like in production — the what and the why — and that will build to why GitOps was a useful methodology for us.

Cloud 2 is a fully SaaS data platform: you sign up for an account, you don't host any servers yourself, and we do all the server maintenance and hosting. But the thing about a database is that, for customers, their data has gravity. And we don't mean that in a marketing sense.
It's almost more literal than figurative. Their data has gravity in the sense that if you have terabytes of data, it's very costly and slow to move it between regions within a cloud, or between clouds. Customers really want their data services to be as close as possible to where they're storing and generating the data. If they're generating data in Europe, they want to store it and operate on it in Europe; they don't want to copy it to the US. If they're generating data in the US, they want to store it and operate on it close to where they're creating it. Additionally, they may be generating data with one cloud provider — AWS, GCP, or Azure — and they want to store the data within that provider. So we are multi-cloud and multi-region: we operate InfluxDB Cloud in multiple regions on multiple clouds. Additionally, soon we are going to start providing customers with private instances, so in addition to all the public instances we have, there will be a collection of private instances available to customers in the region and on the cloud provider of their choosing.

For this reason, very early in the project, Kubernetes was chosen as the operating platform, because it provides a cloud abstraction layer that allows us to be multi-cloud, be multi-region, have all these different instances, and operate them all in the same way, in production, in a mission-critical way.

Okay, so we decided to go for GitOps in order to maintain continuous deployment across all these areas. What I want to communicate here is what we mean by GitOps, because there are various definitions. I know the term was coined very specifically by Weaveworks, but I don't think they would disagree with this. When we set out to do GitOps, we said we would know we had achieved it when these signifiers were true.

First, developers get code into production by landing it in the main branch. Once your code makes it through the gauntlet of code review and CI and gets merged into the main branch, automation takes over and takes it from that Git repository all the way into production. Whatever it takes to get into the main branch — that's all it takes for the developer to get that code out into production.

Second, for us, infrastructure as code should be delivered in the same manner. It's one thing to say I've changed the code in a service and gotten that change through a pull request into the main branch. What about the infrastructure itself — the Kubernetes infrastructure, as well as the architecture of the application? All of that should be delivered in a similar manner. So for us, a key signifier of doing GitOps is that changing the infrastructure of your application, or your Kubernetes cluster itself, is again done via a code check-in to a Git repository.

And finally, there is the notion of testing in production. What that means is that you have a facility in place where developers can get their code into the main branch
even if it is not ready for users yet — it can still be in production, just not available to users, or only selectively available to users in production. That's important because we want to keep up a pace where people are constantly getting code into the main branch, and we never have big events where we're trying to integrate a lot of code all at once. So these are the three signifiers that we set out for ourselves, and I'll refer back to them a bit later in the discussion.

Now I'm going to show some diagrams about the way we visualize GitOps. Before I do, I want to say that part of our company culture is really about transparency and learning. When you see these diagrams, they will look like rainbows and unicorns and butterflies, but there was a lot of blood, sweat, and tears that went into getting to that state, and a lot of complexity is hidden by the simplicity of the diagrams. A lot of times you'll go to a presentation like this and leave thinking, wow, those engineers really have it together, that's amazing — and then when you actually join the team you see it's a different story, that it's actually really hard to do what they're doing. I'm going to try as much as possible to give a more realistic viewpoint of what doing GitOps is like.

Okay, so this, as I said, is our idealized diagram — an idealized view of our overall environment — and I'll walk you through it quickly. Imagine I change a service: here's some code that I wrote, and I make a PR to the main branch here. Then our CI system takes over. This is actually way more complicated in reality, because of course CI doesn't allow the PR to complete before CI has finished. What CI does is not just run tests and integrate the code; it also builds containers and tests the containers, and in the case that the containers pass everything and appear to be ready to go, the CI system then updates our configuration code — which happens to be jsonnet, and I'll talk more about that later — to say, hey, the application should now point to this new container for this service. That configuration gets updated automatically by the CI system (a small illustrative sketch of that step appears below).

Then we have what we call waves of deployment. First, Argo detects that the configuration code for the staging environment has been updated, and tells Kubernetes to apply those changes. Once those changes are applied, we run some tests, which are run automatically by our workflow tooling. If that works, then the Kubernetes configuration for the next wave is updated and it rolls out to what we call internal production — informally, internally, we call it tools. The reason we call it internal production is that only we have access to it, but it's where we do all the monitoring for everything. So whenever there's any hiccup in tools, we treat it as a full-blown production event, because if our internal production environment is not operating optimally, then potentially our monitoring of the external production environments is not optimal either.
So that's why we use the term production for it: we treat it fully as a production environment. Again, we run some smoke tests here. If those pass, then we update the environments for all the external production clusters. External production is all the Kubernetes clusters and instances of the application that end users — customers — use externally.

Okay, there are some caveats and details I want to cover in this part. First, it was easy to write tooling and a workflow that would do these waves when the code changes. It's much harder with the configuration. When we update the code, CI tells the configuration: please update to the new hash of the container. But what if we're changing part of the architecture — scaling something out, or whatever? We do check that into this code, but we don't really have these wave options for it; we have to do the waves manually, if we do them at all. Often we don't — we just change everything all at once. Usually that's fine, but you have to be careful. We still haven't automated a way to go through these waves when there's no code change, just a configuration change.

I also want to touch on the staging environment, and I'll come back to this later as well. We still have a lot of conversations internally about staging. Our thought process was: we'll land on these small instances that are much cheaper to run and not mission critical, run tests on those, and therefore have a way of gaining confidence that the change being rolled out is free of flaws. That makes a lot of sense. But if you look at the DevOps literature, a lot of people will tell you you're dreaming if you think staging is going to help you in that way — staging will never be configured the same and it will never get the same load. So we still have a lot of internal discussion about whether we should even have the staging environment. For the time being it's helping us more than it's hurting us, so we're not feeling a lot of pressure to change, but it's something to keep in mind and I'll touch on it a little later also.

This is what we call the automated feedback loop: a way that we can gate deployment of a wave, either here or here. Actually, we do it down here too, but since this is the end of the line, it's good to know there was a problem, but there's no next wave to block. Right now this is a set of smoke tests; we're not yet using the monitoring and alerting that I'll talk about as part of that automated feedback loop, but I have some slides later where we can go into a little more detail on those concepts.

And just in terms of my comment about rainbows and unicorns and butterflies — don't look at this slide too closely. It's totally out of date, but if you start to really unpack everything that's going on in that idealized view, you'll see a lot more components come into play, and a lot more complexity.
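Going back to the step where the CI system updates the configuration code: here is a rough, hypothetical sketch of what such a post-CI step could look like — read the config file, swap in the newly built image reference, and commit, so the deployment tooling picks it up. The file path, field format, and commit message are invented for illustration; this is not InfluxData's actual tooling.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"regexp"
)

// Hypothetical post-CI step: once the new container image has been built and
// has passed its tests, rewrite the image reference in the config repository
// and commit, which is what the deployment automation keys off of.
func main() {
	if len(os.Args) != 3 {
		log.Fatalf("usage: %s <config-file> <new-image-ref>", os.Args[0])
	}
	path, newImage := os.Args[1], os.Args[2]

	data, err := os.ReadFile(path)
	if err != nil {
		log.Fatal(err)
	}

	// Assumes the config pins the service image with a line like:
	//   image: 'registry.example.com/gateway@sha256:...'
	re := regexp.MustCompile(`image: '[^']+'`)
	updated := re.ReplaceAll(data, []byte(fmt.Sprintf("image: '%s'", newImage)))

	if err := os.WriteFile(path, updated, 0o644); err != nil {
		log.Fatal(err)
	}

	// Commit and push; the deployment controller watches this repository.
	for _, args := range [][]string{
		{"git", "add", path},
		{"git", "commit", "-m", "ci: bump image to " + newImage},
		{"git", "push"},
	} {
		cmd := exec.Command(args[0], args[1:]...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			log.Fatal(err)
		}
	}
}
```

The point is that it is this automated commit to the config repository — not the application PR itself — that the deployment waves react to.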
So, while you can represent a GitOps workflow in a conceptually simple way, if you really dive into it and pull apart the pieces, you can see it gets rather complex.

Okay, so after we went through all the effort to get to the state I just explained, we started to see some really clear benefits. First, we were just shocked by how much time developers got back, because they were not working on release process. Every day, when code was at a logical stopping point, they made a pull request — and that was it. There was no effort to batch up releases, babysit releases, or finalize a release; there's no notion of pushing a release. Developers can just let it go, and that gave them a lot more time to develop.

Similarly, and maybe counterintuitively for some people, our rate of incidents in production went down dramatically as soon as we started to automate deployments through GitOps. We went and looked at our incident RCAs, and we think there were two main contributors to the decrease. First, there were just many fewer complicated releases: today we deploy probably three to five times on a typical day, maybe more on some days, and the changes going out were much smaller, so their ability to have a big impact was reduced. For most of them it's very obvious which change caused a problem, so they're smaller, easier-to-diagnose problems. We did learn in one case that since a computer program is rolling out the changes, it's possible to make errors on a much bigger scale — but it's also much easier to recover from those errors. And in fact, the reduction in errors caused by humans doing deployments was a really dramatic improvement for our robustness and availability. A computer program doesn't forget to do something; a computer program does things in the order it was told. So those two things — the decrease in time spent on release process and the fewer incidents — immediately added a lot to the quality of life for the developers on our team.

Also, one of our tasks is to add more regions. If a customer wants us to operate in a new region, since everything is automated and all the configuration is code, all we have to do is set up some files to define the region, and our existing tooling automatically rolls out to that region. It's a single-digit-day process from beginning to end. Honestly, the thing that takes us the longest now in rolling out regions is just getting the machines from the cloud service providers.

Another thing: because what's in master — what's in the main branch — is either in production or on its way to production, everything is just so much easier to reason about. If you want to know what's running anywhere, it's obvious. There's no divergence between the different environments, or between a production environment and the code.
So that also has been an improvement to our robustness.

Okay. Now this part is a little more for leaders or managers. I'm imagining you're thinking, wow, I want those unicorns and I want those rainbows — so this is a bit about how to get there from an engineering management point of view. We broke it down, somewhat retrospectively, into required elements versus optional components.

Obviously, you need a CI pipeline to build artifacts and run all the tests. If today your deployment pipeline involves developers building containers themselves and pushing them into production, that's really the first problem to fix, because you're simply not going to be able to rely on automation around a Git repository to make deployment happen.

Again, you need a way to represent your infrastructure as code. For the application itself we use jsonnet — as an aside, we use Terraform for the Kubernetes clusters themselves — and I'll show you a little about why jsonnet is beneficial for us. The combination of jsonnet and kubecfg makes it really easy for us to represent the infrastructure as code in a multi-cloud scenario.

You need a way to deploy Kubernetes configuration: once the configuration changes in Git, you need something that knows how to tell your Kubernetes cluster, make the application look like what's in Git. We use Argo for that.

One thing you really need to do is make sure your deployment pipeline is fast, and this is actually an area where we currently have mixed success. One of the things about deploying to Kubernetes — and I'll talk about this a little later also — is that during deployments there are containers coming and going, and making those deployments smooth is a lot of work. If your services' containers are being killed and the pods are being replaced by Kubernetes during the deployment, there can be availability issues during that time, so the faster it is, the fewer of those problems you might run into. Additionally, think about recovering: for instance, for any problems in our UI tier, we can roll back and within a matter of minutes we're back to the old UI — assuming it was a reasonably uncomplicated deployment — because that deployment pipeline is fast. We have a lot of confidence because it's really easy to recover. There are other areas of the product where the deployments are a bit more complicated. This may affect us more than it affects most people, because we are a database and therefore stateful, and we still have some work to do to make managing state during those deployments faster. But our deployments are fast enough. If your deployments are very slow, you may find that GitOps is a bit problematic.

Another important signifier for us is that you keep all your code in the main branch even if it's not ready for users, so you need some way of feature flagging. We actually started feature flagging without any kind of service — when we started, it was totally manual.
Currently we use the ConfigCat service. A service like ConfigCat — and there are plenty of other ones — offers a lot of value around feature flags, for example letting product managers send specific users to try new features (there's a small sketch of the gating pattern at the end of this part).

And then, most importantly, you must have production metrics and, especially, reliable alerts. If your developers are going to make a pull request into the main branch, consider the job done once it's accepted, and turn their attention elsewhere, they cannot do that with confidence unless they know they will be alerted if there is a problem. Otherwise they will make a pull request and then stare at a screen until they're confident nothing went wrong — and making a screen that gives you that confidence is actually harder than it sounds, because a lot more can go wrong than what you predict. So reliable production metrics and alerts are really one of the core things you need to build in, and I'll talk about how we did that.

Then there are components we believe are optional, and some of them may be surprising. For instance, the automated feedback loop: we actually switched to GitOps and continuous deployment without any automated feedback loop, so there was no way for us to stop a deployment between waves when we first started, but we got so much value just from having automated deployment that it was worth it. Now we do have automated feedback loops and deployment waves, and we are improving that process as we go.

Another optional thing is what we call a pipeline for high-level operations. An example of a high-level operation is running a set of tasks after you do a deployment; we use Argo Workflows for that.

Canary deployments: this is actually a next step for us. There are different opinions in the literature about the value of canary deployments versus other deployment methodologies, but we found that we got so much value from GitOps without them that we deferred implementing them; now we're starting to work on that.

Another optional but really desirable piece — something that added a lot of value for us — is a concept we call external availability validation. We asked: what is the user's view of availability? We get all these metrics from inside the system, and we can see how many users are getting 503s or 400s and that kind of thing. But what we found really valuable is actually running some processes from outside that just use the API and measure how successful they are. I'll talk about that in a bit, but it was a really useful rallying point for the team.

And finally, I already touched on the concept of testing or acceptance environments. We consider it optional; a lot of people in the literature consider it an absolute anti-pattern, so I won't take a stance on that, but I always find it an interesting conversation.
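On the feature-flag requirement from a moment ago: below is a minimal sketch of the gating pattern, assuming a generic flag client rather than any particular vendor's SDK. The flag name, interface, and query functions are made up; the point is only that unfinished code can be merged to main and deployed while staying dark until the flag is flipped for selected users.

```go
package main

import "fmt"

// FlagClient stands in for whatever feature-flag service is in use
// (ConfigCat, a homegrown config, etc.); only the pattern matters here.
type FlagClient interface {
	BoolFlag(key string, defaultValue bool) bool
}

// staticFlags is a trivial in-memory implementation for the sketch.
type staticFlags map[string]bool

func (s staticFlags) BoolFlag(key string, def bool) bool {
	if v, ok := s[key]; ok {
		return v
	}
	return def
}

// handleQuery ships the new code path dark: it is merged and deployed,
// but only exercised once the flag is flipped for selected users.
func handleQuery(flags FlagClient, q string) string {
	if flags.BoolFlag("use-new-query-engine", false) {
		return runNewEngine(q)
	}
	return runOldEngine(q)
}

func runOldEngine(q string) string { return "old:" + q }
func runNewEngine(q string) string { return "new:" + q }

func main() {
	flags := staticFlags{"use-new-query-engine": false}
	fmt.Println(handleQuery(flags, "select count(*)"))
}
```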
Okay, so I wanted to touch a little on jsonnet and why we like it. Jsonnet is a superset of JSON, and it's designed to let you overlay configurations on top of other configurations. That is very, very useful for us, because we have a base configuration for our application and then we can overlay specializations for the different cloud environments, or even the different regions — if one region is utilized much more, we may specialize its configuration. It allows us to maintain all of that configuration as code, but in a sane way.

Roughly how that works: this is a contrived example of a base configuration, and the main thing to look at is that it describes replicas and says one replica, and so on. Then we say, okay, that's fine for the base configuration, but we need to do something different in AWS: in AWS we need four replicas, and furthermore we need to add some resource requests and limits. This tool called kubecfg can then take those two things and output YAML. You can see in this YAML — this would be the YAML for AWS prod — that it has all of the base content plus the overlaid content, and in this way we keep our sanity while maintaining all these different regions. What happens is that kubecfg outputs the YAML, we automatically check that generated YAML back into Git, and that change in Git is what tells Argo, hey, there's been a change to the configuration, let me roll it out to production (a reconstructed sketch of a base file and an overlay appears at the end of this part). There are other ways of doing this — there are other projects — we've just found this works really well for us, conceptually as well as in practice.

Okay, now I want to touch on something that may not apply to everybody, but for us what we call super-smooth deployments turned out to be super important. Our users have very, very high expectations for the availability of their API calls. As soon as it falls below five nines, they feel it — "I got an error on my API call this week" — and to them that's too much. But, as I mentioned before, deployments themselves cause pods to restart, or to get killed and be replaced, so frequent deployments cause frequent pod restarts, and that can cause users to experience poor availability, or at least perceived poor availability from their point of view. Imagine a request comes in through the gateway, goes to the compute tier, goes to the storage tier, and then all the way back out to the user — pods can be restarted anywhere in that chain during that time, and any one failure during a deployment can cause the user to get back "unavailable." So we put a lot of effort into our custom controllers. I asked a lot of trusted advisors, people who have been in the Kubernetes project from the beginning; some did not answer me, and some answered with just "yeah, that's really hard," but we did get some really good advice. The upshot is that we use custom controllers, and our custom controller code ensures proper ordering of everything.
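Here is that reconstructed jsonnet sketch — contrived, like the slide: a base file declaring one replica, and an AWS overlay that bumps the replica count and adds resource requests and limits. The file names, service name, and values are illustrative, not the real configuration.

```jsonnet
// base.libsonnet -- shared base definition of a service (contrived)
{
  gateway: {
    replicas: 1,
    image: 'registry.example.com/gateway:latest',
  },
}
```

```jsonnet
// aws-prod.jsonnet -- overlay that specializes the base for AWS production
local base = import 'base.libsonnet';

base {
  gateway+: {
    // AWS gets more replicas, plus resource requests and limits
    replicas: 4,
    resources: {
      requests: { cpu: '500m', memory: '512Mi' },
      limits: { cpu: '2', memory: '2Gi' },
    },
  },
}
```

Evaluating the overlay (with `jsonnet` directly, or via kubecfg for real Kubernetes objects) produces the merged output — one replica becomes four and the limits appear — and it is that generated output that gets checked back into Git for Argo to pick up.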
So we had to invest a lot in that custom controller code to really make sure that during the deployment process, the load balancer, the gateway, the compute tier — all of it — was shut down and brought back with connections drained, in exactly the right order and in the right amount of time, before Istio gave up and just shut everything down by itself. So there's a lot of custom code there.

One of the main things was what I call "be one with the retries." If a pod gets restarted by a deployment, it will return something in the 500 range, like a 503, which means the server was unavailable for that call — please try again. So we do have a lot of code to do retries between the tiers when we get that (there's a small sketch of the retry idea below). We have an extra challenge, though, because if you're working with a database, many of the requests can have side effects, and you can't just blindly retry something that has side effects. So there are many cases where we can't just retry as the solution, and that's where custom controller code comes in. Now, most people I talk to who are using Kubernetes simply do not have these high requirements for availability, so most of my friends who are using Kubernetes don't really relate to this level of pressure to keep deployments smooth.

Okay, moving on. Now I want to talk a bit about how we get metrics. We gather all of the metrics for our Kubernetes clusters with Telegraf. Telegraf is an agent for collecting metrics; it has many, many input plugins and a number of output plugins, the main output plugin of course being InfluxDB. Telegraf is an open source project. It's designed to be run as an agent, as a server, on demand, or with long-lasting connections — it's a very useful piece of software. Just for transparency: InfluxData, the company, is the primary maintainer and it lives within our repositories, but there's quite a big community around Telegraf, and there are output plugins for many other targets — Azure, AWS, and a bunch of other companies have made plugins for it.

We like it because it's written entirely in Go, so we get a single binary, which makes it easy for us to drop in as a sidecar into our pods. So we have Telegraf running as a sidecar, streaming metrics back, and we get a lot of metrics — a very large amount. One thing we get is application-specific metrics: the different applications we use off the shelf have their own metrics they provide. But we also write a lot of custom metrics: our services write metrics, we send them to Telegraf, and Telegraf sends them back to our monitoring instance. And then there's Istio and the other services that are running; those are also all providing metrics to Telegraf. We have it in different places in the cluster, just streaming metrics back.

So how do you use metrics? There are a few areas where it's important to use metrics in a GitOps scenario. The first is that you just need to get alerted on urgent issues. As I said, you want your developers to make a pull request, have the pull request succeed, and then be able to turn their attention elsewhere, but get alerted in the unlikely case that something has gone wrong.
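To make the "be one with the retries" point concrete, here is a minimal sketch, in Go, of the kind of retry-on-503 helper that sits between tiers — retrying only when the request is known to be idempotent, since a write with side effects can't be blindly replayed. The function and endpoint are hypothetical, not the actual client code.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// doWithRetry retries a request that came back 503 (typically a pod being
// replaced during a rolling deploy), but only if the caller says the request
// is idempotent. Assumes a body-less request (e.g. GET) so it can be re-sent.
func doWithRetry(client *http.Client, req *http.Request, idempotent bool, attempts int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Do(req)
		if err == nil && resp.StatusCode != http.StatusServiceUnavailable {
			return resp, nil // success, or an error that retrying won't fix
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("got %d", resp.StatusCode)
		}
		if !idempotent {
			return nil, fmt.Errorf("not retrying non-idempotent request: %w", lastErr)
		}
		time.Sleep(time.Duration(i+1) * 200 * time.Millisecond) // simple backoff
	}
	return nil, lastErr
}

func main() {
	// Hypothetical read-only call; safe to retry.
	req, _ := http.NewRequest(http.MethodGet, "https://example.com/api/v2/health", nil)
	resp, err := doWithRetry(http.DefaultClient, req, true, 3)
	if err != nil {
		fmt.Println("request failed after retries:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```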
So alerts on urgent issues — that's the number one focus, and we found it was the most important work we did. Second most important, I would say, is alerting on leading indicators. We found, for example, that if certain queues grow in length, or if the latency between certain services starts to grow, that's an indicator of an upcoming problem, so we have alerts that tell us it looks like something is about to go wrong. Both of these are very dynamic, because when you get these alerts you go in and solve the core problems, but then you need alerts for the new potential sources of problems that come up.

We also have dashboards for overall system health, dashboards for troubleshooting if there is a problem to look at, and dashboards for very specific situations — if we're rolling out a feature flag, we may want a dashboard scoped to that, to make sure that as we send traffic to it there's no increase in errors, that it's performing as well as the old system, that kind of thing. Really, though, we have found it's very important to focus on alerts rather than dashboards. Dashboards lead to an anti-pattern I mentioned before — humans staring at screens — and that's not what you want. You really want to be able to trust that the system will let you know if there's something to pay attention to. Nonetheless, I have some alerts and dashboards to run through quickly.

Some of the main alerts we care about are Kubernetes health — it's nice to know when the cloud service provider starts upgrading the worker nodes for us — and we get alerted on all kinds of application health metrics. Deadman alerts are really important: we get alerted if we stop getting a signal. It lets us know, hey, you have not gotten a 400 error from this gateway in some period of time, and that seems very suspicious — it implies something mysterious is going on if we don't hear anything from a service. And I also mentioned the externally run availability tests: if an API call fails three times in a row on any one instance, we get paged, so we know within about three minutes (there's a small sketch of that prober idea below).

Okay, these are some graphs I just want to touch on quickly — some of the dashboards we use. I just mentioned external availability. This shows that we've had a moderate number of deploys this week — this is real data, by the way — and during those deploys we had no issues; we didn't drop any API calls. But we did, well away from the deploys, see two failed calls over the course of a week, and since we make the calls every minute, that brings us below our SLA here, which is why it's red. So this is a good indication of system health. Overall we're pretty healthy, but we do want to look at US Central to see what happened there. You can ignore this over here — this is just for our own internal masochism. These are the availability metrics that are really important.
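Here is a small sketch of that external availability validation idea: an independent process outside the cluster calls the public API once a minute and pages someone after three consecutive failures. The endpoint and the paging hook are placeholders; the real checks exercise actual API operations and feed the dashboards described here.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical endpoint; the real probes hit actual API operations.
	const endpoint = "https://example-region.example.com/api/v2/health"
	consecutiveFailures := 0

	for range time.Tick(time.Minute) {
		if probe(endpoint) {
			consecutiveFailures = 0
			continue
		}
		consecutiveFailures++
		fmt.Printf("probe failed (%d in a row)\n", consecutiveFailures)
		if consecutiveFailures >= 3 {
			page("external availability check failing for " + endpoint)
			consecutiveFailures = 0
		}
	}
}

// probe counts any transport error or 5xx as a failure.
func probe(url string) bool {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode < 500
}

// page stands in for whatever paging/alerting integration is actually used.
func page(msg string) {
	fmt.Println("PAGE:", msg)
}
```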
We do check every week — we have a recurring meeting where we go in and review our system health metrics and make sure they're where we expect them to be. Here's an example: we can see that over the last hour in US West there were some read availability problems, and we probably want to take a look. My guess is that we actually got alerted on this — this probably fired an alert: hey, that's a leading indicator, there might be something wrong in US West, because of the lower-than-expected availability there.

This is a debugging dashboard. If there is an event, we'll often open this up. Here we're looking at US Central over the last hour, and it doesn't look like there are any problems: none of the query pods are crashing, availability looks fine everywhere, and all the lines are flat, which means there's nothing rogue going on — no pods have been restarted in there. This is the same dashboard for the storage tier below, and we've just learned what shapes healthy and unhealthy systems have, so this all says our storage tier is actually really healthy. This is really handy if there's a problem: we go in and look for anything obviously wrong to help steer our attention to the right place. And these are actual alerts — you can see query request duration: if users are making queries and those durations start to get too high, we get alerts.

And yeah, I just wanted to close with a quick outro on some of the challenges we faced as leaders bringing the team toward this GitOps model. The first was really a change of mindset for engineers, from releases to continuous delivery. This was a bit deep for us, because of everywhere I've worked, I've never worked anywhere with smarter engineers than InfluxData, and a lot of the engineers have been working on databases for many, many years. But releasing a database that customers install on their own infrastructure is very different from running a SaaS service. There was a period of time when it was hard for people to let go of the notion that every two weeks you do a release, even though we were SaaS, and that just took practice — it was really learning by doing.

Related to that was changing the mindset to think about how to roll out breaking changes incrementally. If I'm going to break the contract between internal APIs, how do I roll that out? If you're doing one monolithic release, you just rebuild the whole binary — the internal APIs change, everything gets rebuilt together, no problem. But that doesn't work in a microservices architecture. So we coached people through that.

And then there's also the fear of breaking production. The message there was: production will break. It's not about never breaking production; it's about knowing that you broke production, knowing how to recover from breaking production, and having practices that lead to fewer incidents — but not being paralyzed by that fear, which causes you to batch up changes into one big release all at once, which actually increases, not decreases, your chance of breaking production.

Okay, so next, we want to get some metrics back into the automated feedback loops, as we discussed.
We've also found that our feedback loop per deploy runs after the deploy, when we check whether the system is healthy. We need to check during the deploy whether it's staying healthy too. What we want is: if availability dips during the deploy, don't deploy to the next wave automatically — rather than only checking that it's healthy after the deploy and then going ahead. We are also now in the process of putting GitOps itself under the same metrics and alerting: how long does it take to do a deploy, how many deploys per developer, how long does it take to get a release out. Then we're starting to work on game days, because we have reduced our level of incidents so much that I'm a bit worried we're getting out of practice at handling them, so we want to make sure we stay in practice. And finally, as I mentioned, we're going to add canary deploys.

Okay, I think I have a few minutes for questions. I'm not sure how the process goes from here.

Well, thank you for a wonderful presentation. We have about five minutes or so for questions, so if you have any last-minute questions, please feel free to drop them into the Q&A box. We have one here right now: Can you please give an example of the exporters you have for the custom metrics? Also, can these exporters work with Alertmanager and stream alerts to systems like VictorOps?

I'm really sorry — I don't have enough context for what they're asking to answer that well. I don't know if they're asking about Telegraf itself or the code that we wrote. My guess is that if they're asking whether Telegraf supports services like VictorOps and such, I suspect that it does, but I don't know for a fact; it would just be a matter of configuring the output plugin. Please follow up with me if I completely whiffed your question and I'll track down the answer for you for sure.

Okay, do we have anyone else with questions? Why do you need to add support for canary if Argo already supports it by default?

Yeah, that's a good question. That's really related to our current state of using Istio. Because we're a stateful service — our storage tier is stateful — there's just some more work we have to do to express those as services that Istio understands, so that we can use things like Argo or whatever other Kubernetes-native tools for that.

Okay, there's another question: I'm not sure what you mean by needing to check the health during deploy to ensure five nines availability. Don't rolling upgrade strategies in Kubernetes take care of that?

Yeah, so we have not found that to be the case at the level of smoothness that we need. We have found that during deploys, frequently, somewhere in the chain — either Istio, or some other piece, or Kubernetes itself — if we're not very careful, a pod gets restarted while we're servicing a request. Some of the requests can take up to two minutes, because we can be crunching a lot of data. So we can, in fact, dip below five nines. But we've also had much more profound problems where, during a deploy, we just messed it up. At one point our code disconnected a service before we created the new service, and then we did a bunch of deployment work in between, so during that time we just couldn't service requests.
Kubernetes was doing exactly what we told it to. But since our tests after the deploy said everything was fine, we went ahead and rolled that out to our internal production environment, and so we had a lack of availability during that deploy. If we had had the ability to say, hey, there's a lack of availability during the deploy, then stop it — that's what I mean. So I hope that answers the question; I'm super happy to talk more about it.

Oh — thank you for your presentation and answers. I understand that deployment ordering is critical for stateful applications and their components. You mentioned custom controllers for your application — is that the right path forward to ensure dependency and ordering for application components? Can you elaborate on that? Is it a custom operator you have written that runs in the cluster, or do you have Argo Workflows taking care of the ordering?

So, in the case I'm mentioning — getting to that super-smooth deploy — it's absolutely a custom controller that we wrote, running in the cluster. We actually have a set of controllers that we wrote, at different levels. We just found we had to get very fine-grained control over the different parts, and there's just no off-the-shelf solution that worked up to the level of requirements we had. Again, most people don't have those requirements — requirements that are through the roof, where a single dropped API call, a single 503, is something the users are averse to. So yeah, absolutely, we had to write custom code, and we continue to.

Okay, well, that just about wraps up all the time we have for today. I want to thank Rick for a wonderful presentation. As I said before, the recording and slides will be available later today on the CNCF webinar page at cncf.io/webinars. Thank you all for attending today's webinar, take care, stay safe, and we will see you at the next CNCF webinar. Bye, everybody.