Hello everyone, thank you all so much for being willing to stay late on a Friday and talk about deployment practices. So by a show of hands, who here has had a bug make it to production and cause an incident? Okay, yeah, pretty much everyone. Now keep your hand raised if, following that incident, you thought, hmm, that took a little longer to fix than it should have. Okay, yeah, me too.

So I want to tell you about one specific time that ended up sending us down a multi-year journey to improve our deployment practices. This was a few years ago. We were deploying a new version to cloud, and we had spent the morning manually building it and the afternoon manually deploying it, and then, bam, alerts start firing and escalations are coming in. It turns out the new feature we had deployed had a problem with a cascading effect on the system. So we spent the evening rolling back the new version. The next morning the developers come in, they start working on a fix, and we think we have one. So again we spend the morning manually building it and the afternoon manually rolling it out, and then, bam, more alerts start firing and escalations are coming in. It turns out the fix worked slightly differently locally than it did in cloud. So again we manually roll it back. At this point our users are understandably getting frustrated, because that release contained bug fixes that some of them had been waiting for. The next morning the developers come in, we figure out what the difference was between local and cloud, patch it, ship it to cloud, and finally the incident is over. But we still had a multi-year journey ahead of us to improve our deployment practices. So we're here to tell you about that journey, in the hope that you can get better sleep while on call and your users can experience a smoother release process.

So hi, I'm Stephanie Hinchin, and I'm a senior software engineer at Grafana Labs. Hi, I'm Michael Mandras, and I'm also a senior software engineer at Grafana Labs. And this is our guide to sane and safe cloud deployments.

So let's look back at where we were at the beginning of that incident. At the time, our priority was on-prem releases. We were building our on-prem releases and then shipping those manually to cloud. Those manual shipments to cloud were pretty tedious, so we only did them monthly. And since we only did them monthly, with 100-plus engineers working on the software, those were pretty large changes that included both features and bug fixes. So when we had a problem, we had to roll back the new features along with all of the bug fixes that had been shipped.

Luckily, that's not where we are today. Today we have automated deployments that go to cloud first. That benefits our cloud users, because they're now treated as first-class citizens in the whole release process, but it also benefits our on-prem users, because they're now getting releases that have been proven in cloud over that month. We also have flexible deployment schedules in cloud, ranging from hourly for internal instances all the way to monthly. And we have feature releases separated from bug fixes, so we're not rolling back bug fixes as often.

Now, this of course did not happen overnight. This was a three-year journey for us. We started out the first year decoupling feature and bug fix releases.
Then we spent an entire year doing a lot of developer enablement, and then followed it up with a lot of automation for our cloud rollouts.

So the first thing we'll talk about is separating features from bug fixes. As Stephanie was just talking about, bundling them is bad for developers and bad for users, so we had to come up with a way to separate them. What we ended up going with was feature toggles. You might have also heard them called feature flags, but they're essentially just Boolean gates that activate or deactivate certain code paths. This is not something we came up with, but at the time, in 2019, we did start building it from the ground up, and it has evolved over the years. After a lot of iterating, getting feedback from developers, and dealing with incidents, we finally figured out a feature toggle definition that works well for us.

First of all, we have a name and a description: we want it to be very easy to figure out what a feature toggle is doing. Then we have the feature toggle stage. These range from experimental, which is just an early experiment, we're trying something out and it may never see the light of day; to private and public preview, which are different groups of users starting to use the feature, you can think of these as alpha and beta if you want; to generally available, which means everyone can have the toggle and the risk is minimal. And finally, deprecated means we decided not to go forward with the toggle and we're removing it. We also have the feature toggle owner. This was something we didn't have initially and added later on, but it has been really important, as it makes it really easy to track down the right team if you have questions or issues. So we definitely recommend adding that.

Once you have a feature toggle defined, you'll want to start calling it from within your code. We have a couple of examples here, using them on the back end and the front end, but as I said before, these just activate or deactivate certain code paths based on a Boolean condition. In 2019, when we started building this, there weren't really too many options for this sort of thing, but nowadays there's a CNCF project called OpenFeature. We've been thinking about trying it for a while, but we just haven't had the time yet. So if you do start to explore feature toggles, we definitely recommend looking at OpenFeature, and if you have any lessons learned from it, we'd love to hear from you.

All right. So now we have feature toggles defined and being used in the code. The next thing we need to do is define a way to roll these feature toggles out to different cloud environments. We ended up building something fairly specific to Grafana's internal SaaS infrastructure, but essentially you just add, at a minimum, the name of the feature toggle and then the minimum version at which that feature toggle can be rolled out. We've added additional filtering mechanisms on top of this as needed, so you can roll out to different customer groups if you want, or to different percentages of users. This lets you gradually roll your toggles out over time and do a little bit of sanity checking before increasing the rollout to more users. Our one recommendation here, well, we have a lot of recommendations, but one crucial thing: if you are using a percentage-based rollout, make sure you're randomizing which users get those changes each time. You don't want the same 10% of users testing all of your feature toggles, so make sure you vary it.
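To make that concrete, here's a minimal sketch of what a toggle registry entry and an enablement check like that could look like. This is illustrative Go, not Grafana's actual framework: the type names, the owner field, and the hashing scheme are all assumptions, but it shows the name/description/stage/owner definition and the per-toggle randomization we just described.

```go
package featuretoggle

import (
	"crypto/sha256"
	"encoding/binary"
)

// Stage mirrors the lifecycle stages described above.
type Stage string

const (
	StageExperimental   Stage = "experimental"
	StagePrivatePreview Stage = "privatePreview"
	StagePublicPreview  Stage = "publicPreview"
	StageGA             Stage = "GA"
	StageDeprecated     Stage = "deprecated"
)

// Toggle is one entry in the toggle registry: a name, a human-readable
// description, a lifecycle stage, and the team that owns it.
type Toggle struct {
	Name        string
	Description string
	Stage       Stage
	Owner       string // owning team, so questions and incidents can be routed quickly

	// RolloutPercent is the share of users (0-100) who should see the
	// feature while it is being rolled out gradually.
	RolloutPercent uint32
}

// EnabledFor decides whether this toggle is on for a given user. Hashing
// the toggle name together with the user ID means every toggle samples a
// different slice of users, so the same 10% of users are not testing
// every single feature.
func (t Toggle) EnabledFor(userID string) bool {
	if t.Stage == StageDeprecated {
		return false
	}
	if t.RolloutPercent >= 100 {
		return true
	}
	sum := sha256.Sum256([]byte(t.Name + ":" + userID))
	return binary.BigEndian.Uint32(sum[:4])%100 < t.RolloutPercent
}
```

In the code itself, a check like toggle.EnabledFor(userID) then gates the new code path on the back end, with an equivalent check on the front end.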
All right. So once you have your feature toggles defined, you want to start thinking about what your rollout practices will look like. We recommend a progressive rollout, starting from internal instances and eventually going to production instances. We start by rolling our feature toggles out to our dev instances, which are internal instances that people at Grafana can spin up and test with. If those look good, we roll out to our staging environment. This is also an internal-only environment, but it hosts our most heavily used internal instance: it's what we use for many of our operations, all of our on-call and incident tracking, and monitoring our infrastructure. A lot of people use this instance, so if there's something wrong with your feature toggle, there's a good chance it's going to be caught in staging. Then, if that all looks good, after a period of time we roll out to canary, which is a small percentage of end users, and eventually, if that looks good, to production, which is the rest of our end users. I'll show a rough sketch of what that progression can look like at the end of this part.

Lastly, you'll want to think about when your developers should actually use feature toggles. This is actually the hardest question to answer. We still haven't really come up with a strict set of guidelines for it; it's more about educating developers, making sure they can make the right decisions, and providing them with enough data to base their decisions on previous use cases. One important thing to think about is too many feature toggles versus too few. Using feature toggles for everything sounds great, but if you start to have too many, you can get into some spaghetti-code situations. Also, a lot of feature toggles may interact with each other in ways you don't expect, so you don't want to go completely crazy with them. On the other hand, you don't want too few, or you just get back to the scenario we had before, where too many features were bundled with bug fixes.

The other challenging thing is defining what a feature actually is. For example, a bug fix that is very visible to users and noticeably causing an issue is most likely not going to go behind a feature toggle; it's just a bug fix, roll it out. A big new feature with a very large, visible user impact most likely should be behind a feature toggle. But then there's this whole other category of efficiency improvements and refactors, and that just comes down to the use case again. For example, if you're doing a refactor and you find yourself referencing the feature toggle over and over, maybe that's an indication you're starting to head toward spaghetti code and you might want to consider a different approach. Our main recommendation here is just to be flexible and make guidelines, not rules. You don't want to frustrate the people who are supposed to be using this, so make sure you're constantly working with developers so they understand what feature toggles are for and when they should, or maybe shouldn't, use them.

All right, so a couple more lessons, just to save a little bit of time for everyone. We already talked about feature toggle ownership: it's very important to know which team is responsible for a feature toggle, so we recommend adding that in early on.
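Here's that sketch of the progressive rollout. It's illustrative only: the environment order comes from what we just described, but the soak times, percentages, version number, and type names are assumptions, not our actual configuration.

```go
package rollout

import "time"

// Environment is one step in the progressive rollout: where the toggle is
// enabled, for what share of users, and how long it should look healthy
// before moving on.
type Environment struct {
	Name     string        // "dev", "staging", "canary", "prod"
	Percent  uint32        // share of users in this environment that get the toggle
	SoakTime time.Duration // minimum time it must look healthy before promotion
}

// Rollout ties a feature toggle to the minimum software version it needs
// and the ordered environments it is promoted through.
type Rollout struct {
	ToggleName string
	MinVersion string // only enabled on instances running at least this version
	Stages     []Environment
}

// Example progression: internal dev instances first, then the heavily used
// internal staging instance, then a small canary slice of end users, and
// finally the rest of production. All numbers here are made up.
var kubeconDemoRollout = Rollout{
	ToggleName: "kubeconDemo",
	MinVersion: "10.0.0", // hypothetical version
	Stages: []Environment{
		{Name: "dev", Percent: 100, SoakTime: 24 * time.Hour},
		{Name: "staging", Percent: 100, SoakTime: 3 * 24 * time.Hour},
		{Name: "canary", Percent: 5, SoakTime: 3 * 24 * time.Hour},
		{Name: "prod", Percent: 100, SoakTime: 0},
	},
}
```

A registry entry from the earlier sketch would then only report the toggle as enabled on an instance whose environment, version, and percentage bucket all match.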
Feature toggle lifecycle is another one that we've only started thinking about more recently, as our feature toggle registry has started to grow. You want to start thinking early on about when feature toggles should actually be removed. For example, if something has been GA for a year and it's been enabled for all users, then at that point it probably shouldn't be a feature toggle anymore; it should just be the main code path. So thinking about lifecycle, and setting up some sort of automated reminders or recurring meetings to revisit feature toggles, is something we would recommend. And then finally, chaos testing. As I mentioned, a lot of these feature toggles can interact with each other in unexpected ways, and it's very hard to catch every issue before things make it out to production, so we would also recommend setting up some sort of randomized feature toggle rollouts and testing that way.

So in summary: decoupling features from bug fixes is better for developers, for users, and for your releases. You want to make your feature toggle framework convenient and easy for developers to use, and make sure you're making guidelines, not rules, and taking developer needs into account as you go.

Cool. So once you have the feature toggle release framework done, the next step is to work on empowering developers to feel comfortable using that tool, but also to feel confident deploying to cloud. So in this chapter we spent a lot of time on developer enablement, and what that looked like was mainly knowledge sharing. We needed to share why we needed this change and how to do it, and also make sure the knowledge sharing was bi-directional and we were getting feedback.

For the why: by asking developers to take on more of the responsibility of deploying their features to cloud, we were putting additional load on them, so it was really important that they understood the reasoning behind it and what they themselves had to gain, as well as the users. We boiled it down to three key reasons. The first was happier users: since we were no longer shipping features and bug fixes together, and then having to roll back bug fixes whenever a feature we had rolled out had a problem, our users were experiencing a much more stable environment and were therefore a lot happier. Additionally, developers were dealing with far fewer urgent feature fixes, because now if we had an incident with a feature, we could just turn off the feature toggle rather than having to roll back the entire version and leave all of those bug fixes waiting on their fix. Lastly, developers got quicker feedback loops. Previously they had to wait for the next scheduled release to change or release their features, but now they could release a feature behind a feature toggle in cloud, and even have it go GA in cloud, before it ever got into an on-prem release. Once we had boiled down our main reasons why, we kept reiterating them so that we were all working toward the same goals and keeping them in mind. Your whys may differ; the key thing is to boil them down to a few core reasons, then share them and be transparent.

Next was the how, and this was not just how to use the feature release framework we had built; it was also how do I feel confident deploying to cloud, particularly for developers who hadn't gotten the chance to do that before.
So the first thing was to make sure they had a good production-like environment to test in. At first we just wrote a guide: here's how you can do this. But pretty quickly we needed to change that and add frequent main builds, so that they could test their newest changes. Then we eventually decided to build on that even more and create what we call ephemeral instances. What you do here is comment on a PR, and that triggers a GitHub Action in the background that builds the new code and then creates a cloud-like environment where they can test it before it ever gets merged into main.

So this is a great way for developers to test their features and make sure they're working as expected in cloud. But beyond that, it's hard to know what impact a feature is having on cloud and on the software as a whole, so what we ended up building is this feature readiness review: a dashboard that walks through different steps of things they can look at. The first step is just that guide we saw before: here's how you enable it on your instance, test it, and make sure your feature is working. The next step is how to look at the software as a whole: are there crash loops happening on the pod itself, are we seeing any pod error logs, as well as API latency, database query latency, database errors, or API errors. Once they see that their feature isn't impacting any of that, the next step is to look at resource utilization and how that will impact cloud as a whole.

Here you can see the vertical blue line, which is where I've enabled a new feature toggle called kubeconDemo. At the top you can see that memory spikes up a little bit afterwards but evens out, and that correlates to no increase in cloud cost. However, CPU starts spiking up quite a bit afterwards, and that ends up correlating to a difference of $187 in cloud cost. So they can use this and say, okay, this is not as efficient on CPU as it could be, let me see if I can make some changes and decrease that number.

Now I'm going to go through the PromQL query that's behind the $187, because it's a little complex and I want to save you some time. This is the pseudocode for it. The first section is just looking at the difference in memory or CPU utilization attributable to your feature. Then you multiply that by the cost per unit, meaning the cost per gig or the cost per core, and then you multiply that by the scale, so the number of replicas you're running across however many clusters.

Diving into that: you take the average of container memory usage bytes, unless the feature toggle info is zero, in other words unless the feature is off. The feature toggle info is a specific metric we added to our software, so you'll need to add something like that too, to emit whether a feature is on or off. Then you subtract the same average taken unless the feature is on, and that gives you the difference. Then you multiply it by the cost per unit, which for us is the cost of the namespace divided by the amount available in our node pool. One thing to note here is that we run a dedicated node pool for our workload, and our workload is isolated to one namespace, so that works for us. If that's not the case for you, you can get an approximation by dropping the namespace and node labels to get an average of what a gig costs in your cluster. Then you multiply it by the number of replicas across all of your clusters. So all together, this is roughly what it looks like.
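The actual query lives in our dashboards, so what follows is only an approximate reconstruction of the memory version in pseudo-PromQL. The structure follows the walkthrough above, but the metric and label names (grafana_feature_toggles_info, the cost and node-pool metrics, the kubeconDemo toggle) and the unit handling are assumptions you would need to adapt to your own setup.

```promql
# 1) Utilization difference attributable to the feature:
#    average memory on pods where the toggle is on,
#    minus average memory on pods where it is off.
(
    avg(
      container_memory_usage_bytes{namespace="grafana"}
        unless on (pod) grafana_feature_toggles_info{name="kubeconDemo"} == 0   # "unless the feature is off"
    )
  -
    avg(
      container_memory_usage_bytes{namespace="grafana"}
        unless on (pod) grafana_feature_toggles_info{name="kubeconDemo"} == 1   # "unless the feature is on"
    )
)
# 2) ... times the cost per unit: the monthly cost of the namespace divided by
#    the memory available in the node pool it runs on (assumed metric names;
#    keep an eye on bytes versus GiB so the units line up).
*
(
    sum(cost_per_month_actual{namespace="grafana"})
  /
    sum(kube_node_status_allocatable{resource="memory", nodepool="grafana"})
)
# 3) ... times the scale: how many replicas are running across all clusters.
*
sum(kube_deployment_status_replicas{deployment="grafana"})
```

The CPU version has the same shape, just swapping in CPU usage and the cost per core.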
The cost-per-month-actual metric, if you don't have something like that already: our colleagues gave a talk at KubeCon North America on how they implemented that metric, and at the end of the presentation we'll have a QR code that goes to their YouTube video if you want to see it.

So then, finally, they're ready to deploy, and the last step we have is this production readiness review. This is just a quick final check of yes-or-no questions that are either cloud specific or company specific. Grafana is an observability company, so one of the questions we ask is: does your feature emit metrics? Yours will vary based on what's important to your company, or on things you've run into in cloud before, so customize it to fit that. And then the developers are ready to release and to use the feature toggle framework.

Now, since this was an internal framework, we tried to supply as many ways of learning about it as possible, so whatever's easiest for people, they can learn that way: we had a course, walkthrough guides, presentations with Q&A, all of those things. And although this is at the end, it should really happen throughout: you should be soliciting feedback and asking developers what they're uncomfortable with. If they haven't had experience deploying to cloud before, what makes them nervous, and how can you help them see that their feature is ready for cloud and feel confident deploying it? One mistake we made along the way is that we assumed at first this would mainly be used by back-end engineers, so we talked mainly to the back-end engineers. But it ended up having a lot of use cases on the front end, and then we needed to go back, rework things, and help them too. So try to talk to as many people and as many roles as you can, and figure out what their experience is and what they're nervous about.

So, the lessons we learned in this chapter. First, start with the why: when you're asking developers, or anyone, to make changes, naturally we want to know why we're making those changes so that we can get behind them. Start with the why and distill it into a few core reasons. Then make it easy to test and to learn about cloud; we strongly believe that convenience drives adoption, so make it as convenient as possible. And finally, make sure the knowledge sharing is going both ways the entire time.

Okay, so at this point we have feature toggles and we have less risky deployments, so we can finally start deploying more often. This was a goal of ours for a while, but we weren't really able to do it because of the state of our releases before. It might sound counterintuitive, because releasing more often seems riskier, but at this point we had actually minimized that risk, since our releases are a lot smaller. The technique we use for our rollouts is something we call rolling release channels. This was inspired by the Google Kubernetes Engine release channels, but the concept is that you just have different channels that receive software updates at different intervals, and the channels on the slower end of the spectrum only get releases that have already been validated on the earlier channels.
So in this example here, we have four channels: one we call instant, running on a weekly cadence; one called fast, running every other week; one called steady, running every three weeks; and then slow, at every four weeks. You can see in the top row that instant is deployed for a week, and then, because it was okay for a week, it goes on to fast, and so on, until eventually we get to slow. So by the time changes make it to the slow channel, they've been running in cloud and being validated for four weeks.

We have quality gates along the way that we use to detect bad builds, and we also have a way to manually mark builds as bad. The way this works is that we have automated tests that run in each channel: we have a couple of test instances, and we use the Grafana tool k6 to do load testing and run a test suite against each of those instances. If we catch any issues, we remove that build from the pipeline immediately. You can see that in v3 on instant we detect an issue, so we pull it completely, and it's v2 that gets promoted to fast instead of v3. And beyond the automated quality tests running with k6, if anyone's testing and something managed to slip through the cracks and we have an issue, we can immediately mark that build as bad and completely remove it from the build pipeline.

So once you have your release channels, you want to think about your user distribution across those channels, and we recommend something like this; we again drew inspiration from Google here. First, our instant channel is all internal instances, so pretty much as soon as a developer commits a change, they can start testing it on a Grafana instance that's on the instant channel. From there, once we start rolling out to users, we place users in different categories depending on their needs. Users on the fast channel might be more interested in getting new features sooner and testing them out, and they're okay accepting a good amount of risk, whereas the slow end is more reserved for enterprises who can't really afford that risk and just want the most stable, proven features, with a bunch of people in the middle. So we follow the bell curve, and we recommend the same.

And then finally, you want to add observability into this. It's developers who are going to be using this and caring about it, so we want to make it easy for them to figure out what's going on with the release channels at any given time. Given that we are Grafana, we did this through a dashboard: we have a big dashboard living on that big internal instance I talked about earlier that lets you see the state of each release channel and which commits are deployed where, and we have other tools that let developers figure out when their commits will make it to certain channels.

So, summarizing the lessons learned here: if you've completed chapters one and two in our little story, it will make deploying less risky. Internal usage, dogfooding, is very important. Grafana has a strong culture of dogfooding; not my favorite term, but it essentially means you're testing out your own software extensively. And if you take anything from this talk, it's to work with your developers. They're building the thing, so make sure you're soliciting feedback from them and factoring in their needs, not just your own.
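To make the promotion and quality-gate idea concrete, here's a minimal sketch of the channel logic. The channel names and cadences come from the example above; the types, the soak rule, and the promotion function are our own simplification for illustration, not the actual pipeline.

```go
package channels

import "time"

// Channel is one rolling release channel: everything on it takes new
// versions on a fixed cadence, and it only ever takes versions that have
// already been validated on the faster channel before it.
type Channel struct {
	Name    string
	Cadence time.Duration
}

// The four channels from the example: instant (weekly), fast (every other
// week), steady (every three weeks), slow (every four weeks).
var pipeline = []Channel{
	{Name: "instant", Cadence: 7 * 24 * time.Hour},
	{Name: "fast", Cadence: 14 * 24 * time.Hour},
	{Name: "steady", Cadence: 21 * 24 * time.Hour},
	{Name: "slow", Cadence: 28 * 24 * time.Hour},
}

// Build is a candidate version plus the quality-gate signals we have on it.
type Build struct {
	Version    string
	DeployedAt time.Time // when it landed on the faster channel
	MarkedBad  bool      // set by failing k6 test suites, or manually
}

// nextFor picks what a channel should run next: the newest build from the
// faster channel that has soaked there long enough and has not been marked
// bad. A bad build is simply skipped, so the previous good build gets
// promoted instead (like v2 moving to fast when v3 was pulled).
func nextFor(fasterChannelBuilds []Build, soak time.Duration, now time.Time) *Build {
	for i := len(fasterChannelBuilds) - 1; i >= 0; i-- { // ordered oldest to newest
		b := fasterChannelBuilds[i]
		if b.MarkedBad {
			continue
		}
		if now.Sub(b.DeployedAt) >= soak {
			return &b
		}
	}
	return nil // nothing has qualified yet; stay on the current version
}
```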
And that's pretty much it. So we are still on this journey, of course; we're not perfect, and we're still trying to improve things. But if you happen to want to explore a path like this, we're definitely looking forward to hearing your thoughts and experiences, and we'll continue to share ours. That's it. Any questions? If you have any questions, I think there's a mic that will be run around to you, so just raise your hand. Oh, okay, sorry, one over there.

Thank you. Hi, thank you for the presentation. You said at the beginning that your deployments were flexible, that you're making your deployment releases as flexible as possible, but at the end you presented four different channels. So I have just two questions: does a developer have to fit into one of those channels, or do you also have other, maybe critical, deployments or something like that?

So this was specifically for how we release Grafana, but there are other pieces of software that we release, and they can follow their own deployment schedules from there. And in the example that we gave, it was a weekly cadence; we actually have hourly, daily, weekly, and monthly. It was just an easier way of showing the concept to go week by week. But thank you. Okay, cool. Yeah, thank you.