Hi everyone, my name is Steph, and my co-presenter Martin Pitt and I are going to be talking about error budgets today. Specifically, how to apply them to your open source project. I wish we could be there in Brno all together. I have such good memories, and I'm imagining a room full of all the cool people and us celebrating together. But alas, we can't, this is virtual, and I'm yearning for those days again. Oh well.

So, error budgets. Error budgets are usually applied to services and operations. They measure the reliability, stability, latency, or other expectations of a service. However, it's a really useful technique, and I want to share how to apply it to an open source project.

So why would you do such a thing? Why would you use error budgets in an open source project? What problem are we actually trying to solve here? Have you ever felt that the maintenance of your open source project was becoming overwhelming? Like there are too many bugs piling up in your project, or too many issues being filed, or CI is way too flaky to be useful? Maybe the infrastructure you rely on for packaging or testing is constantly falling over, or there's too much code review going on. These are all actually signs that your project is successful. Most projects don't have this kind of problem. It's a good problem, where you have a lot of action going on, but it can be overwhelming. So how do you know what to focus on? How do you make sense of this? This is where we use this handy methodology, error budgets, to get out of this overwhelmed state and know whether what you're feeling is normal (it's okay, it's not a problem yet, we can survive) or whether it's out of control and you need to pay attention and focus on it.

Error budgets come from a concept called Site Reliability Engineering. It's a methodology that was pioneered at Google, and error budgets are a foundational framework within it. You can read all about it in the SRE holy book from O'Reilly. I actually have it on a shelf behind me, but you don't need it in dead tree form; you can read the entire book online, check this link here. It's a long read, but the main chapter that's interesting for error budgets is the one called "Service Level Objectives". I'm not going to read you the whole thing like a bedtime story, but we'll go through some basic concepts and give you a quick briefing, and then Pitti is going to show us how they applied it in the Cockpit project.

Error budgets are a budget of how often a given event, usually an error, is allowed to happen in a given timeframe, usually 28 or 30 days. If that amount is exceeded, then you stop other work and you focus on fixing the root cause of whatever caused that problem.

If you're trying to work with error budgets, it all starts with a level of service. You have to ask people: ask your users, ask your contributors, ask the people who are experiencing your project what their expectations really are. Here are some examples of levels of service for a project. First, the project is not stable for users if bugs are constantly being filed as issues and piling up in a big backlog. Users expect a stable project, and this is how we can tell if the project is not stable enough. Secondly, a different experience: users should rarely see test flakes, that is to say false positives in the CI testing, on their pull requests.
Or another example: a contributor should typically see their pull request reviewed in a day or two. Now, these are experiences from a user's or a contributor's perspective, and they're not really something you can act on or measure; they're very hazy. So the next step is to take these and make them concrete. The concrete form of a level of service is called a service level indicator (SLI). These are measurable things. How would you measure these three different experiences? Well, for the first one, we could use the GitHub or GitLab API and figure out how many open non-RFE issues (that is to say, not feature enhancement requests) have been filed and are still open, sitting there. These are bugs. For the second one, we could measure the percentage of all pull requests that got merged without a test retry, without someone manually retrying the tests or asking the CI to retrigger them. And for the third one, we could measure the age of the open pull requests that have no review.

Okay, so we have ways of measuring these things. Now we need to figure out what the acceptable range is, and that's called the service level objective, or SLO. These are the targets that you think make sense and that the user who has the expectation for that experience would agree with. For the first one, we might say that more than two and less than 100 open bugs in the last 30 days is acceptable, but anything outside that range is not. This one's actually interesting, there's a little twist in here. Obviously these numbers are arbitrary, for our fictitious project here. But if you have a lot of bugs being opened really quickly, that's an indicator that something has gone wrong and we need to figure out what it is. And if no bugs are being filed at all, that's suspicious too, and worth looking into: is the bug tracker not working? Is anyone using the latest release? Did someone fork your code and all the users went and are using the fork? It's worth figuring out what the hell is going on. So you can see this SLO has an upper and a lower bound; inside that range makes sense, outside of it doesn't.

For the second one, the percentage of pull requests merged without a test retry, we might say 95% is acceptable. Every once in a while, the reality is you're going to have to retry the tests for some reason or another, maybe a dependency has a problem or something. So we set a reasonable objective here, and this is important: when you're defining error budgets, 100% is almost never a correct value for an SLO. Why? Because a typical contributor can't tell the difference between nearly perfect and perfect. If you aim for absolutely perfect on your SLOs, you're going to waste way too much time and energy on exactly that, and they won't have the intended effect. They'll essentially be noisy and nagging instead of being genuine and representing reality.

And lastly, the SLO for the third one. We expect that open pull requests get some form of review in less than two days. We can measure the age of an open pull request that has no review, and we can say that under two days is acceptable and over two days is not.

So let's look at how this actually plays out. Imagine we measure these over time. What is an error budget? An error budget is that measured SLO over a rolling window. So, I think I dropped the third example here.
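Just to make the measuring part concrete before we look at the graphs: here is a minimal sketch, purely for illustration, of how the first indicator (open non-RFE issues) might be pulled out of the GitHub API. The repository name and the "enhancement" label are made-up placeholders, and I'm assuming the Python requests library and an unauthenticated request.

```python
# Minimal sketch of a service level indicator: count open non-RFE issues.
# "myorg/myproject" and the "enhancement" label are hypothetical placeholders;
# a real script would authenticate to avoid GitHub's anonymous rate limits.
import requests

def count_open_bugs(repo="myorg/myproject"):
    issues = []
    url = f"https://api.github.com/repos/{repo}/issues"
    params = {"state": "open", "per_page": 100}
    while url:
        resp = requests.get(url, params=params)
        resp.raise_for_status()
        issues += resp.json()
        url = resp.links.get("next", {}).get("url")  # follow pagination
        params = None  # the "next" URL already carries the query parameters
    # The issues endpoint also returns pull requests; skip those, and skip
    # anything labelled as a feature request (RFE).
    bugs = [i for i in issues
            if "pull_request" not in i
            and not any(label["name"] == "enhancement" for label in i["labels"])]
    return len(bugs)

print(count_open_bugs())
```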
For the first example, we can see here a graph over 30 days of the number of issues opened each day. Our SLO said that fewer than two or more than 100 issues in 30 days was outside of our error budget. We've used up our error budget here, because over the last 30 days we had 254 issues filed. Very likely sometime in the middle of the month we had already used up our error budget, and we should have taken action. This is an example where the error budget has been exceeded, and in this case exceeded dramatically.

In the second case, this is the percentage of pull requests that were merged without retries. We can see that on average over the last 30 days, 95.1% have been merged without retries. So we've used up most of our error budget, because our goal was 95%, but we're still okay. And in fact, as long as we keep this trend going, where most tests are stable (as you can see, above 95% stable), we're going to be fine. We should be ready in this case to take action, but we don't have to take any action yet. I'll show a tiny sketch of this arithmetic in a moment.

So I hope that makes sense. That's a whirlwind tour of how to get to an error budget. But then the question is: what do you do in the first case, when the error budget has been exceeded? Okay, that's interesting, now what? Let's talk about that.

When you're getting close to exceeding your error budget, or it has been exceeded, you should stop all other work and focus on that error budget. You stop working on new functionality, your branches, your pull requests, merging new features, accepting contributions that are new features. And you triage: what is the root cause of the error budget being exceeded? You ask the question and try to find out what is going on here. How can we change this so that it doesn't happen again?

For example, in this case we had way too many issues being filed, and that indicated that the project is not stable enough for users. So we might ask: is there one broken thing that we merged recently that broke everything and is the root cause of all these bugs being filed? We should do some work to figure that out. Or: do we have enough unit and integration tests to catch regressions? Maybe every time we merge code, we break a little something here and there, and we really need to invest more time in our project and check, before merging a pull request, whether it's tested, whether there's a CI test or a unit test that covers it. Or maybe it's all just gone to pot as a one-time event, and we need to spend some time fixing all the issues and set aside two weeks to stabilize the thing. Who knows, that's possible. And what if too few bugs are being filed? Well, is the bug tracker broken? Is nobody using our project? We talked about this before, but it's also worth investigating and figuring out what the root cause is.

Then, based on the root cause, figure out what you're going to do to change the error budget. Make a decision. If you're the only person in the project, or the maintainer, make the decision yourself. Or work with the other co-maintainers, the team, and have a little discussion: okay, here's what we figured out, what do we do about it? So you might fix the root cause, you might spend time fixing bugs, add tests, and so on. But until that action is completed, you don't merge pull requests that are unrelated to this action; you only merge the ones that are related to it.
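Coming back to that retry example for a second: the "how much of our budget have we used" number is just the failure rate measured over the rolling 30-day window, divided by what the SLO allows. Here's a rough sketch of that arithmetic with made-up merge records, not real project history:

```python
# Rough sketch of error budget arithmetic over a rolling 30-day window.
# The merge records below are made-up data, not real project history.
from datetime import datetime, timedelta

SLO = 0.95                     # goal: 95% of PRs merge without a test retry
WINDOW = timedelta(days=30)

# one made-up merged PR per day; only the very first one needed a retry
merges = [(datetime(2022, 1, 1) + timedelta(days=i), i == 0) for i in range(30)]

def budget_used(merges, now):
    recent = [retried for when, retried in merges if now - when <= WINDOW]
    if not recent:
        return 0.0
    failure_rate = sum(recent) / len(recent)   # fraction that needed a retry
    allowed = 1.0 - SLO                        # the error budget itself: 5%
    return failure_rate / allowed              # 1.0 means the budget is exactly used up

print(f"{budget_used(merges, datetime(2022, 1, 30)):.0%} of the error budget used")
```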
You also push out your goals and milestones. Maybe you wanted to land a big feature; well, you delay it. Realistically, you don't have a project that is healthy, by the error budget's indication, so you should delay that. And you have to make sure that pull requests coming in from others abide by the same rule, so you have to communicate to others that we're in this state. If you do have a time-based release, maybe you release every two weeks or every month or every six months, you still do your release. It will simply include fewer features, less work that's unrelated to this root cause, and more fixes, more stability, and so on.

So what are the benefits of using error budgets in this way? Well, you remove stress from the teams and the maintainers working on the project, because you don't have to second-guess yourself. Is the project stable enough, yes or no? Do I need to spend time on code review or triaging issues or fixing the CI system, yes or no? You don't have to go back and forth on it. You can actually focus, and that reduces stress. It's an indicator of health for your project. Like we said, you can communicate to your contributors and your users where you're at and what you're investing in, and that increases trust a lot; people like working in a place that's predictable, where they can tell what's going on, or using a project that communicates this kind of thing to them. And contributors are no longer frustrated: they get the expected experience, things like CI work better, issues are addressed, and so on. Of course, I used three examples here; you can come up with your own. There are many things that error budgets could be applied to; I keep harping on the same ones just as examples. And the users feel engaged and heard.

So, consequences. Well, you have to be rigorous in order for this to work. With error budgets as described in the SRE handbook, there are two teams, the SRE team and the engineering team, and they measure the error budget. The SRE team, which is responsible for operating the service and its reliability, won't accept changes or new features from the engineering team while the error budgets are exceeded. So there's a natural cross-check there. Whereas if you're one team, or even one person, working on a project, you have to be rigorous yourself. You have to stick to it, you have to measure it and actually act on it, so it requires a bit of diligence. It may delay feature work or new functionality in the project; no surprise there. And it forces you to actually take the time to focus on technical debt or overwhelming backlogs. It also indicates when you should stop focusing on technical debt, because it's good enough. There are endless bugs and endless problems and endless backlogs in your project, and you could be lost in there forever. This gives you a heads-up saying: okay, we did it, we're good enough now, and we can focus back on contributions or new functionality or other things like that.

Cool. So I'm going to hand this over now to Pitti, who has an excellent dive into how this is applied in the Cockpit project, to do with infrastructure and test flakes. Take it away.

Hello everyone, I'm Martin Pitt. I lead the Cockpit team at Red Hat. First of all, thanks a lot, Steph, for the introduction to error budgets. I want to explain how we apply these principles to our Cockpit project.
When I first heard about this concept, error budgets did not really seem to apply to our project. Cockpit is a software product, after all. Users install it from a package and use that; we don't provide a service that runs Cockpit for them. However, we do use web services internally, namely machines and an OpenShift cluster to run our tests, and we crucially depend on that infrastructure. And for that, service level objectives and error budgets very much do apply. It's a little easier for us because this is an internal service, so we are our own provider and customer, we have a tight feedback loop, and we don't need to play any blame game.

In the past we had long phases of slowly deteriorating tests or unstable infrastructure. So we got used to hitting the retry button a lot until stuff passed. Of course, this is frustrating, it makes it really hard to land stuff, and we got afraid to touch code with known unstable tests. More importantly, it also hides real-world problems. While many bugs are in the tests themselves, in a lot of cases these failures actually show bugs in our product, Cockpit, or its dependencies, the operating system that it runs on. And we also did not have any systematic prevention against introducing new unstable tests. Occasionally we did a cleanup sprint, but it was always a bit hard to know what to look at first, like where the most pressing problems are.

The first realization was that a test always passing is not attainable, nor even a good goal, even if you ignore flaky infrastructure for a while. The reason is test complexity. The numbers here give you some idea of how many moving parts are involved in testing a Cockpit PR. And you have to add to that the unreliable timing due to noisy cloud neighbors, or cloud nodes doing something unexpected or erroneous, or the tested operating systems just tending to do stuff in the background. But we have to differentiate: there are bugs in our own product and tests, which are under our own control, and these are the ones we need to fix. Then there are bugs in the operating system; these we need to investigate, report, and track, and then skip them when they happen. And finally there are failures of the infrastructure; for those, retries are really justified and also unavoidable to some degree.

So we wanted to become more systematic and objective about all this in order to get us out of this hole. We wanted to define goals, what keeps us happy and productive, and then define a budget for how much failure we are ready to tolerate. Then we translated those into service level indicators and objectives, which drill down into the specifics. Then of course you need to implement the measurement and the evaluation of these indicators. And most importantly, we needed to define a strategy for how we deal with test failures in a sensible way, so that we don't treat all failures the same way.

So we met with our team to discuss what keeps up our velocity and motivation, and pretty much everyone agreed on three main things. Pull requests need to get test results reliably, and they need to get validated in a reasonable time. Failures in these tests need to be relevant and meaningful; humans must not waste time interpreting unstable test results to figure out whether they are unrelated or relevant to the change they are proposing. And we must not be afraid of touching code. So we've written down these goals on our public wiki page; the link is on the slide.
This gives us some commitment to them. After that we formulated service level objectives to define what exactly we mean by these goals. On the same wiki page we have six service level objectives, and they define measurable properties together with an objective, implementing aspects of our goals. The first example here describes test reliability: we don't want to have to retry pull requests unnecessarily often. And the second one applies to infrastructure reliability. We have four more, as I said, but they are just different aspects and don't really introduce anything fundamentally new.

Fortunately, almost all of the required data can be derived from the GitHub Statuses API. This is a machine-readable API that gives you the whole history of what happened to all the tests in a pull request. You can see the initial status here when a PR has just been submitted. Nothing much has happened yet; we just know that we have a pending test request, and we know when it happened from its created_at timestamp. Once a bot picks up the pending test request, it changes the description to "in progress" and attaches a target URL so that you can follow along, and it also gets a new created_at timestamp. The time delta between this created_at timestamp and the previous status (number zero) gives you the time the test spent in the queue. This is exactly what's needed for computing the first SLI that I mentioned. And once the test finishes, the state changes to success or failure; in this case it's a failure.

As I mentioned, the Statuses API remembers the entire history. So as we read through that history, when we see that a failure goes back to in progress and eventually success, we can deduce that this was a retry and tally it accordingly. For that we have a store-tests script which reads and interprets this history for merged PRs and puts it into an SQLite database; the link is on the slide. We also regularly run SQL queries on the database to calculate the current values of the indicators and export them in Prometheus text format. Then we have a Prometheus instance that regularly picks up the current values and stores them in its brain, so that we have a whole history of these indicators. And we have an accompanying Grafana instance which graphs these SLIs and objectives in a nice way, so we can move around in time and investigate problem spots more closely. Again, the link to Grafana is on the slide; it's public.

You don't need to be concerned about the details here, it's just to give you a coarse impression. But one important detail is the red bars: they show where the indicator exceeds the service level objective, which means it starts to eat into our error budget. This is interesting real-time data, but it's not a sufficient view of how much of our error budget we used up in the last month. For that we have another set of graphs which show the error budget usage over the last 30 days; again, the link is on the slide. For example, this is the budget of our first objective, about merging a PR with or without retries. This doesn't mean that 80% of pull requests were retried; it means that our error budget for retries is 25%. So 75% of pull requests must get merged without retries, up to 25% may need retries, and of that 25% margin we have used up about 80%.
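To give a flavor of the kind of processing involved, here is a small sketch of walking a PR's status history; this is not our actual store-tests script, just an illustration. The repository, commit SHA, and status context names are placeholders, and I'm assuming an unauthenticated request with the Python requests library.

```python
# Sketch of reading a commit's status history from the GitHub Statuses API
# to compute queue time and spot retries. Not the real store-tests script;
# "myorg/myrepo", the SHA, and the context name are hypothetical.
from datetime import datetime
import requests

def parse(ts):
    # GitHub timestamps look like "2022-01-20T14:03:12Z"
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

def analyze(repo, sha, context):
    url = f"https://api.github.com/repos/{repo}/commits/{sha}/statuses"
    statuses = requests.get(url).json()
    # the API returns newest first; put one test context into chronological order
    history = sorted((s for s in statuses if s["context"] == context),
                     key=lambda s: s["created_at"])

    queue_time = None
    retries = 0
    for prev, cur in zip(history, history[1:]):
        # the first pending status has no target URL; when a bot picks the test
        # up it posts another pending status, and the delta between the two
        # created_at timestamps is the time spent waiting in the queue
        if queue_time is None and prev["state"] == "pending" and cur["state"] == "pending":
            queue_time = parse(cur["created_at"]) - parse(prev["created_at"])
        # a failure that goes back to pending (and eventually to success)
        # means somebody or something retried the test
        if prev["state"] == "failure" and cur["state"] == "pending":
            retries += 1
    return queue_time, retries

print(analyze("myorg/myrepo", "abc123", "fedora-35/unit-tests"))
```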
So we are still good as per our own goal, but judging by the slope here, we will most likely exhaust the budget in the next few days, so we will probably need to take action soon. And this is the budget for the other SLO I mentioned, about queue time. Most of the time this is really fine, but it completely exploded when the Westport data center went down. That data center hosts our main workload, and it's the only place which can run Red Hat-internal tests; there is no permanent fallback for those. Normally when this happens we just spin up the fallback in EC2, but as you can see from the dates here, this happened right at the start of the end-of-year holidays, and since nobody in our team was around to do work, nobody cared much. The pending pull requests were just automated housekeeping pull requests filed by the bots, and they were neither that interesting nor urgent.

So finally I want to drill down a little bit into how we handle individual tests. The high-level goals of "don't retry PRs too often" and so on are really emergent results of the hundreds of individual test outcomes that come from each pull request. As I explained before, we can't expect a 100% success rate, mostly due to random noise. So we introduced the concept of an affected test: if a pull request changes the code which a test covers, or changes the test itself, we call that test affected. And we introduced an automatic retry of unaffected tests, so that they get retried up to two times and only have to succeed once out of these three attempts. That is the carrot in this equation, and it made our lives dramatically better, because that's essentially the bit that takes care of all these weird random failures that are just noise.

However, this is not sufficient, because with just this approach you would quickly introduce new flaky tests and overall your quality would quickly go down. Soon enough the tests would be so bad that not even three retries would be enough. So we needed to introduce the counterweight, the stick, and that is that affected tests need to pass three times in a row. That has shown itself to be very effective in preventing the introduction of broken tests.

And thirdly, we also need to track tests which fail too often. Of course there's always some base failure rate of a few percent due to the noise, but that random noise should distribute evenly across the tests. The interesting ones are those which fail more than 10% of the time, because they are the ones that break pull requests even with automatic retries, and they fail a little too often to explain away with just random noise. These are the ones we need to investigate and fix, and at any given time this list is very small, so we can drive it to zero and we know exactly where we get the most bang for the buck.

So where are we now? I'm pretty happy with this overall, I must say. In the last poll in our team, everyone said that they are not feeling blocked by or scared of pull requests and tests anymore, and productivity and turnaround are really good right now. The main missing thing is that we need to add notification or escalation from Grafana once our budgets are shrinking and get too close to the limit, or even go above it. Right now it's just me looking at these graphs every now and then. Another important point is that we need to regularly review and adjust these objectives to our current feeling of happiness.
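As a rough sketch of that carrot-and-stick policy, the decision logic looks something like the following. The run_test() helper and the way "affected" is passed in here are made up for illustration; our real bots implement this differently.

```python
# Rough sketch of the affected-test policy: unaffected tests may be retried
# up to two times (pass 1 of 3), affected tests must pass three times in a row.
# The run_test() helper and the notion of "affected" here are hypothetical.
import random

def run_test(name):
    # stand-in for actually running a test; succeeds 90% of the time
    return random.random() < 0.9

def test_verdict(name, affected):
    if affected:
        # the stick: a test whose covered code (or the test itself) was changed
        # by the PR has to pass three times in a row
        return all(run_test(name) for _ in range(3))
    # the carrot: an unaffected test gets up to two automatic retries and only
    # needs to succeed once, which absorbs random infrastructure noise
    return any(run_test(name) for _ in range(3))

print(test_verdict("check-networking", affected=False))
print(test_verdict("check-storage", affected=True))
```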
The goals might need to get tighter, for example if we figure out that we are still not happy about the number of retries we have to do, or possibly we need to relax them. We had a case where an objective was too strict: we violated it all the time, but in reality nobody cared or even noticed that something was wrong. It doesn't make sense to spend time fixing something for an artificial goal that nobody cares about; you adjust the goal instead. And finally, we also need a more formal process for going into error budget fixing mode, that means announcing it to other teams and having a better mindset about it. Right now it's still a bit too ad hoc.

And then of course there are lots of other things we could do. For example, we might want to set up an automatic fallback if our main data center fails, as it did over the holidays. Normally this is just a single Ansible playbook, but the fallback is expensive in dollar terms, so it is not automatic. This is a conscious decision we need to take: how much do we value keeping a human in the loop against ruining our statistics? But at the end of the day, the statistics are just a tool; we can still make our own decisions around them.

So if you have deeper questions, we welcome you to join our chat and talk to us in #cockpit on IRC. And here's also a link to our homepage, which has pointers to the mailing list, documentation, and knowledge about how our bots work. Thanks a lot for your attention. We still have some minutes for Q&A now.