Hey everyone, is the volume fine? Can you all hear me? Good, OK, thank you. Welcome to a talk on wildfires, firefighting, and how we keep Kubernetes releases reliable and on time. I am Nabarun. Hi, I'm Madhav. We both work at VMware.

Before we begin our talk, some golden rules from the community: please follow the code of conduct, which essentially boils down to being kind and respectful to everyone, and not doing or saying anything you might regret, because everything is being recorded. I say this almost every day in my community meetings. For the virtual audience, we do have closed captioning, so if you are watching online, please do enable that.

Now, who are we? I'm Madhav. I work as an engineer at VMware. I'm also a technical lead of SIG Contributor Experience; if you don't know what that is, it's a special interest group within the Kubernetes community. And yeah, Nabarun? I'm Nabarun. I am one of the Kubernetes Steering Committee members, I am also a Kubernetes SIG ContribEx chair, and I do a lot of Kubernetes security stuff.

Before we start, to whoever uses Kubernetes: please start using registry.k8s.io. k8s.gcr.io is frozen now and will soon go away; you won't be able to access that registry anymore. We have been talking about this probably a thousand times before the conference, so I hope you have already moved, but it's still a good thing to call out.

Now, what's the agenda? We will show you how Kubernetes releases happen, then set some context on why we are giving this talk. We will talk about what goes right when we try to fix fires in the community, what we can improve, and what you can take away from this talk for your own projects.

So to start with: how and when is Kubernetes released? We release Kubernetes every four months, approximately three releases a year, and it's a very elaborate song and dance of people and processes, a big mix of both.

Now, about the people. We have a really big yet small team; we are always short of hands. But we do have verticals in the release team that handle specific areas of the release. The release lead is basically the coordinator for everyone. The branch manager cuts tags. The bug triage team triages issues. The CI signal team takes care of our CI. The comms team writes the blogs you see at the end of the release, the feature blog and the release blog. The docs team handles all the docs that users consume. The enhancements team ensures that all the features going into the release have all their requirements set. And the release notes team prunes the whole changelog into the beautiful release notes you see at the end of a release.

Now, about the process. It is scary. Let me try to make it easier for you. Well, it is still scary. For this talk, we are mostly concerned with three teams: enhancements, bug triage, and CI signal. The other teams are essential to the release, but for our talk these three are the most relevant.

So, enhancements. This also shows how the release is structured in general. At the start of the release cycle, we begin collecting the features that should be delivered in the release. After roughly a month, we freeze the requirements: we don't let anyone add new features. This is to ensure that we do what we sign up for.
We don't overdo things; we don't sign up for more and then burn ourselves out. Then we have code freeze, where the code is frozen: you can't merge anything beyond that point in time. This gives us a good amount of time, almost a month or five weeks, to test the release and ensure that it is reliable. In between, we have test freeze. This exists so that after the code is in, if someone has missed a test, or to make the CI more reliable, we can still add tests. Although if there are any burning issues, we don't strictly follow the freeze, because we want our product to be reliable. We will come to some of those fires later on; we will talk about two stories.

Now, in the diagram, we will focus on one part, which is the bug triage team. The bug triage team tracks all issues and PRs in k/k, the kubernetes/kubernetes repo, across time, and escalates issues and PRs if they feel they should be taken care of urgently. We do have some labels that Madhav will talk about. Towards the end of the release, they ensure that release-blocking issues get adequate attention and that people are assigned to fix them right then and there, and we have a timeline: OK, here is the release-blocking issue, do we know why it is release blocking, and what are the steps we can take to fix it?

The CI signal team is very important at all parts of the release cycle, but their importance really increases after code freeze, because the code becomes stable and our CI gets a lot of soak time. If there are flakes because of code that is in transition, those get weeded out, and then at the end we get to test for almost a month, so we know whether the thing we are going to ship at the end of the release cycle is reliable or not.

At the end of the release cycle, I just wanted to call out, we have a big tide of PRs. Here, there are almost 70 pull requests that get merged at the end of the release cycle when we lift code freeze and enter code thaw. The reason we do code freeze is, as I told you, we want to soak our CI and run all the tests that are needed periodically to ensure quality. And at the end, these are the people who have been contributing and merging features that were held up during that period. Having said that, I'll let Madhav talk about wildfires.

Yes. So now that we know a little bit about what goes down during a Kubernetes release, let's start talking about the meat of our topic. We have wildfires, firefighters, and sustainability. It's quite a mouthful, so let's try and break it down a little bit.

Wildfires. Let's break that down a little bit too. Fires: what am I talking about here? When I say fires, I largely mean some of these things. First, any test that starts to flake in our CI, because code can be unreliable, something might be overlooked while reviewing, or there might be things no one could have thought of that go wrong once the PR is merged, and some test starts flaking; we'll show a small sketch of what a timing-dependent flake can look like in a moment. The problem with this is that it slows down PR velocity: it affects how many PRs can get merged and whether PRs get merged in a timely manner. So that's one type of fire. Then we have a failing test, meaning it's not flaking: it doesn't occasionally pass, it just fails every single time, right?
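As promised, here is a small, hypothetical illustration of the first category, a flaky test. This is not from the Kubernetes tree; it's just a Go sketch of a test that asserts on wall-clock timing, so it tends to pass on an idle machine and fail on a loaded CI node.

```go
// flaky_sketch_test.go — a hypothetical example of a timing-dependent test.
// It usually passes on an idle machine but can fail on a busy CI node,
// because it asserts on wall-clock time instead of on behavior.
package flakysketch

import (
	"testing"
	"time"
)

// doWork simulates an operation that "should" take about 50ms.
func doWork() {
	time.Sleep(50 * time.Millisecond)
}

func TestDoWorkIsFast(t *testing.T) {
	start := time.Now()
	doWork()
	elapsed := time.Since(start)

	// Flaky assertion: on a loaded node the goroutine may be scheduled
	// late, so elapsed can exceed the budget even though nothing is wrong.
	if elapsed > 60*time.Millisecond {
		t.Fatalf("doWork took %v, expected under 60ms", elapsed)
	}
}
```

The usual cure is to assert on behavior rather than on timing, or to inject a fake clock, which is a lot of what CI-reliability work ends up looking like.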
And we are going to see some examples of this soon. And then we have a release blocker. This is: stop everything. It doesn't matter what freeze we are in, we need to fix this so that we can ship Kubernetes on time to our users. These are issues which sometimes require collaboration across communities, or across various groups, as we will soon see. There are other types of fires that you might come across in the community that we won't look at today, such as regressions. A regression is sort of a special case: something went wrong after it got merged, or after we shipped something. This is actually something we will talk a little bit about very soon, because there was a very interesting case in the latest Kubernetes release, 1.27. And then you have the usual bug: something got merged which shouldn't have. Not to blame anyone, but we need to fix it, because we can't ship something which has bugs. Don't quote me on that.

So, flakes and failing tests can often be release blockers, right? Now, when I say wildfires specifically, I'm going to be talking about release blockers. And release blockers have, by and large, the following characteristics. To fix a release blocker, you might need inter-domain collaboration; in the Kubernetes community, that means collaboration across SIGs. So, for example, for a scalability regression, the scalability group might advise something, and, say, API machinery or node would have to work with them to fix it. It might even require inter-community collaboration: we very often collaborate with the Go community to bring our releases back on track when something breaks after a new Go version is released, and there are some very exciting developments happening in that area. And then, most importantly, fixing and solving release blockers for the most part requires specialized, undocumented context and knowledge that only a few folks in our project, a few project veterans, possess. There are valid reasons for this, but it can get unsustainable to the point where we might have to do something about it really quickly.

Release blockers usually happen towards the end of a release cycle, but not necessarily. For example, if I look at the distribution of release blockers for 1.24, a lot of them happened towards the end; for 1.27, you can see quite a few even along the middle, which is fine. But when they happen towards the end, that's where things start to get really scary, because the closer you get to the end of a release cycle, the less time you have to build confidence that the CI is actually conveying what we want it to convey.

OK, so that's wildfires. Firefighters. What do I mean by the term firefighters? Firefighters typically get things back on track, somehow. We will see what "somehow" means by taking a few examples of what has happened in the recent past. They have specialized and cross-cutting knowledge of the project. And typically, but not always, they don't really own the component that is on fire.
These are the folks who do whatever is necessary to get things back on track, using specialized and cross-cutting knowledge that the folks who actually own the component might not have, because of lack of time spent in the project or whatever valid reason you might think of. So firefighters help with wildfires, and the area on fire is typically not owned by the firefighter. This is the crux of the problem: if we have a small set of firefighters and one of them leaves, or a few of them leave, we might not be left in a good position for a release that might happen soon.

OK, so what's the typical flow of fighting a wildfire? We have largely two stages: triage and root cause, and then the actual fix. Both of these can have the following characteristics. The triage, or the fix, or both might be isolated to a specific component of the project; if it's isolated, it might be easier to get done and fixed. It might require cross-area collaboration; in the Kubernetes project's case, that's cross-SIG collaboration, where area X has to work with a different area Y. For example, network might have to work with, say, node to fix some kube-proxy issue that came up. Or it might require cross-community collaboration: if Go releases a version and Kubernetes bumps the Go version, very often things break. It might not break as often now, thankfully, because of some nice things that are happening, but we often work with the Go team, with the static analyzer communities, and really with any tool or dependency that Kubernetes relies on.

The triage and RCA stage is typically where the people fighting the fire don't own the component, so that is where the problem might lie. I wanted to validate my understanding of that, so we went and got some data from the last few releases, 1.24 to 1.27. If you look at the fires that occurred, the percentage of triagers who don't own the area on fire is the majority. And if you look at the folks who actually go and fix the fire, most of them do own the component. So a lot of the unsustainable firefighting might be happening in the triage and RCA stage, not really in the fix stage. So it's kind of like: OK, we need to do something, let's figure out what. If I am someone who doesn't own a component, I work out the root cause, publish it somewhere accessible, and then someone who actually owns the component goes ahead and fixes it.

OK, now sustainability, which is probably the most important thing we have to talk about in our community. There is a surprising amount of research out there in academia on sustainability in open source projects. One example is the work of Elinor Ostrom, who won the Nobel Prize in economics. In her work, as long as the average rate of withdrawal does not exceed the average rate of replenishment, the system is sustainable; we'll write that condition down in a moment. That being said, in the Kubernetes project, specifically for the kubernetes/kubernetes repository, the number of committers per release is, on average, increasing; this is from 1.24 to 1.27.
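To make Ostrom's condition from a moment ago concrete, here is one minimal way to write it down; the notation is ours, not hers:

```latex
% One-line version of Ostrom's condition, in our own notation:
% the contributor pool stays sustainable as long as, on average,
% maintainers are grown at least as fast as they step away.
\[
  \bar{R}_{\text{withdrawal}} \;\le\; \bar{R}_{\text{replenishment}}
  \quad\Longrightarrow\quad \text{sustainable}
\]
```

In firefighter terms: if experienced folks step away faster than new ones are grown into the role, every release leans harder on a shrinking pool.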
If we look at the data on fires, specifically fires that were fixed by folks who don't own the component, in every release that's more than 50% of the total fires; the higher the blue bar, the worse it is, basically. And if we look at how many firefighters there are per release, that is, people who help with release blockers, the number of firefighters is always less than or equal to the number of fires, and these are fires in components that they don't own. More importantly, it's the same set of people every release doing this, not a different set each time, and they're each handling more than one fire in an area they don't own. And it's been the case historically for quite some time now. So what can we do about it? Recapping our flow of firefighting, this is how it looks: you triage, you do the RCA, and then you fix. That being said, Nabarun.

So, a talk is not complete without this meme, right, when we talk about fires and sustainability. What exactly are these fires? We have metrics on how many fires we get, but let me give you a sneak peek, a deep dive into what we have gone through in the past few releases.

This is very recent; this actually happened last Thursday. You might know Kubernetes 1.27 was released on Wednesday or Thursday, I don't remember the exact date, but as soon as the release happened, we have great consumers of Kubernetes who found a regression, triaged it down to the commit that caused it, and then reverted it. But that's not the whole solution to the problem; it's just part of it, because now we have fixed it in master, which needs to be cherry-picked to the release branch, and then we have to cut a release. I apparently don't have internet. Oh, that's a fire. That's a fire. While we sort that out, does anyone have questions so far, anything you were going to ask at the end but want to ask now? It looks like we are missing only a few slides, but... cool, let me do it without slides for a moment.

So what happened was there was a regression, the fix needed to be cherry-picked to the release branch, and then a release needed to happen so that our consumers don't use a version of Kubernetes which doesn't work. That needs time and proper planning; we normally do it over the course of seven to ten days. But what I wanted to show here is the meticulous planning. For example, Dims laid out the plan: Marco is going to start the patch process from Europe, then Jim is going to take over the patch process in the US, and then Ben, who is part of our Google build admins team, would cut the deb and RPM packages, and then we publish the release. We also went through: OK, since we're doing an out-of-band patch release anyway, why not just merge the other patches that are already in the tree? Which is something we can do. There is one more slide which is missing, but what I wanted to show in it was that we released a new version of Kubernetes in 23 hours and change. On the 13th, evening India time, we found the issue; on the 14th, evening India time, we had a new Kubernetes release for consumers. So we had a turnaround time of one day. Now, what are the observations?
Now, detection was possible only because we had consumers who picked up Kubernetes as soon as it was available. The majority of our consumers are still on older versions of Kubernetes, but consumers like this help us a lot. So I would really urge you all: if you can, try out our release candidates or even our .0 releases. We would be really happy if you use them and come back to us with feedback. Community release managers and triagers were available around the globe; as you saw, there were people to take over parts of the process wherever the sun was up. And I would like to thank Andy, who filed the revert, Dims and Jordan, who triaged and ensured everything got done, the Kubernetes release managers who actually cut the release, and the Google build admins who cut our deb packages so that we could announce on time.

That's one story. Now I'll go back a few releases, exactly one year back. So I talked about something from one week back; now, one year back. Madhav talked about how we also have dependencies on the Go project. In Go 1.18 something happened, and what it resulted in is this: a good chunk of our important core tests were failing. Not everything was failing, that would be an exaggeration on my part, but a lot.

So what happened? In Go 1.18, the crypto library was modified: there was a change where SHA-1 signatures would be rejected, but that was meant for certificates, not for certificate signing requests. It started rejecting those requests too, which was a regression in Go; a minimal sketch of the kind of check that started failing comes right after this part. Now, how do we fix it? Because our CI would remain red until we do, and we can't release Kubernetes; this was between our release candidate and the .0 release. What do we do? First, triage. Nikita helped us with the triage: figuring out what is happening, going through the logs and through whatever was out there on the internet about this issue. And we found a fix; thank you, Madhav, for patching that. But it was a quick fix, and we knew it was a quick fix; we were very diligent that we were doing something temporary, because that was also what the Go community suggested. The actual fix had to land before the release, and we also needed to chart the best course. Here, if you see, Dims started laying out what we needed to do: the fix in Go needs to land on Go's tip, then it has to be cherry-picked to Go's 1.18 release branch, then a Go 1.18 patch release needs to happen, then we bump our dependencies, we build our images, and then we can release.

Is it that simple? No, because we had sub-fires. The Go 1.18.1 release was pushed back by a couple of days, or I think half a week or so, and we had to modify our release timelines accordingly. And Go does not have a fixed release cadence for patch releases, only for minor releases. We were lucky enough that 1.18.1 was scheduled just a couple of days later; that also got pushed, but if it hadn't been scheduled at all, it might have been hair-on-fire, running around everywhere. And thanks to our friends from the Go community; we really love how they work with us and fix things really quickly whenever we complain about something.

So what did we observe here? Folks with cross-functional knowledge helped here, folks who knew the tooling and the machinery of the project.
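And here is the sketch promised earlier: a minimal, hypothetical Go program, not the actual Kubernetes code path, showing the kind of CSR signature check that regressed. On Go 1.18.0, verifying a SHA-1-signed certificate signing request returned an InsecureAlgorithmError; Go 1.18.1 relaxed that again for this path. Newer toolchains may refuse to create SHA-1 signatures at all, which the sketch also handles.

```go
// Minimal sketch (not the actual Kubernetes code path) of the kind of
// check that regressed in Go 1.18.0: verifying a SHA-1-signed CSR.
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
)

func main() {
	// Generate a throwaway RSA key and a CSR signed with SHA-1.
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.CertificateRequest{
		Subject:            pkix.Name{CommonName: "sha1-demo"},
		SignatureAlgorithm: x509.SHA1WithRSA,
	}
	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, key)
	if err != nil {
		// Newer Go toolchains may refuse to create SHA-1 signatures at all.
		fmt.Println("could not create a SHA-1 CSR on this toolchain:", err)
		return
	}

	csr, err := x509.ParseCertificateRequest(der)
	if err != nil {
		panic(err)
	}

	// On Go 1.18.0 this returned an InsecureAlgorithmError for SHA-1,
	// which is roughly what turned a chunk of the Kubernetes CI red.
	err = csr.CheckSignature()
	var insecure x509.InsecureAlgorithmError
	switch {
	case errors.As(err, &insecure):
		fmt.Println("rejected:", err)
	case err != nil:
		fmt.Println("other error:", err)
	default:
		fmt.Println("signature accepted")
	}
}
```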
Basically, they needed to understand where we were using the code that was breaking, and we also needed people who could coordinate with the Go community and who knew how Go releases their software. That helped us a lot.

Now we'll talk about what went right. I'll go back over the stories and figure out what we as a community did correctly. We have a good number of people who can dissect issues and break them into actionable chunks. We have the right set of tools. We have something called TestGrid, which was the screenshot I showed earlier; there you can see some tests passing and some tests flaking, where purple means flaking and red means failing. Then we have something called Deck, where we can see all of our jobs, what is running, and filter through them. We have something called Spyglass, where we can see the logs of those jobs. And if you're not satisfied by looking at one job, we also have a triage aggregator which goes through 90 days of logs and shows you whether a given error ever happened in the last 90 days and with what frequency; it can essentially show a histogram across time.

Another thing we have right is a really great, diverse set of contributors. This is just the top 25 list; I have missed a lot of countries here, but I just wanted to chart our diversity: we have contributors across the globe. What else do we do well? We have a great set of people. Here I am showing the committers to the Kubernetes project who have employer support and who are independent. The number to focus on: we have almost 4,000 contributors who commit to the project, and a great set of people who are supported by their employers to work on the project, which helps us a lot because they get dedicated time to work on it.

Now, what can be improved? OK, not to be the bearer of bad news, but I keep talking about things that go wrong. But you know what, that's important to talk about, so let's talk about it. Strategically growing owners: what do I mean by this? Growing owners, approvers, and reviewers in the project is critical, period. It's undoubted; it's been the theme of almost every KubeCon for the past two or three years, and if you attend contributor summits, it's been the theme there as well. But looking back at our fire stories, maybe we can be mindful about something else: having a geo-distributed set of owners can help us quite a bit. Let me show you what I mean.

Let's say that our firefighters are in one part of the world and our owners are in another part of the world. A fire occurs. The firefighters do a quick fix, or at least an RCA. And time goes by, time goes by. It's night for the firefighters; for the owners it's just about 6 a.m. Finally, the owners are up, have had their coffee, and are ready to work. They see the fix, they /approve it, and the fire goes away. Now, what can we notice here? What's important is that in that whole window, even though the fix is out and everything is ready, there isn't anyone around to actually approve and merge the change. So in that entire timeline, the master branch is still blocked, something is still broken, and the release is still not on track.
Now, more importantly, what this implies is: when something has been blocked for a long period of time and then things unblock and we can start merging again, like the tide example Nabarun showed, you get a lot of PRs merging at the same time, which means there is a higher probability that the CI starts flaking or crashing. You don't have enough soak time, and you don't have enough confidence in the CI. So having a geo-distributed set of owners helps: it brings things back on track faster, and it gives the CI more time to soak.

Reliability. We don't need firefighters if we don't have fires in the first place. That's never going to happen completely, but there's always room for more reliability, and when I say reliability, at least for the Kubernetes project, I mean SIG Testing. There is an exponentially positive return on investing in the reliability of a project you care about. There has been a great amount of reliability work in the Kubernetes project; it's in much better shape now than it was a few years ago, so a huge, huge shout-out to SIG Testing and WG Reliability for that. There is ongoing effort in SIG Testing to make things better, and if you are an end user, a vendor, or just someone who cares about the Kubernetes project and uses it in some way, investing in and funding people to work on the reliability of the project, even though it might not be shiny and fancy work, is critical to your business.

Having more firefighters. Of course, reliability is there and we will continue working on it, but while we do, more fires are going to come up. And there's a great piece of research out there which basically states that not all participation in projects and communities has to be equal: there are going to be participants, and there are going to be maintainers, and the key is how we grow participants into maintainers. It talks about a few really interesting things; I can share the research work, it's really cool if you check it out.

Now, how can we have more firefighters in our community? Undocumented context. This is context and knowledge that only a small number of project veterans have, because of which we rely on them; they've been doing such a phenomenal job and we can't be more grateful for that. As a first step, we should start doing and publicly posting postmortems of all the fires we have. We do sort of have postmortems, in the form of the issues themselves, but we should do it in a more structured way and make it available for anyone to pick up and read; a hypothetical sketch of what such a record could capture comes right after this.

We should also enable folks who are potential firefighters, by which I mean folks who have already helped, in some form, with something that has gone wrong in the past. What we can do to encourage them is: if something goes wrong, instead of just putting up a fix and writing "fixes" plus the issue number in the PR description, let's provide more context on what can be done next, how we arrived at the conclusion, and why we're doing what we're doing, and essentially give people actionable chunks so they can help out with fires. And we have amazing teams, like the release CI signal team that Nabarun talked about, who can be enabled to be the entry point of firefighting, for example. So that's a really good area to explore.
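On that structured-postmortems idea, here is a hypothetical sketch of the fields such a record could capture; none of these names come from an existing Kubernetes tool, it's just one possible shape.

```go
// Hypothetical sketch of a structured fire postmortem, so the context
// stops living only in project veterans' heads. Illustrative only.
package postmortem

import "time"

// Fire describes one release-blocking incident.
type Fire struct {
	Title          string    // short description of what broke
	Release        string    // release cycle it threatened, e.g. "1.24"
	DetectedAt     time.Time // when CI or a consumer first surfaced it
	ResolvedAt     time.Time // when the branch was healthy again
	OwningSIG      string    // SIG that owns the component on fire
	Triagers       []string  // who did the triage and RCA (often non-owners)
	Fixers         []string  // who landed the fix (often owners)
	RootCause      string    // the RCA, written for someone new to the area
	Workaround     string    // any temporary fix, and why it was temporary
	FollowUps      []string  // tests to add, docs to write, process changes
	CrossCommunity bool      // needed coordination outside Kubernetes (e.g. Go)?
}
```

Even a lightweight record like this, filled in while the fire is fresh, gives the next release team something searchable instead of folklore.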
And there is this really good talk from quite a few years back on having more firefighters; it's about finding flakes and debugging them live, so if you're interested, check that out.

Takeaways: a globally distributed set of contributors, which we do have, with employer support, which we also have. But we need employer support and backing for the non-shiny, non-glamorous areas of Kubernetes, such as testing and reliability, and we need that now more critically than ever. And as Madhav mentioned, we need to train people who can both triage and debug fires at the same time, to reduce friction; we don't want people to hand fires over between two different teams. If we can enable everyone to do both, we will have the right set of people who can do all of the firefighting end to end. And we have to equip our contributors with the right set of tools; fortunately, we do have that. This talk is designed partly as a set of things we need to introspect on in our community, and partly as something other projects in the CNCF community can learn from the Kubernetes project. So: you should have a globally distributed set of contributors, with the right employer support, equipped with the right tooling, who are trained to triage and debug fires at the same time.

With that, thank you so much for attending our talk. Just one call-out: if you want to learn more about what we do, please join us at the SIG Meet and Greet, which is happening tomorrow at 12:30 p.m. at Europe Fire 1 on the ground floor; it's that building right there. I think we have two minutes, so maybe we can take one or two questions, but do leave feedback for our session; please scan the QR code as we take questions. Anyone? Thank you. Any questions?

Thanks. A big part, of course, was about how you find more people to be firefighters, but you briefly touched on whether you should invest way more in not needing those firefighters in the first place. I notice this with ourselves as well: if I celebrate fixing bugs too much, we get people who only fix stuff instead of preparing and looking ahead. And, you also mentioned it, the people who actually fixed it you called by name, but not the ones who did all that other work. Yeah, maybe that's something I definitely need to learn a bit more about.

So, in the examples that we showed, we had a set of triagers and fixers, both at the same time. If you look at the regression story, we called out who triaged it, Dims and Jordan planned it, and then who made the Kubernetes release, the release managers and the Google build admins. Now, to the other part of your question, about not incentivizing people to make more mistakes: people will make mistakes, and that's where the right tooling and the investment in integration testing come into the picture. In the Kubernetes community we have unit tests, integration tests, e2e tests, and we do have conformance tests as well. Sometimes something does slip by, and that's where we need to introspect and do a retro whenever we do a postmortem: if something was missed, we write a test for it so that it doesn't happen again. Although we can't control situations like, for example, the Go regression. Then again, that is something our tests actually caught, because we did not ship it to customers. I think we are done, but we are around here.
We are in the hallway. So feel free to ask any questions.