My name is Jeremy. I am a co-chair of SIG Release and I'm co-chairing Working Group LTS. I'm also an engineer at Microsoft Azure. Today we're going to talk about what's up with Kubernetes LTS. We'll cover a little bit of history to give some insight into where we've been. This is not the first time that the topic of long-term support, which is a nebulous phrase, has come up in the community, so we'll give a little bit of background there. Then we'll talk about what we've started to do with the current iteration of the working group and how folks can get involved and help us out with that. But first, let's talk about why this topic is perhaps interesting to folks. We've got a survey here looking at which hosts are running Kubernetes versions that are more than 18 months old. The text is a little bit small, but you can see that a large portion of these hosts seem to be running older versions, and the tail clumps around certain versions. We don't have a really big spike at the very end, but there's definitely a peak in there. The same survey — both of these charts are from a Datadog survey — shows that over 30% of hosts were running a containerd version that was unsupported at the time. So we've got Kubernetes and containerd, and it seems like both of these technologies are maybe a little difficult for folks to keep up with. I think the question driving the working group now, and maybe in the past as well, is: why is that? What is the underlying cause? Here's a quote from previous working group notes. First, let's set some terminology. Kubernetes minor versions are changes like 1.20 to 1.21; Kubernetes patch versions are changes like 1.20.0 to 1.20.1. The quote from this end user was that minor upgrades — going from 1.20 to 1.21, for instance — feel like major updates, and they're happening too fast for people to quite keep up with. So a couple of things to unpack from that. They're happening too fast — we'll get to that in a little bit. But they also feel like major updates. What's the reason for that? Well, if you've ever worked in software, there's always a push and pull between velocity and features on one side and supportability and fixing things on the other. There's always a tension between going faster and going slower and figuring out the right balance. Answering that question, I think, is really going to help us understand what we can do to make the experience better for folks. Okay, so we've had a couple of definitions already: major versions, minor versions, patch releases. Let's also talk about what LTS might be. Here's the definition from Wikipedia: long-term support is a product lifecycle management policy in which a stable release of computer software is maintained for a longer period of time than a standard edition. The term is typically reserved for open-source software, where it describes a software edition that is supported for months or years longer than the software's standard edition. So it's pretty generic. We have a lot of flexibility in what that means, but I think most people's interpretation of LTS, or long-term support, is that you're getting an extended period of time — maybe double, some longer period.
I think it varies from project to project what that might look like, but that's going to be the basis for what we're going to look at. Okay, so I mentioned this isn't the first time this has been discussed, so let's go back to when the first version of this started. And see, we've got Tim in the room here — I've got a screenshot of an email from Tim. This is from October of 2018, and in container years, that's a long time. 2018 is before the pandemic; my daughter was pretty young at that point. This email, I think, summarizes the state of things at the time. There are a lot of words here, but I'm going to pick a couple of things out. Should things be going faster? Should things be going slower? What is going to be supported? Out of this, a working group is proposed, but I think my favorite line in here is the parenthetical: "Working Group LTS" is simply shorter than "Working Group To-LTS-or-not-to-LTS" or "Working Group what are we releasing and why, and how is it best integrated, validated, and supported." There's a lot in there, because it is definitely a question that touches on many, many things in the project. And importantly, it says the short name should not be read to imply that LTS is a foregone conclusion. So the working group is started, it's looking at many facets of the project, and it is not presupposing that the conclusion will be LTS releases. And we are, I think, going into this again with the same underlying assumptions. This was quite a while ago — today we're about to see the release of Kubernetes 1.29, but at that point, Kubernetes 1.13 was the version being worked on and released. 1.9 and 1.10 were a few months out of support at the time, but we can see in this chart that quite a few people were still using 1.10 and 1.9. Is that pattern something we see today? We'll get to that in a second, but I think the push and pull between faster, slower, and keeping up to date is something that emerged as a pattern. When the working group was started, there were many, many things discussed: lots of questions to be answered, lots of discussions about various aspects of the project. You can find the old working group notes at this URL, and I put a QR code here if you want to go take a look at them, but we're going to use ChatGPT to try to summarize two years' worth of work. Going through many, many months of meeting minutes is really difficult, so we're going to use some AI help to understand what was happening in that space. Okay, so here's how the themes emerged: basically, I fed all the notes into ChatGPT and asked it to summarize the top issues and concerns the working group wanted to work through. The first one that came up was that the Go upstream release and support cadence was maybe not aligned very well with doing longer-supported releases of Kubernetes. There was debate over whether the then-current support window of nine months — three releases — was right, or whether it should shift to something different, but a lot of the text was around that support window. The next thing that came out of it was stabilization of releases and the use of beta APIs. There was a discussion about designating the last release of the year as a stabilization or bug-fix release. That has not happened; our end-of-year releases continue to be pretty feature-rich and maybe a little busy.
The conversations involved SIG Testing, SIG Node, and various other special interest groups, because that's what working groups do. An interesting one at the end here is that the idea of removing graduated beta APIs from the core was also addressed. I don't think that's quite the right summarization, but that's what ChatGPT said. I think the idea there was to disable beta APIs by default, and that's one we're going to come back to. Another topic was supportability and the long-term support working group — ChatGPT seems like it's going a little off the rails there — but the group discussed whether they should rename the working group to be more focused on general supportability. The suggestion of creating a reliability working group was mentioned in the text. Surprisingly, we saw that come out as a takeaway from this work, too. Next up, working group chair rollover is discussed. This is something that was discussed over two years, right? Two years is a pretty long time. One common thread was whether the working group should be continued or disbanded, and whether certain things should instead be handled by SIG Release. A discussion for this topic was scheduled for September 22. ChatGPT was pretty good at summarizing that one, I think. Overall, the challenges seem to involve striking the right balance between development and stability, and meeting the community's evolving needs and priorities. If we go back to one of those early slides, there was, I think, a tension between velocity and features. That seems to be a recurring topic throughout a lot of the discussion inside this working group. Okay, so a lot of topics were discussed there. What is one concrete thing that came out of this working group? Another piece of context for this time period is that releases were only supported for nine months, and there were four releases per year. So you don't get a yearly support period; things are supported for less than a year, and there are multiple releases happening at a pretty fast cadence. So a KEP, or Kubernetes Enhancement Proposal, comes out of Working Group LTS to increase the support period to one year. A data point that came up is that a lot of businesses operate on yearly business cycles and cadences, and this was something the project could do without a lot of change. It would be pretty impactful for end users of Kubernetes, but not necessarily require a lot of work on top of what was already happening. To do something like this, we have to extend the support of branches — so cherry-pick things back maybe a little bit longer than we otherwise might — we have to run tests for a little bit longer, and there are a couple of extra releases that folks from SIG Release have to support. But again, pretty straightforward in terms of supportability for the project, I think. Okay, so we see this was opened on January 20th of 2020 and approved on April 29th, 2020. And I think I accidentally approved this before it had general consensus, because I had just been added to the top-level approvers file and LGTMs acted as approval. But you can see there's a few months of skew there; getting these kinds of discussions fully approved is kind of difficult. If we go back, we can also see this was edited: it was going to start with 1.18, but then it became 1.19, because time goes on and releases are happening.
Okay, so we get to April 29th, 2020, and this policy is basically put in place. There aren't really any code changes that go into effect here, because we're not changing anything other than some policies. So we get to 1.19, and 1.19 and newer releases are going to have one year of patch support, which essentially means you can do yearly upgrades, follow your business cadences, and plan things around when you're most likely to be able to do them. Consensus is reached to end the working group, because that was the logical point to wrap things up, I think. People had also started to move on to other things, and the other questions didn't really have great answers; there were some blockers. If we go back to the things that were summarized, keeping up to date with Go versions was something identified as a challenge in that documentation. But this did drive a lot of good discussions and concrete outcomes. There were topics handed to logical owners going forward, such as SIG Release. One of them is: are we doing too many releases per year? So the working group wraps up. One thing about working groups, if you're not familiar with Kubernetes governance, is that they don't really drive feature implementations. They don't own code. They work across different SIGs to help identify things that can be done for the project and to work towards a common goal that may be cross-cutting. So that last point — that it drove good discussions, there were concrete outcomes, and there were next topics — feeds into the next thing that happens. As the working group was winding down, the topic of release cadence comes up. If you rewind the clock back to 2020, there were still four releases per year — basically quarterly, and every quarter you could pretty much count on a new release coming out, so around every three months. That was not addressed as part of Working Group LTS because it's a pretty SIG Release specific thing. There obviously are other implications to it: it's going to slow down velocity for the project a little bit, and it's going to take longer for features to graduate, because a lot of that is predicated on releases happening. But SIG Release takes this up, and we see at the end of 2020 into 2021 that a new KEP is opened to change the release cadence from four releases per year to something else; it ends up being three. It was opened on March 12th and merged on April 23rd. That happens. But one of the comments I found really interesting — the discussion happening inside the LTS working group kind of feeds into the KEP for the release cadence — is from Tim, who we saw earlier this morning: what makes us believe that the non-annual nature of the cycle resulted in the situation people are in, rather than users who simply don't or can't make the time to upgrade? So is it the requirements folks have for doing yearly upgrades, or is it really that things are mostly working for people and upgrading is not a core priority? People are focused on doing the work they want to do, and they can't or won't make the time for this. His question was: what prevents us from doing the survey next year and finding out that 30% of users are now 15 months out of date instead of 12 months out of date? I think that was a really interesting question, and it was one of the things we considered when we were doing the release cadence KEP.
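Before we look at the data, here's a rough sketch of what those two policy changes add up to in practice. This is purely illustrative arithmetic under stated assumptions — roughly three minor releases per year, about twelve months of patches per release, plus a short maintenance-mode tail — and not anything encoded in Kubernetes itself:

```python
from datetime import date, timedelta

# Illustrative assumptions, not authoritative policy:
#   ~3 minor releases per year after the cadence change (~4 months apart)
#   ~12 months of patch support per minor release, plus a short
#   maintenance-mode tail (the "about 14 months" mentioned later in the talk)
PATCH_SUPPORT = timedelta(days=365)
MAINTENANCE_TAIL = timedelta(days=60)

def support_window(release_date: date) -> tuple[date, date]:
    """Return (end of active patch support, end of maintenance mode)."""
    end_of_patches = release_date + PATCH_SUPPORT
    return end_of_patches, end_of_patches + MAINTENANCE_TAIL

# Hypothetical example: a minor version released in mid-August
patches_until, maintenance_until = support_window(date(2023, 8, 15))
print(f"patched until ~{patches_until}, maintenance mode until ~{maintenance_until}")
```

The point of the policy, as described above, is simply that whatever date you pick for an annual upgrade, the release you are on should still be receiving patches when you get to it.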
So let's find out if there's any data that can speak to Tim's question. The production readiness team does a survey every year to look at what clusters are in use and to gather a lot of data about reliability, and to really understand how that process is shaping up for everybody. Production readiness reviews are required for all features now, and they look at a lot of things around sustainability, monitoring, how things are going to be handled for upgrades, and what friction people might encounter there. But I think the really important question for our purposes is: what are the oldest versions in use? Here we've got the PRR survey from 2021. At this point, 1.21 is probably the most recent version, but that looks like a lot of versions trailing off behind 1.21. At that point in time, 1.21 would have been supported. I don't know whether the release cadence KEP had taken effect yet, so whether that was four releases or three releases per year, but we can see 1.13 was probably not in support at that point, and 1.16 was probably not in support at that point. So if this pattern holds, these clusters are going to fall out into that 15-month window. I think it's an interesting point to look at. 2022 looks pretty similar. At this point, 1.23 is going to be the most recent version, and we can still see a lot of 1.16 in there, a lot of 1.20, a lot of 1.17. So it seems like the pattern is holding. And here's the most recent one, from 2023. In this one, only 40% of people were running supported clusters. So we do have a lot of folks falling into that longer tail, and I think it's a question we want to answer and better understand in the working group — and that's some of the work we're doing today. We want to understand why that is the case. We had a little bit of discussion before the session started today with some folks in the front about what reasons people are stuck, or what is holding them back. And that's really what we're looking at inside the working group. But have we stood still as a project since then? Has anything been done that could maybe make the situation better — maybe make longer support periods easier? The answer is yes. Let's talk about a few of those things in a little more depth and try to understand how they are impacting people today and whether people have gotten to those enhancements yet. Because again, as features roll out in Kubernetes, they're going to be in newer versions, and if you can't get to those newer versions, is that blocking you from being able to take advantage of them? That's one thing we're interested in. But I think a lot of these improvements are interesting to discuss in terms of supportability and maintenance of clusters, both for maintainers of the project and for end users. So the first one that I think is going to help people going forward is that Kubernetes 1.19 and later ensure that all APIs required to run a cluster are GA — KEP 1333 covers that. New APIs are not allowed to be required until they graduate to GA. Think about that one for a second. Who has run clusters back in, like, 1.16? Did anybody have problems with the 1.16 release? Any specific, pretty core APIs? I think this is super useful for that. As you are deploying clusters today, building new workloads out, taking advantage of features, you're much less likely to be bitten by things going away or being deprecated.
But again, that's something that didn't come in until 1.19 and later, so there's a little bit of a tail for that. We still see some things going away — PSP is probably the most recent one — but I think it's getting better, right? If we compare now to 1.16, this is probably a much, much easier story for people. Kubernetes 1.19 and later return warnings and record metrics for use of deprecated APIs. This is another one where, having run clusters back in that time frame, trying to figure out what we were going to be bitten by was kind of difficult. Getting more of this data now — getting warnings back when you're using kubectl, being able to query the metrics and get data about what you're using — is pretty useful. Kubernetes 1.19 and later features require production readiness reviews. Again, I mentioned PRR a little bit ago; that was KEP 1194. If you look at the PRR template inside a KEP, it talks about upgrades and making sure there aren't going to be things that surprise people and make it difficult to do those upgrades. It talks about version compatibility, changes to default behavior — which is, I think, a really important one — and monitoring. The deprecation policy was updated to make stable API versions more permanent. I don't think we had ever really removed stable API versions, but this codified things much more as policy to make sure we weren't going to be able to do that in the future. Okay, so now we're moving into more recent versions. Kubernetes 1.24 and later only enable new APIs by default once they are stable. Previously, we saw things like beta APIs being turned on by default, which probably encourages people to use them and give feedback, and I think that's a great thing, but it also maybe makes it a little easier to consume something that might change and might have some impact going forward. That one's KEP 3136. Kubernetes 1.23 and later releases now stay on supported Go versions. This is a lot of work, but for us on the release team, and for folks maintaining releases, it makes things much nicer. We don't have to maintain Go 1.19 and 1.20, or 1.20 and 1.21; things are much more consistent, and we can take security fixes and be much more aligned with the Go upstream releases. If we have versions or branches that are stuck on older Go, and there are CVEs fixed in newer Go, it would be really difficult to quickly bump and pick those up. There's a lot of work when we bump Go versions — Carlos was just doing this during KubeCon a couple of days ago. I think this one's really, really useful and maybe gets toward answering the question of yearly upgrades. Kubernetes 1.28 and later now support an n-3 version skew between the control plane and the nodes. As we made changes to the release cadence and extended the support period, the version skew policy and documentation didn't really get updated, and there are some things that fall out of this around testing with version skew to make sure everything keeps working. But it really does now allow an annual node upgrade with a single bump of your nodes. You still have to do that piecemeal series of API server upgrades, but now you can just replace your nodes with the latest version and do that three-version bump, and I think that's pretty useful. But there's still a lot of discussion that we hear from customers — I think there was a lot of discussion in the room about wanting to stay on 1.24 for another two years.
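To make that last point a bit more concrete, here's a minimal sketch of a skew check — assuming the official Kubernetes Python client and a reachable kubeconfig — that compares each node's kubelet minor version against the control plane and flags anything beyond n-3. It's an illustration of the widened skew policy, not a supported tool:

```python
from kubernetes import client, config

def minor(version_string: str) -> int:
    # "v1.28.3" -> 28; ignores pre-release/build suffixes for brevity
    return int(version_string.lstrip("v").split(".")[1])

config.load_kube_config()  # or config.load_incluster_config() inside a pod
control_plane = minor(client.VersionApi().get_code().git_version)

for node in client.CoreV1Api().list_node().items:
    kubelet = minor(node.status.node_info.kubelet_version)
    skew = control_plane - kubelet
    status = "OK" if skew <= 3 else "OUTSIDE SUPPORTED SKEW"
    print(f"{node.metadata.name}: kubelet 1.{kubelet} vs control plane "
          f"1.{control_plane} (skew {skew}) -> {status}")
```

A managed provider will usually surface this for you, but the same kind of check is easy to fold into annual upgrade planning.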
So at the contributor summit in Amsterdam, we had some more discussions about what long-term support might mean for the project. Was it something we could take a look at again? Could we do this? There was a pretty lively discussion — you can see in this bottom picture there were quite a few people in the room — and a lot of documentation and meeting minutes came out of that. So I used ChatGPT again to try to summarize what came out of it. The four things that really bubbled up as important topics were: upgrade path complexity; support content, meaning what would be in LTS releases if we're going to do LTS releases; how we handle dependencies over longer support periods — Kubernetes is not an island, it has a lot of dependencies beyond Go, other libraries and things that are included, so how do we support those, and how would that hinder the supportability of LTS releases; and then infrastructure cost, which I think is another really critical piece to think about. If you've been to the last few KubeCons, you've probably heard a lot of discussion about credits from various cloud providers, reporting on spend, running out of budget, and all the work that people in SIG Testing and SIG K8s Infra are doing to make things more affordable and split across different clouds. If we keep branches around longer, we see an increase in costs — it's just unavoidable — and that is a really important thing to consider. We have actually had a couple of releases stick around longer in the past year or so. The KEP changed support to a year, but it's actually about 14 months because of the release cadence change and the version skew, and we have had a couple of releases stick around a little longer than even that 14-month period because of late-breaking CVEs that we wanted to cherry-pick back and cover with one final release. And it is not a negligible amount of money to keep those branches around, because there are so many tests that run and so many things required to keep them alive. So thinking about what that might look like in terms of support is, I think, a really driving factor. But out of that came a high-level goal of "longer support," in quotes, where what that means could be something the working group decides — and also a focus on enabling things like skip-level upgrades so that people have a better experience and less friction when they need to jump maybe five or six versions to get up to date and back into support. So we opened a new working group — really, we revived the working group — and that was in April. It took a little while to get agreement on how we were going to do that, and a lot of that comes down to: what is a working group? Working groups are groups inside the project that come together to work on a common goal. They don't own code, though, and I think that's a really important one. Merging code into a repository is governed by a lot of policies in the project; working groups are really a vehicle for consensus-building, and I think that sentence is the important one here. As we think about what long-term support might be for the project, it's going to require a lot of consensus, because it's going to touch a lot of SIGs. Any security fix that we want to bring back to older releases is going to involve the SIG related to that fix, right?
A lot of those fixes are going to go through SIG API Machinery, probably, and a lot are going to go through SIG Node, probably — those are the two places where we see a lot of security issues. And there's not necessarily a large group of people who can do those reviews and make those cherry-picks. As we extend the window of support for code changes, we definitely get to the point where it's harder to do some of those things, and we're asking more work of the maintainers. Working groups are distinct from SIGs in that, again, they facilitate collaboration across SIGs and they facilitate exploration of a problem; they are not fixing problems directly. So the ideal outcome from a working group is probably one or more KEPs that address the problems the working group has identified. Working groups typically have stakeholders from across the different SIGs. In Working Group LTS today, we have myself from SIG Release, we have Jordan, who covers everything, and we've got Micah, who has a lot of responsibility on the Product Security Committee. All of us come from cloud providers, and we have, I think, unique insight into what our customers are doing, so we can think about how best to balance what's good for the project against what might be useful for end users. Okay, so we have continued to have a lot of discussions in the working group. Scott, in the crowd back here, gave us a really good overview of what OpenShift has been doing and some of the challenges they've had. We have started to hear friction logs from some end users who are coming to share their experiences with upgrades and the difficulties they have there. We also have had a couple of really good presentations from folks at Google sharing things they're thinking about that won't necessarily check the box for LTS releases, but that will, I think, make the experience of upgrading and supporting things a little bit easier — things that might move the needle for folks without necessarily requiring us to do LTS releases. And this is a fairly recent one — I think it was opened three days ago, at the contributor summit — introducing the idea of a compatibility version for APIs: taking the current version and splitting it so that you can also set a compatibility version for the APIs. This is a KEP that will be opened soon and, again, it's going to require a lot of consensus. But the idea is that you would be able to do binary upgrades while setting those binaries to behave like older versions. Is that going to fix all problems? Probably not. But is it a thing that might fix the problem of "I need these new binaries because they have the security fixes, but I also need APIs that are going away — I need to be compatible with 1.24, but I need to run 1.28 because that's what has the fixes"? So I think work like this is what you're going to see come out of the working group. It's definitely going to be a long process to get consensus across all the different areas where we really need it. But to do that, you can come help us. You can come to our November 21st meeting — it's the next one, at 7:00 Pacific time, a little bit early for folks on the west coast.
But you can also join our mailing list or check out the Working Group LTS Slack channel on Kubernetes Slack. These are going to be great places to come. If you join the working group, you'll get an invite to the meeting; it's open to everybody who wants to come. You'll be able to take a look at the recordings on YouTube, but also scan through the notes. Maybe you don't need ChatGPT to summarize them yet, because there have only been a few meetings. You can see the things we've discussed so far, the things we want to discuss, and maybe help identify some of the problems the working group needs to look at and analyze. And you can help us today by taking our survey. We wanted to do a survey to really gather data about what options folks have for upgrades. What difficulty do people have in staying current on versions? Are you bound by any regulatory compliance schemes that make it difficult? Are you stuck because you have dependencies on APIs that have gone away? There are a lot of free-form questions in there, so you can give us data that maybe we haven't thought of. We opened this up last week, before KubeCon, and we've already started to get a little bit of data. There are a lot of questions, and it's going to take a long time to go through the answers, but it does seem like we still have some more work to do. So, to tie it back into the chart the PRR folks had done earlier — this is asking the same question: which versions of Kubernetes are you using, and what's your newest version of Kubernetes? Currently, there are quite a few responses on this side of the chart that are still on versions that are now out of support. The versions currently in support are 1.25, 1.26, 1.27, and 1.28; everything outside of that has fallen out of support. Actually, 1.25 at this point is done — we're not going to do another release of 1.25. If there are CVEs tomorrow, you're probably going to have a bad time, because it's done. We've got some more things to do. I would really encourage you to take our survey, provide your feedback, help us understand what problems you're having and what things we can identify as ways forward for the project, and definitely think about what areas we might be able to impact. With that, I'll open it up to questions. I think we've got a couple of minutes left. If folks have questions and we run out of time, we can just step off to the side or head outside to chat. We have a couple of microphones — go ahead and queue up there, ask, and we'll go from there. So for some of the customers you've seen in the last slide, they're running on 1.20 — two or three versions out of support. That's their API server version. The node pools, the data plane, are probably on 1.19 or even older than that. What do you do about that? That's a great question. This question is, I think, specifically asking about API server versions versus people running older nodes, and how we handle those things. We can do things like make n-3 a supported thing and make sure the testing actually covers that skew. We can do work to maybe improve the experience for that, and we can look at it as part of the working group. I got here a little bit late — apologies if you covered this early on in the talk. Have you done research or surveys on when you upgrade your Kubernetes clusters?
How long does it take you? What is the level of effort? In our survey today, we're doing that research now. I don't think we've gathered that specific data point recently — again, it's been a while since the previous working group ended, and a lot has happened since then. That's one of the data points we're really interested in gathering. We want to know: how long does a typical upgrade take? How long does it take you to upgrade your entire fleet? Those are definitely two questions in the survey, and we want to understand what those numbers are. Then there are a couple of follow-up free-form questions asking why it takes you that long. Can you talk a little bit about cloud provider standard support versus Kubernetes standard support, and what that relationship looks like, across AKS or AWS? Is there an AWS offering of extended support from 1.23 onwards? Is that something you could talk about as well? An interesting thing that came up at the contributor summit was: who's doing this now? It turns out a lot of folks are. VMware is doing this already, to support things inside of vSphere. AKS is going to do this in a slightly different way, starting with 1.27: not every version will be LTS, but the ones with extended support will get an extra year. I think AWS announced they're going to do two years of support for every version going forward. So it's different from cloud provider to cloud provider, and it has not landed upstream. A lot of this is downstream supporters doing things that make sense for their customers. Obviously Kubernetes, the community project, can't solve all problems for all people; it can't be an enterprise-supported thing, and we can't support the upstream artifacts as a product. A lot of people use the upstream artifacts in production use cases, but probably shouldn't — they should probably be getting them from a vendor or distributor. And I think the alignment between those things is an interesting problem. When people start offering two or three or four years of support for Kubernetes, that's a lot of versions. If you're building things on top of Kubernetes, that's a hard problem. How do you support that many different versions of the API server? If you're building Cilium, do you want to support back to Kubernetes 1.22? It's a hard problem to talk about. And I think the same thing happens for the cloud provider components that make Kubernetes useful on your clusters. At least for us internally, another really interesting discussion has been what we do for all of the add-ons in AKS. AKS has a number of things that make Kubernetes useful inside of Azure — I'm sure every cloud provider has those — plus things that just make your workloads easier, like using Istio as a service mesh. How does the support for those things align with all of this? That's another really, really tough nut to crack, I think, and it's definitely out of scope here; what we're discussing in the working group is core Kubernetes. It would be interesting to think about what LTS means at the CNCF level. We've seen that Prometheus now has long-term support versions, but that's just the Prometheus project; there's nothing cross-cutting across projects. That would be an interesting discussion to have, I think, though it also has a lot of potential problems and fallout that would need to be worked through. Any other questions? All right.
We'll hang out on the side here if you want to come over and ask anything not on a microphone. Thanks for coming. The walk over here was pretty far, so definitely a shout-out to you for making the long journey.