Hi! Welcome to our KubeCon talk, We Didn't Start the Fire. My name is Ian Coldwater. And my name is Kat Cosgrove. This is a talk about communication breakdowns, a brief history of the way some of them have gone in the Kubernetes project, and how we can prevent them and do better in the future.

Once upon a time, way back when you could still count the cards on the CNCF landscape, the Kubernetes project was smaller and newer, and communication wasn't really a thing that we had to think about as much. A lot of people knew each other. People were more likely to come from similar backgrounds, and assumptions about shared context between maintainers and users were more likely to hold true. Kubernetes is a much bigger community now, though. Many of us don't share the same context anymore, or even the same concerns. Our backgrounds and our background knowledge can vary pretty widely. We don't all have the same understandings anymore, and we can't make the same assumptions about each other.

For a long time, the Kubernetes project resisted introducing breaking changes at all, to promote growth and backwards compatibility. The first breaking change released in Kubernetes was in version 1.16, with the v1beta1 API deprecations. At that point, we ran into an issue, one that has affected the project ever since. At first, for a long time, we didn't really have to think about how to communicate breaking changes to end users because, frankly, we didn't have any. By the time we did have to think about it, we were a lot bigger than we were when we started out. The population of end users looked different than it had before, but we hadn't actually adjusted our assumptions to match.

We've kept learning this lesson, or at least we've kept needing to. Communicating changes to users without understanding each other's needs and concerns has had a series of unexpected results. We've kept making assumptions, and in turn, messages keep being received in ways we might not have thought they would. This is a talk about mistakes, a talk about community and about growth. We're going to talk about where we've gone wrong, but we're also going to talk about the good that's come of that. Let's go through a brief history together and see what we can learn from communication breakdowns that have happened over time. We can take those lessons into the future with us and do better for each other together.

We begin this history with v1beta1. We don't hear about v1beta1 much anymore, but we do still feel the effects of it today. So, Ian, why don't you tell us what happened?

It was announced a couple of versions in advance that in Kubernetes 1.16, for the first time, there would be breaking changes for users. Some deprecated APIs were going to be removed. Some endpoints that had ended in v1beta1 were going to change to v1, and ones that had started with extensions before were now going to start with things like policy or apps. This was going to cause breaking changes for users, especially ones who had hardcoded these endpoints into their code bases. We've all seen spaghetti code; we've all been there. For some users, this was going to be a big deal. But we didn't entirely realize that at first. We started by communicating those details the way that we communicated details about any other change in Kubernetes: in the changelogs and the release notes.
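To make that concrete, here's a minimal sketch of what this migration looked like for a Deployment whose manifest had the old API group hardcoded. The resource names and image are illustrative, and this is not an exhaustive migration guide.

```yaml
# Before Kubernetes 1.16: Deployments could still be served from the
# deprecated extensions/v1beta1 API group.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: web               # illustrative name
spec:
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25   # illustrative image
---
# After 1.16: the same Deployment must be served from apps/v1. Note that
# apps/v1 also requires spec.selector, so an old manifest can need more
# than a one-line apiVersion change to pass validation.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25
```

The same kind of pain applied to scripts and tooling that had REST paths like /apis/extensions/v1beta1/... baked in, which is where hardcoded endpoints really bit people.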
We also knew that because this was going to affect users more than your average change, we would put it out in other community resources. There was a TGIK episode made about it in advance of the change. If you've never heard of TGIK, it's great; highly recommended, TGIK.io. And there was also a blog post released on the Kubernetes blog to let users know about the deprecations and the changes that were coming up, so that they would be able to plan for them. All of this was good. We put it out in a range of community resources to let users know.

But not all users got the memo. Some users didn't hear about the change until right before it happened. Or worse, some users started finding out about it after their APIs started breaking. Although we had put out announcements on this range of community resources, not all users were that connected to the community, or were even aware of the existence of these resources. They didn't really know where to look.

Core maintainers had made two assumptions here. First, that these changes wouldn't be a big deal for most users. And second, that most users knew where to find detailed information about upcoming releases. Maintainers were used to thinking of things in a much smaller context. For a lot of Kubernetes history up until this point, we were our own users. So we really did know what most of the users were doing, what they wanted, and what they knew. When Kubernetes started seeing widespread adoption, nobody stopped to ask: hey, who are our users now? And as it turned out, the population of end users by then looked different than it used to. The newer population was larger. They had more varied backgrounds and more varied levels of background knowledge. And often they were more disconnected. They weren't plugged into the community, and maybe didn't know where to find those community resources. We don't all know each other anymore, and with greater, more rapid adoption, that becomes more and more the case. For that matter, our users may not all know us anymore.

If we had talked to end users more to understand their use cases and concerns, we might have better anticipated in advance how these changes would be received. As it turns out, more users had been hardcoding their endpoints than I think a lot of maintainers had originally realized. And a lot of users got kind of freaked out by this. A lot of users didn't want to update, in order to avoid these breaking changes. And the fact that a lot of users were really hesitant to update because of breaking changes caused ripple effects on the project that continue today. Officially, Kubernetes has never had a long-term stable version. But functionally, in some ways, by now, it's kind of played out that way. People are still running 1.15 in the wild, which is way out of patch cycle for us; we support n-3. This not only has security implications, an issue near and dear to my heart as a Kubernetes SIG Security co-chair, it has had effects on the decisions made by public clouds and managed Kubernetes distributions. In turn, this affects our decisions as an upstream project.

So, okay. This wasn't really big and flashy and dramatic. You probably didn't hear about it much, but it does continue to affect us. That was the first breaking change, and maybe the first communication breakdown involving communicating a breaking change. We've had others since. One you may have heard about is Dockershim.
So, an announcement was made that in version 1.20, a deprecation warning would be put in place for Dockershim, and that in version 1.22, support for it would be removed entirely. This deprecation had been discussed within the Kubernetes project for quite some time, but that news hadn't really made it out to end users so much. And while this change wouldn't affect all users, it would affect some cluster administrators pretty heavily. So, having learned lessons from what happened with communication around v1beta1, we wanted to make sure that a change that could affect users would be known to those users early, so they could have enough lead time to plan around it. And we wanted to make sure that it would happen where users could see it.

Now, if you aren't aware of what the Dockershim is, which would not be surprising considering the way this story plays out, here's a little bit of context for you. When you're using Kubernetes, you need to specify a container runtime. In a lot of cases, people just go with Docker as the runtime because it's familiar and that's what they use in development. On the surface, that seems fine, but Docker isn't actually just one thing. It's a whole tech stack. And inside of the Docker stack is a container runtime called containerd. That's what Kubernetes needs to get at when you select the entirety of Docker as your runtime. But unfortunately, Kubernetes can't just do that. It requires a software shim, because Docker itself, the whole stack, isn't compliant with Kubernetes' Container Runtime Interface (CRI). That shim is an entirely separate thing that people have to maintain, and it had been a pain point for quite a while. So it was decided that support for the Dockershim would be deprecated, forcing people to use containerd or CRI-O or another compatible runtime directly instead. This change was ultimately good for the community, and some people thought it wouldn't be a big deal.

It did, in fact, turn out to be a pretty big deal, though. One thing that a lot of us as core maintainers really misunderstood here was the level of understanding of a lot of end users. If all of the things that Kat just said sound like deep technical arcana to you, that's understandable. It sounds like that to a lot of people. A lot of people don't actually know the difference between a Dockershim and a Dockerfile or, for that matter, between Docker and Kubernetes. There's a lot of confusion around this stuff, and I think a lot of us might have kind of overestimated how much understanding there would be of these finer details. And what happened when we announced this to end users, without understanding their levels of understanding in a lot of cases, was that there was a lot of panic. End users were confused and end users were scared. That was not quite what we were going for.

So, okay, but let's go back a little bit. What happened here? The person who originally put in the deprecation PR for Dockershim announced on Twitter one morning that that was going to happen. And he tweeted out that Docker was going to be deprecated in Kubernetes. He did not mention Dockershim by name, probably, I would guess, assuming that people would know what he meant. He asked for that tweet to be amplified. And first thing in the morning, before coffee, I got bright ideas and I thought I would come help. As somebody with a platform on Twitter, I could amplify it. I would get attention on it. Users would see it and they could make a plan. So, Ralph Wiggum style: I'm helping!
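A quick practical aside before the story continues: if you wanted to know whether a change like this would touch your clusters at all, each Node's status reports which runtime the kubelet is actually talking to. Here's a minimal sketch of the relevant excerpt, with illustrative version numbers:

```yaml
# Excerpt of a Node object, e.g. from `kubectl get node <name> -o yaml`.
# The version values shown here are illustrative.
status:
  nodeInfo:
    # A node going through the dockershim reports the Docker version:
    containerRuntimeVersion: docker://20.10.7
    # A node talking to containerd directly over the CRI would instead
    # report something like: containerd://1.4.4
```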
I went and tweeted this with a kind of intentionally clickbaity headline, which said: this isn't going to affect all users, but it is going to affect some. Pay attention to this. If it does affect you, all caps, IT WILL BREAK YOUR CLUSTERS. And I meant well. I wanted to get eyes on it. I wanted users to see it. It was asked to be amplified; I amplified it. I did mean well. But it maybe didn't quite go the way that I thought it would. I did this, well-intentioned, and then I had to go to work, because I had started a new job a couple of days before and I needed to go onboard. But before I went to work, I had started getting questions from end users who were pretty confused about what was going on here. Some of them maybe sounded kind of scared. And I knew that I wasn't going to be able to answer their questions immediately, because I had to go. So I went onto the CNCF ambassadors channel to ask if some ambassadors could come help field these questions from end users, since I had to go for a while. I left for work hoping for the best. And then things got weird.

So Ian's at work, and a few people are clearly confused. I've got some free time. It's early in the morning for me, like 6 a.m. Why not step in and answer some questions on Twitter for a bit while I drink my coffee, right? Well, after a little while, I noticed that people were generally asking the same questions. So I think to myself: aha, I will write a Twitter thread. It will be helpful and people will love it and I will save myself time. Problem solved. Chef's kiss. Literally none of that would turn out to be true. I thumbed out like a 13-tweet thread explaining, in extremely simplified terms, what the Dockershim is, why it was being deprecated, and what that meant for users. And at first it was helpful and it was saving me time. But then it got retweeted again and again and again, until suddenly it was huge. And there were questions from developers who knew that their company uses Kubernetes, but who don't touch it themselves. And these people were in a panic. Answering questions and reassuring developers became my full-time job for the next 36 hours, while I simultaneously wrote a blog post and FAQ with a few other maintainers, explaining the situation in more detail to try to quell everyone's fears.

I got off work a few hours later and then realized the extent of the disaster I had wrought. Users were confused and they were scared and they were panicking. Docker was being deprecated. Did that mean they couldn't develop with Dockerfiles anymore? What was going on here? They were freaking out. Everybody was having to scramble to deal with this. And I felt terrible. I meant well. I wanted to be helpful. I figured I would, you know, try to amplify this to get attention on it so that end users could see it and plan around it. But I wasn't really very thoughtful about the way that I put that. It certainly did get attention, in the way that it would have if I had shouted fire in a crowded theater. But as these situations do, that got pretty out of control, and that wasn't what I meant to do at all. I felt so bad. I felt terrible for scaring end users. I felt terrible for putting work on my friends. I apologized to everybody, and everybody was really nice about it. But I felt awful. I felt like it was my fault. It wasn't all my fault. But I really messed up, and I messed things up for other people. I didn't want to do that to anyone. I don't think any of us did. Ian and I definitely both made mistakes communicating here.
I wanted to answer a few people's questions, and I wanted to save myself some time while doing it. But I forgot that most developers actually don't touch Kubernetes. They just know that they use Docker in development and that Kubernetes is used somewhere else at their company. So the tweets made them think that Docker was dead and needed to be replaced as a whole. Ian didn't expect their tweet to set the house on fire like that, and then they couldn't deal with it while they were at work. Neither of us expected our tweets to get anywhere near as much attention as they did. Those failures caused people to panic unnecessarily.

The intent of the Dockershim deprecation announcement, as it originally occurred, was good. We wanted to avoid the problems from v1beta1 by giving users lots of notice that a change was occurring, so they could plan around it. And we wanted to make sure they would see it. We did achieve that. But we missed the mark in some other key ways. And from there, we learned that we actually have to have a better understanding of what other people's understandings are. We have to talk to others. We have to understand. Because we don't actually know what other people know. And for that matter, we don't know what we don't know.

But before we wrap up Dockershim, we should mention the Coldwatering. It's fun, I promise. A couple of weeks before the Dockershim deprecation took over, I had cosplayed as Ian for a panel at KubeCon North America. It was very funny at the time, but it didn't really go anywhere. However, for some reason, the two of us being involved in Twitter drama together reignited that. And the result was several days of people cosplaying me cosplaying Ian. There were wigs on everything from carps to Pokémon, you name it. Literally hundreds of accounts changed their avatars to match. It was surreal. But it was a welcome distraction from the huge amount of stress we'd experienced while we were putting out the Dockershim fires. I really appreciated that moment of wholesome silliness. I feel like the Internet needs more of that.

That moment of levity was nice, but we as a project still had a lot of work to do coming out of that. So collectively, we thought about how we could improve. Again: how could we avoid a situation like this happening next time? We didn't need to just give a lot of lead time and shout things from the rooftops. We needed to think about what we were shouting from the rooftops and how those messages would be received. We needed to put more thought into our communications when we were communicating breaking changes, to avoid causing panic for end users.

So, inspired by some of the lessons learned from the Dockershim deprecation, as well as previous incidents where communicating changes to end users went maybe not quite the way we had originally expected, the community got together to figure out what we could do better in the future. And from there, we got the Contributor Comms committee: a group that was going to figure out how to craft messaging to users around changes and deprecations, so that users could get the information they needed without causing panic or trouble for anybody. This was a great idea and an important effort, and maybe not an entirely perfect one. It plays a part in the next communication breakdown we're going to talk about: the Pod Security Policy deprecation, and how some of the communication around that has gone.
You might not have heard much about this one, or at least not all of it, because it actually ended up going well. So, Pod Security Policies. What's up with those? What happened there?

Pod Security Policies are a very old feature in Kubernetes. They've been around for a long time, and they're really important: they allow users to set security controls on their pods. But while they're important, they've also had problems for years. They have usability problems; users have found them confusing. And they've had problems from a maintainer perspective. As we've found out more about what isn't working for users in them, we've also found out that we can't actually make meaningful changes to fix those problems without breaking the API. This is problematic, and we've known it was problematic. It has been a matter of discussion and debate within security corners of the project for years. We all knew that PSPs had to go, but we couldn't really agree on what to replace them with. And so there was a lot of debate, but not a lot of progress made on it for a long time. In the meantime, because PSPs had been around for so long and were so problematic, it was decided that they were going to have to be deprecated. They couldn't stay forever. However, people started talking about that before there was actually a plan. This had the effect of lighting a fire under us, because we knew we had to come up with one. But for a while, there was kind of a question mark. They were going to go; we didn't know what we were going to replace them with. What would this mean for users?

Contributor Comms got together and figured that this was a recipe for a scenario that might cause confusion and panic for end users. And fair enough: something that was really important to them, that affected the security of their clusters and affected a lot of users, was going to be deprecated, and we didn't have a plan to replace it. Oh, no. So Contributor Comms figured they would get in front of the narrative before people started talking about this more publicly and causing panic for users. And in order to get in front of the narrative, Contributor Comms figured they were going to write a blog post. They talked amongst themselves, they figured out what was going to be in it, and they wrote a draft post about the fact that this was going to be happening, in an effort to get information out to users so that they would know what was going on.

Now, this was well-intentioned, but it fell down a little bit, because Contributor Comms were only talking amongst themselves about what was going on. They wanted to get the post out quickly, and in the interest of speed, they decided to do it themselves. But they didn't actually talk to people who were closer to the process. And so the information that they were going to bring to users was outdated, and actually, I would argue, a little bit scarier than what was actually going on. Because on our end, in SIG Auth and SIG Security, we didn't know that that was happening over in Contributor Comms. What we were doing was actually making a lot of progress and coming to pretty good conclusions about what was going to come next to replace Pod Security Policy. We were feeling really optimistic about this, and it was going great on our end. We had no idea what was going on over there. And they didn't really know what was going on on our end, because they didn't ask us.
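For context, here's roughly what one of these objects looked like. This is a minimal illustrative sketch, not a recommended policy; and for the curious, the replacement work described here eventually shipped as Pod Security admission.

```yaml
# A minimal (illustrative) PodSecurityPolicy. PSPs were cluster-scoped
# objects that constrained what pods were allowed to do. One of the big
# usability complaints: defining the object did nothing by itself; it only
# applied once RBAC granted "use" on it to the right service accounts.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: example-restricted
spec:
  privileged: false          # disallow privileged containers
  runAsUser:
    rule: MustRunAsNonRoot   # pods must not run as root
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:                   # only these volume types are allowed
    - configMap
    - secret
    - emptyDir
```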
So the post that was going to go out had outdated information about there not really being a replacement plan yet, while on our end, we were making one. Eventually, we heard that this was happening, and we scrambled amongst ourselves to try to figure out what to do. We knew it was important to get communications about this out to users, but we were about to come to a conclusion that we thought was pretty good. So we put a hold on the blog post, talked amongst ourselves, and looked at the draft. It needed some work. It had outdated info in it that we thought might actually freak users out more. And in talking amongst ourselves, we ended up making the same mistake that Contributor Comms did. We figured we would rewrite the blog post to better reflect current events, but we didn't tell Contributor Comms that we were doing that. And so Contributor Comms, very fairly, were like: what is happening here? This was supposed to be out weeks ago. What are you doing? On our end, we could have communicated better with them, too.

So what we learned from this: it's really important to talk to the people who are closer to the process, whether that's closer to the process of making a change or closer to the process of communicating that change out to users. As it turns out, this actually ended up pretty well. Eventually, we did write a new post that better reflected current events, and we got it out to Contributor Comms, who liked it. It was maybe a little slower than we thought it would be, but it turned out well. The information that we ended up putting out to users was clearer and more actionable, and it did actually better empower them to make plans for the future, as PSPs did get deprecated. So what we can learn: everybody needs to talk to each other more, because everybody messed this up a little bit. But also, everyone learned. We're all still friends. It all turned out okay. And I think the change, although communicated a little more slowly, ultimately ended up being communicated better. Speed isn't really the only factor at play.

So Dockershim taught us that we needed to talk to our community. But this showed us that we forgot to talk to one another. Remember, we're used to maintainers being a really small community, and that just is not true anymore. Kubernetes is big now. A lot of maintainers don't know each other, much less know what each other is thinking. Some of us have been here since the beginning, but some of us are pretty new. Like me. Hi. A whole lot of us are new. This incident reminded us of that fact, and hopefully this time we won't forget it, because we're well beyond the point of being able to assume that everyone has the same context.

We narrowly avoided disaster here. If we had talked to the relevant SIGs before getting to work, we might have saved ourselves a lot of time and grief. We did avoid it, though, and that's important, because we learned from the last two disasters. We saw the problem, and we course-corrected before the ship hit the iceberg. Instead, we just kind of clipped it. There was some damage, but it's just scratched paint. And if we learn from this one too, we get closer to being able to just sail. We learn a little more every time. I want to point that out and keep pointing it out, because it's true. In the interest of continuous improvement, in the grand DevOps tradition of learning from incidents and failures so we can do better next time: we're doing that. We're doing it right now.
And that's awesome. We're doing it together. We've gone through a brief history here of some communication breakdowns, between maintainers and users and between maintainers and maintainers, about how to communicate changes. And we're hoping that we can take these lessons and learn how to do that better in the future. So how can we do that? How can we take the lessons that we've learned from this to improve going forward? We have some ideas.

One: we need to talk to each other. And when we say each other, we mean all of us. Fellow maintainers need to talk to each other. People need to talk and work together across SIGs. We need to talk to our end users. We need to talk to developers if we're administrators, or maybe the other way around. Really, the whole community needs to be understanding each other's perspectives and talking together. This is especially the case for understanding the perspectives of the people who are most affected, which could be any number of people depending on what we're talking about. But it's especially important to hear from them and understand where they're coming from.

Two: we need to be careful about what assumptions we make. Other people know things that we don't know, and we have no way of knowing what that is unless we ask them. And we don't always know what we don't know. Other people won't always share our context, our understanding, our knowledge, or our concerns. We just need to ask, not assume, because assumptions make an ass out of you and me.

Three: know that we're going to screw this up again, but that that's okay. This is unavoidable, because we're not mind readers. These kinds of mistakes are normal. None of the things we've talked about today are a single person's fault, or even really all that unexpected. This is a totally natural result of a community growing very quickly, and nobody should leave this talk feeling bad. Ultimately, we all want the same thing: to build better tools for our community and minimize negative impact. So let's learn to recognize communication breakdowns earlier, before they become disastrous. We're demonstrably getting better at this, actually. The series of events from v1beta1 to Dockershim to PSP shows us that we're learning and we're growing, and you all deserve high fives for that.

If we look around at the Kubernetes community, at least metaphorically in this virtual setting, we can see how amazing this community is. We have built amazing things together. We have built this amazing community together, and that's not always going to go entirely smoothly. Mistakes do happen, but we can make them less likely to occur and less destructive when they do. Sometimes it just takes a moment to think about where other people are coming from, and about how we're communicating to people in relation to that. We have built so many amazing things together. We're going to continue to, and we believe that if we work better together, we can do better together, and we can build a stronger community and a better platform. I believe in us. We're already doing better at this. We've made mistakes and we're learning from them. We'll make more. We'll continue to learn. I believe that we'll be able to continue to do better as a community. It feels really important to us to talk about these mistakes so that we can learn from them going forward. And I think we can. I think we already have, and I think we will. We got this. I believe in us. Thank you so much for coming.
I'm Kat Cosgrove, and I'm Ian Coldwater, and thank you for coming to our KubeCon talk. Bye!