This is a developer talk, as you can see. I imagine that all of you are contributors, or willing to contribute, or at least interested in Kubernetes. Can the people who are Kubernetes developers raise their hands? And the people who want to contribute to Kubernetes, anybody here? Okay, so you are going to find this interesting, because this is a talk about a horror story that we had to suffer. You may think that Kubernetes is this fancy project, it's super cool, but there are a lot of horror stories, and this is one of them.

First of all, let me introduce myself. I'm Antonio Ojea. I'm working at Google now; I was working at Red Hat before.

And I'm Swetha Repakula, also at Google. I previously worked at IBM on different open source projects.

Okay, so first of all, for the people who are not used to Kubernetes, let's understand a bit better how the project works and how it is organized. On the developer side, the governance of the project is based on special interest groups, SIGs. We have SIGs that are horizontal, like API Machinery or Scalability, and we have other SIGs that are vertical, like Network, Node, or Scheduling. Most people are used to these verticals in their companies, and to the problems they represent. What is the problem? "This is not a problem in the network, no, this is a problem on your side," and you have this wall with things going back and forth over it. And this happens in Kubernetes too. We have people in SIG Node, we have people in SIG Network, we have bugs, and we never know who is responsible. But instead of fighting across the wall, we try to collaborate to solve the problem. This talk is about one of these examples, where a bug in a component that one SIG is responsible for was affecting another SIG, and how we solved it in collaboration.

So that is how the project is organized in terms of development. And this is how the Kubernetes components look. We have the kubelet, we have the API server, we have the controller manager, we have kube-proxy. The components are more or less related to one SIG, but that doesn't mean that the SIG is the sole owner. For example, in the kubelet, most of the code is the responsibility of SIG Node, but SIG Network has a lot of bits in there. The same happens in API Machinery, and the best example is the controller manager: it is a component that has a lot of code from different SIGs.

And at the end, we have all this structure, but we come to this conference, we have fun together, and we are just people. Most of us are paid and have companies paying our salaries, but not all of us, and I would say that most of us are not paid just to work on Kubernetes. So we dedicate a lot of our own time to this project, because we like it, because we enjoy it, and because it's something we have fun with.

So after this introduction, I want to start explaining what happened. I don't remember exactly when it was. In 1.32 there was a big refactor on the node side, in SIG Node, in the kubelet, about the pod lifecycle. I think that release was about a year old, but six months later, because there is always a gap between a release and when users actually start running it, we suddenly started to receive a lot of bug reports.
And you can see bugs like: "Pod with Failed status, IP address reused on new pods, but traffic is still going to old pods." People look at it and say, well, we have pod IPs here, this is SIG Network, okay? And then you have another one: "Terminated pods on a shut-down node still in service endpoints." Okay, service endpoints, this is SIG Network too. Then: "Kubernetes sending traffic to draining nodes." Draining nodes, this is going to be SIG Node. And this is how we triage. I mean, you don't really know; we don't have support people. People open the bug, we just go there and try to figure out, well, who can help here, and we tag them. Sometimes we tag a lot of people, and sometimes we don't and the issue just sits there orphaned. But that is more or less the idea of how we actually work, okay?

So in parallel, as Antonio was dealing with this in open source, I came at it from the customer issues I was getting from working on GKE, and they kind of fell into these three categories. Traffic was being routed to non-existent pods, so we would just get a bunch of complaints: wait a second, why is my traffic going to something that doesn't exist? It's being black-holed and your load balancers are broken. Then the next one was: oh, not only is your traffic not going to the right place, it's going to the wrong pod, which can be a pretty big security problem if it happens to you, and unfortunately we do have customers who faced that. And then the last one, which was the most interesting and kind of the only hint of what might be wrong: we look at these EndpointSlice objects, and for those who are familiar, an EndpointSlice basically groups all the endpoints for a service together. (I feel like my voice is going in and out.) And in that grouping, we saw that there were IPs showing up that just did not exist. I could not find what pods those IPs were connected to.

So let's take a step back. From those symptoms, I went and said: there is nothing wrong. The network is fine, the EndpointSlice controller is fine, we didn't make any changes in OSS. It can't be our fault.

Then we take a look at this pod lifecycle diagram. People are probably very familiar with this if you've deployed a pod before. You enter at the top and go into the Pending state before the pod is scheduled. Once it's scheduled, it might be Unknown for a little bit, but once the kubelet has probed it and it's getting marked as ready, it moves into the Running state. And those top three states can form a cycle, right? A pod gets restarted, it gets scheduled, it crashes, all sorts of things. There are two states at the bottom that are called terminal states: Succeeded and Failed. The key thing here is that once you get into one of those states, you cannot go back up to one of the other ones. That definition has been there for a while, but in 1.22 it was enforced in the code. And I do see SIG Node people here, so keep me honest. That was part of the refactor: making this change so that once a pod gets into a terminal state, it can't go back, and really its only remaining option is to be deleted. Which overall makes sense, right? From a pod lifecycle standpoint, that totally makes sense.
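As a minimal sketch of that terminal-state rule, assuming the standard k8s.io/api/core/v1 types (the helper name here is ours, not anything from the kubelet):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// isTerminal is a hypothetical helper illustrating the rule just described:
// once a pod reaches Succeeded or Failed it never moves back to Pending,
// Running, or Unknown; the only transition left is for the pod to be deleted.
func isTerminal(pod *v1.Pod) bool {
	return pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed
}

func main() {
	evicted := &v1.Pod{Status: v1.PodStatus{Phase: v1.PodFailed}}
	fmt.Println(isTerminal(evicted)) // true: this pod will never run again
}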
And then, just as a quick recap of how pod readiness works: the kubelet checks in on the pod, are you ready? The pod says yes, I am. The kubelet uses that and says, hey, let me update the API server. And then the bottom-left component there is the controller manager, which has the endpoints and EndpointSlice controllers packaged into it, and those read that information to make service and endpoint decisions.

So I'm going to dive in a little bit more. The first thing is pod shutdown, and this is before 1.22 and before the refactor. The typical scenario is that a user deletes the pod. It's an API call, and a deletion timestamp gets set on the pod. That gets signaled to the kubelet. I have simplified this diagram, there are other steps involved, but I'm going to keep it very simple. The kubelet signals the container runtime to shut down the pod, and the pod goes away. The kubelet also updates the status saying, hey, this is no longer a pod; it usually goes into Unknown. And another thing it does is remove the pod IP. There's a slight variation of this that happens when you're doing evictions. We skip the first few steps: the kubelet makes a decision and says, hey, this pod needs to go, maybe because there's no more memory or we need more CPU, and it directly shuts down the pod. In those cases, again, the pod IP is removed. When a pod is evicted it is also considered terminal, so it usually ends up in that Failed state.

If you look at the diagram at the bottom right, we have three different types of terminal pod. The top one is Completed, which means it was in the Succeeded state. The second one is an Error, which would be the Failed state. And the last one, Evicted, is also in the Failed state. And if you look at them, the evicted one is the only one that didn't have a pod IP associated with it. So if you were looking at this from a pod lifecycle or consistency perspective, you would think that for consistency it should keep that IP. So that's what the refactor did: it ensured that the IP was kept in the evicted state. We don't remove it, because that's what the other terminal pods look like; it doesn't matter what state you're in, your IP shouldn't be removed. The other key thing that changed is that probes are shut down before the pod is marked terminal. That will be key a little bit later on. So this is basically the change that happened, and so far, from a pod lifecycle standpoint, it makes a lot of sense. It actually cleaned up some of the confusion, because now you know what the pod lifecycle should look like.

Now we're going to move to the network side of things: services and endpoints, and what happens when we're processing these events. The controller manager, which has our endpoints controller, watches for these pod updates, and when it gets one it says, oh look, this pod got deleted, we should remove the endpoint from our endpoints list. kube-proxy, sitting on the node, reads that, and once it does it says this pod no longer exists, let me remove it from iptables. That is how the node networking works when you're depending on it. However, if we go into the bug, hopefully you can read that, there are two key pieces here. I'm digging into the endpoints code now to show you the subtlety of this bug.
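Very roughly, and only as a paraphrase of what the slide shows rather than the real controller source, those two checks look something like this (shouldInclude is a made-up name for the sketch):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// shouldInclude paraphrases the two checks walked through next; this is a
// hypothetical sketch, not the actual endpoints controller code.
func shouldInclude(pod *v1.Pod, tolerateUnreadyEndpoints bool) bool {
	// First check: if there is no IP on the pod object, skip this pod entirely.
	if len(pod.Status.PodIP) == 0 {
		return false
	}
	// Second check: leaving aside the tolerate-unready-endpoints case, a pod
	// with a deletion timestamp is being deleted, so skip it as well.
	if !tolerateUnreadyEndpoints && pod.DeletionTimestamp != nil {
		return false
	}
	return true
}

func main() {
	// After the refactor, an evicted (Failed) pod keeps its IP and is not
	// deleted yet, so it slips through both checks.
	evicted := &v1.Pod{Status: v1.PodStatus{Phase: v1.PodFailed, PodIP: "10.0.0.23"}}
	fmt.Println(shouldInclude(evicted, false)) // true
}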
The first line is telling you: if there are no IPs on the pod object, skip this pod. We don't care about it, we're not going to include it. For the second one, we're going to ignore the tolerate-unready-endpoints part; the piece that matters is the deletion timestamp. If the pod is being deleted, obviously we don't want to include it, so let's skip it.

So if we go back to our diagram, what this really means is that in an eviction case the kubelet now only marks the pod status terminal: the pod doesn't actually get deleted, and we also no longer remove the IP. And a key thing about evicted pods is that they stick around in the API until you hit a threshold and pod GC goes, finds all these orphaned pods, and removes them. So there is a window when these terminal pods exist in the system. That's what's happening here: the pod status is terminal, the pod is not deleted, and it has an IP. If we look back at the code, the pod is Failed but it passes both of those checks, and now this endpoint gets included in our endpoints list.

If we dive a little deeper into the endpoints code, we do have a few more checks. Namely, there's this little if-condition that checks whether the pod is ready or not, and that ready state is purely based on the kubelet update, the probe update. The second piece, the "should this pod be an endpoint" check, if we zoom in a little closer, you can see it actually does check for the terminal state, but it expects terminal pods to be pods with a definite lifetime, which is usually dictated by the restart policy. If your pod has been evicted, its restart policy is usually Always, not Never or OnFailure or something like that. So what ends up happening is that this pod is considered valid to include in our list, and the IP gets added to our endpoints object.

So let's put all of this together. What is actually happening? The pod gets evicted, the kubelet marks it terminal but doesn't remove the IP, and the endpoints controller says, hey, this pod should be included, and adds it in. The one good thing is that it's marked as not ready. And then we get to kube-proxy. kube-proxy will remove it from iptables because it's marked as not ready. So we've kind of been saved in this scenario: our endpoints object is incorrect, but because the endpoint is not ready, kube-proxy is able to work around it. In the end, this shouldn't really affect your routing decisions unless you're consuming Endpoints and EndpointSlices yourself and making other networking decisions based on them. So overall, this is a bug that exists but doesn't hurt us too badly. However, the story gets a little worse, so I'll pass it back to you, Antonio.

Mm, okay. So I think that by now it's more or less clear how this works, right? Everybody uses services, everybody uses endpoints, and as you can see, there are a lot of chain reactions in between. The other consequence of this change on the pod lifecycle involves another nice feature for service endpoints that's called terminating endpoints.
Many of you that deploy services and want zero disruption, or something like that, don't want the endpoint to drop out of the service the moment you delete the pod. You want some grace period so the pod can still receive traffic, okay? So for that, in SIG Network, we implemented the terminating endpoints feature. What does it mean? The endpoint state is no longer binary. It's not just ready and not ready; there is also terminating, and you can still send traffic to a terminating pod as long as it is ready. So we have this special condition in the EndpointSlice controller that allows traffic to be sent to pods that are terminating.

And one of the consequences of this refactor was another bug. The problem was that during the refactor, when the pod was going down, we forgot to keep the probes. Before, while the pod was going down, you still had the probes to say: is the pod ready, can I still send traffic there? But after the refactor, during this shutdown, the probes were removed, so the kubelet couldn't know the state of the pod. So what happens in this situation? The endpoints controller assumes that the pod is still ready while it is terminating. Whether the pod is actually gone or not ready is simply not reflected, so you end up sending traffic to a pod that is no longer there. And for people with ingresses or other kinds of services, that means 404s or connection drops. The fix was really simple, of course, for the people that know the code: just keep the probes running while the pods are terminating, okay?

So, after this, what are the lessons we learned? I think we are going too fast; okay, we'll have more time for questions. The lesson we learned is that these behavior changes can break other code. And what's important here? What's important is that we need to establish contracts in the code, and the best way to establish contracts is adding tests. Until 1.32 it wasn't clear how the pod lifecycle would work, what the concept of a terminal pod meant; maybe those concepts were clear in people's heads, but we didn't have them written down and enforced. And the rest of the SIGs were building on the system based on different hypotheses, so when the code started enforcing the concept, things started to break. And as you can see, everybody uses services, but only a few people really understand the whole chain of events from when you create the service to when the pod receives the traffic. What does that imply? When something in that chain breaks, things get very complicated and you need help from other people that have the knowledge in their area. It is impossible for one person to solve this kind of bug alone. This was a complex bug, it affected a lot of people, and it was solved with the effort of a lot of people, some of whom are in the audience, so we don't want to take the credit for this alone. I don't know if you have some questions; whatever you want, we are here. That's all. (I don't know, okay, this is not working.)

What's the best way to get the attention of another SIG? Because I know we don't always jump immediately on problems from other SIGs; we try, but we don't always do that. So what have you found is the best way to engage another SIG to help collaboratively solve bugs like this?

Well, I think this comes down to this picture: connection.
I mean, I knew Porte, who was working on this team, and I knew he was working in the area, so I reached out. That's the thing with open source: you don't have verticals or structures or procedures for communicating with others. And that's why this friendship, these events and everything are important; you need to get to know the people and you need to reach out. And don't be afraid of asking. All of us make mistakes, introduce bugs, and ask naive questions. So don't be afraid, just ask. I, for example, am a very verbose person, so I go to a channel and I keep asking in the channel. There's always one person that is going to reply. And if nobody replies, you need to think: maybe I'm asking the wrong question, maybe this is not the right channel. But I don't think there is a magic answer here. It's people, we are people. Just get to know each other, just ask; I don't think anybody should be afraid of us. But there is no formal process. I mean, you can open an issue against any SIG. We all cross paths in a lot of places, so just use these opportunities: come to these events, use the weekly meetings, the mailing lists, whatever you are more comfortable with. Just meet people.

I have the same answer. It's both a blessing and a curse, because unless you know people, you don't know who to reach out to. But I really like what Antonio said: just ask questions. I have not met a single Kubernetes contributor who did not answer my questions, so I would continue to think that's probably your best option. And if you start messaging in the larger channels and doing an @all, maybe not always recommended, every now and then you will get a response. I also like Antonio's suggestion: if you need to, you can just go attend that SIG's meeting and highlight your bug. We know that's happened a number of times for SIG Network and other SIGs.

Also true, yeah. Sorry, nobody has doubts about endpoints? I cannot believe that. Also, thank you. Before the next question: if you are users of Kubernetes, another thing that's important for us is that we don't have a QA team, we have users. It's sad to say, but users are the first ones that find these things. (CI is our QA team. Well, CI is something I don't want to get into; we have a SIG Testing meeting tomorrow where I'm going to talk about CI.) The point is, this is our feedback, and it is you, the users. Use Kubernetes, report the bugs, and help us make it better. Sorry, you can go ahead.

I was wondering if you could detail the process for navigating inter-organizational communication barriers. I know you said you were at Red Hat and at Google. Could you describe that process?

Are you asking how Antonio and I collaborated? I think so. So I think every project is a little different. For me, I'm relatively newer to SIG Network than Antonio is, and I just started attending the meetings and learning some names. I will say, once you learn some names and you come often, you start to feel like, oh, I can reach out to them and ask them questions. Antonio also reviewed quite a few of my PRs, and then you just start reaching out on Slack. Once you develop a relationship, it kind of continues; it's not going to end. And then through Antonio, I met other people.
Also, working at Google there are other contributors, so you can meet someone who'll say, oh, talk to so-and-so, they might help you. So if you are not a contributor who already has this big network, I would say first go to that SIG's meeting and just see who the faces are, who the names are. Typically most of the meetings start off with triage, so you'll see people's GitHub names, and a lot of people also try to make it really easy to match the name they use in the meeting to their handle on Slack. So in general, I think people are open to it. That first step might be hard, but definitely just go ahead. Even if you just say, hi, I'm interested in this, I'm pretty sure you'll get a pretty welcoming response.

I've seen that for all of us in Kubernetes, and in most open source projects, you really don't look at the company people work for. I only know people by their GitHub handles. So I may know the right person, but I really don't know their company, or I really don't care. We have the same goal: fixing the bug. Actually, this bug was also happening at Red Hat; it was reported internally there, it was reported internally to us, and we happened to end up working on the same bug together, with different origins but with the same goal, which is fixing Kubernetes, not fixing our company. I mean, we have to fix our company's things too, but the conclusion is that you forget about companies; this is about contributing and making things better for users. Oh, okay. Yeah.

All right, first of all, awesome presentation, and a really great summary of all the issues here. I had a quick question. One of the interesting things this raises is that there are certain invariants and guarantees in Kubernetes around different lifecycles, right? Whether it's networking, EndpointSlices, the pod lifecycle, et cetera. I'm wondering what your suggestion is, and what you would find helpful, for how we communicate those guarantees across the different SIGs and broadly in the community. Is it just documentation, or how do we specify: this is the lifecycle, this is when this field will be set, this is what you should depend on if you're writing a controller?

I think there's an easy part of the answer, because I do think documentation is the first step. For the components you own, if this is the behavior you've defined, at least there's a reference there: if somebody asks you something, you can point to it and say, hey, this is the thing. And it's great if the documentation stays as up to date as the code, but that is sometimes hard. In terms of spreading it to other consumers of your API, that's really difficult. One thing we learned in SIG Network is that we're dependent on the pod lifecycle, so we added a test, so that if the invariant changes we can either say no, no, or at least we're aware when we're coding. And unfortunately it falls a little bit on the consumers too, right? You need to be aware of the APIs you're using, and if you make an assumption, you either have it tested or you have it stated very clearly. And obviously now I can say, hey David, can you make sure any changes to these areas let SIG Network know. But until something like this happens, you don't know about the dependency, and even we didn't know about this dependency. So I don't think it's a panacea for all problems, but documentation is a good first step.
I am more radical: tests. Tests and more tests. If you have something, you put a test on it; if somebody wants to use it in a way it's not supposed to work, the test is going to fail. Documentation is great, and the pod lifecycle is pretty well documented. But the problem we had here is that the contract wasn't enforced, and one controller was making its own interpretation of that contract. So in this case, in hindsight, it is a problem in SIG Network: we didn't have a test to cover this. We assumed that endpoints and the pod lifecycle were this ideal thing. We have to do both: we have to document things so we know how they work, and then we need to add tests so we know we don't regress. That's what I'd say; it's not magic, it's tests and documentation.

Makes sense, thanks. Yeah, thank you.

This was a really cool presentation. You mentioned that there was a disconnect between the Node and Network SIGs here, and I was curious: how many people are active in multiple SIGs at once? And how do different SIGs prioritize their issue backlog based on things that other SIGs bring up?

I don't... can you rephrase? I didn't get the last part.

Sure. Yeah, mostly the second one: how do different SIGs prioritize the things they're working on based on what people outside that specific SIG bring up?

Okay. I mean, that's the complexity of a project like this. I'm working in open source, I'm working with companies, and that's the problem: you don't have a formal process, it's more about people and gut feeling. I started working on this in Valencia, with some of the people who are here. I saw these bugs and started working on them, and some people said, well, we had similar bugs, and then we started working more on this because we knew that something was going on. You know something is bad and you want to fix it, and this is something that is impacting the project today. You can come in with the idea: this is important, I want to delay other things for it. But if you ask me whether we have a backlog with story points and priorities, I personally don't have that. We try to do that from the SIG, we try to prioritize, but in the end I think it's more about the people, their gut feeling, and their willingness to fix things. Of course, if you have a bug that is affecting your product at your company, that's going to be prioritized, but then you are working in a different workflow. We're talking here about a bug in Kubernetes that is not tied to any one company. This is about people. Maybe you are my friend and you tell me, this is affecting me, and in the chat I say, okay, give me one hour, let me check it, and I fix it. There is no formal workflow there. We try to follow some things, but open source is this way: it's about people.

Yeah, that's for sure. Yeah, thank you.

Okay, can we have another question? Okay, well, we can follow up offline if you want to ask anything. Thank you.