Thanks, everybody, for coming. This talk is called Stretching the CNI Boundaries: A Service Mesh Roadmap for the Future. So first, a little bit about myself. My name is Alex. I'm a software engineer at a company called Buoyant, and I'm one of the Linkerd maintainers, so I feel very lucky that I get to spend my day job working on open source software, which is awesome. Linkerd is a pretty cool project that I work on; it's a service mesh. It's the only graduated service mesh in the CNCF, and lots of people use it, and it's very cool. So this talk, in retrospect, could have been called How Service Meshes Use and Configure CNI, What Sucks About It, and What Can We Do Better. We're going to be talking a lot about some of the problems that service meshes have encountered, how CNI can be used to solve those problems, and what additional problems that raises. I'm going to be talking a lot about Linkerd throughout this talk because that's the project I work on, but all of the ideas here are broadly applicable to service meshes in general, and I'll try to call that out as we go. In particular, I think Istio has run into a lot of the same problems I'm talking about here, and has even implemented some very similar solutions. So this is basically what we're going to cover. We're going to talk about what Linkerd is and what a CNI plugin is, to give you some background so that this stuff makes sense. And I'm going to talk about two problems in particular that we ran into: what happens when there are multiple CNI plugins installed and there are conflicts between them, and how do you deal with that? And what happens when a new node joins the cluster or restarts? There are some race conditions that can occur with CNI plugins when that happens, and I'm going to talk about those as well. Okay, so first, some background: what is Linkerd? I'm not telling.
Hopefully, if you're here at 2:55 at ServiceMeshCon, you've at least heard of Linkerd, and you at least know the concept behind a service mesh. For this talk, I really want to focus on how service meshes use CNI rather than spend a lot of time on the background. If you are interested in learning more, there's a really good talk on Wednesday called Whose Packet Is This? Life of a Packet Through the Service Mesh, from Kevin and Doug. That's going to go into a lot more detail about the iptables rules and the routing that Linkerd uses. I'm going to talk about that at a high level today and talk about how those things get set up, but if you really want to get into the nitty gritty of what those rules are and how they work, that's going to be a really, really good talk. For something a little more high level, I also have a talk on Thursday, which is an overview and state of Linkerd. That'll cover at a higher level what Linkerd is, how it's used, what's new in the past year or so, and what's coming up on the roadmap. I highly recommend both those talks if you want to learn more, but for now we're going to focus on how service meshes use CNI specifically. You do need a little bit of background, so I do have to tell you a little bit, which is that for the purposes of this talk, all we really care about is that a service mesh runs a sidecar proxy, here the Linkerd proxy, in each pod. That proxy is responsible for intercepting all inbound and outbound traffic for that pod and doing stuff with it. And the way it accomplishes that is that traditionally there is an init container, which also runs in the pod. That init container runs before any of the main containers, and its job is just to set up iptables rules, which configure that routing.
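To make this concrete, here's a rough sketch, in Go, of the kind of rules that init container sets up. This is a simplified illustration, not linkerd's actual proxy-init code: the port numbers (4143 inbound, 4140 outbound) and the proxy UID (2102) match linkerd's defaults, but the real rule set also uses dedicated chains and skips loopback and configurable ignore-ports.

```go
package main

import "fmt"

// redirectRules returns a simplified version of the iptables commands
// an init container like linkerd's proxy-init would run inside the
// pod's network namespace.
func redirectRules(inboundPort, outboundPort, proxyUID int) []string {
	return []string{
		// Inbound: any TCP connection arriving at the pod is sent to
		// the proxy's inbound listener instead of the application.
		fmt.Sprintf("iptables -t nat -A PREROUTING -p tcp -j REDIRECT --to-port %d", inboundPort),
		// Outbound: traffic from the proxy itself passes through
		// untouched, so the proxy doesn't loop back into itself...
		fmt.Sprintf("iptables -t nat -A OUTPUT -m owner --uid-owner %d -j RETURN", proxyUID),
		// ...but everything the application opens is rewritten to the
		// proxy's outbound listener.
		fmt.Sprintf("iptables -t nat -A OUTPUT -p tcp -j REDIRECT --to-port %d", outboundPort),
	}
}

func main() {
	for _, r := range redirectRules(4143, 4140, 2102) {
		fmt.Println(r)
	}
}
```

Running these commands (as the init container does) is exactly what requires the NET_ADMIN and NET_RAW capabilities discussed next.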
So anytime someone tries to establish a connection to anything in that pod, those iptables rules will reroute that connection to the proxy, and the proxy can handle it. And similarly on the outbound side, whenever anything in the pod tries to connect outside of it, those iptables rules will intercept that connection and rewrite it so you're connecting to the proxy instead. So the proxy intercepts in both directions. In order to do this, the init container needs some elevated privileges; specifically, I think it needs the NET_ADMIN and NET_RAW capabilities in order to create those iptables rules. The problem with this is that it means all of the pods in your cluster that want to be part of the mesh need to have these privileges. Most of the time that's fine, but sometimes you really want to lock things down, and having every pod in your cluster hold these privileges can be a problem. So the question we asked ourselves to get around that is: is there any way to set up these iptables rules ahead of time, so that we don't have to grant these privileges to everything in the cluster? Is there some way we can get this set up for every pod without needing those additional privileges? And CNI, the Container Network Interface, seemed like a promising way to do that. So what is CNI? CNI stands for Container Network Interface, and it's basically a system where you can have these things called CNI plugins. CNI plugins are installed on the host, and they're called by the container runtime whenever a new pod is created. Their job is to set up the network appropriately for that pod. What that means can vary a lot depending on the plugin; different plugins do different things. They may set up the pod network, they may create network interfaces, they may set up firewalls or throttling or other types of policy. But basically their job is just to set up the network for that pod.
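The contract between the container runtime and a plugin is pleasantly small: the runtime executes the plugin binary with a CNI_COMMAND (ADD, DEL, CHECK, VERSION) and details like CNI_CONTAINERID and CNI_NETNS in the environment, feeds the network configuration JSON on stdin, and reads a result JSON from stdout. A hand-rolled skeleton of that shape might look like this; real plugins would use the containernetworking skel helper library rather than doing this by hand.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// netConf holds the handful of top-level fields every CNI network
// config carries; real plugins define many more.
type netConf struct {
	CNIVersion string `json:"cniVersion"`
	Name       string `json:"name"`
	Type       string `json:"type"`
}

// handle dispatches on the CNI command and returns the result JSON
// the runtime expects on stdout.
func handle(cmd string, stdin []byte) (string, error) {
	var conf netConf
	if err := json.Unmarshal(stdin, &conf); err != nil {
		return "", err
	}
	switch cmd {
	case "ADD":
		// This is where a plugin like linkerd-cni would set up the
		// pod's iptables rules inside its network namespace.
		return fmt.Sprintf(`{"cniVersion":%q}`, conf.CNIVersion), nil
	case "DEL", "CHECK":
		return "", nil
	default:
		return "", fmt.Errorf("unsupported CNI_COMMAND %q", cmd)
	}
}

func main() {
	cmd := os.Getenv("CNI_COMMAND")
	if cmd == "" {
		// Not invoked by a container runtime; just demo an ADD.
		out, _ := handle("ADD", []byte(`{"cniVersion":"1.0.0","name":"demo","type":"noop"}`))
		fmt.Println(out)
		return
	}
	stdin, _ := io.ReadAll(os.Stdin)
	out, err := handle(cmd, stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```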
And so we thought, well, maybe this is a good place for us to set up those rules for Linkerd, right? Because Linkerd needs those iptables rules that intercept the traffic going in and out. If we had a CNI plugin, it would get called whenever a new pod is created, and it could set those things up, and that way the pod itself wouldn't need those privileges, because it would all be set up by CNI. To get a little more specific about what a CNI plugin is, it's mostly made up of just two files: a configuration file and an executable binary, and those live on the node filesystem. Anything that's in the right place and in the right format on the node filesystem will get recognized as a CNI plugin, and whenever a new pod is created on that node, the container runtime is going to call that plugin and say: hey, you're a plugin, and you're supposed to set up the network; do your thing. And you can have multiple plugins installed at once. So rather than having just a regular .conf file, you can have a .conflist file, which lists a bunch of installed plugins. Those will run in what's called chained mode, which means that each of those plugins gets run sequentially whenever a pod is created. So, like I said, we can have this linkerd-cni plugin which configures the iptables rules for a pod. And if we do that, then we no longer need that init container, because those rules are set up by CNI when the pod is created; we don't need an init container to do it, and we also don't need those elevated privileges. So it's great, and it works great; Linkerd has been doing this for a long time and it's fantastic. But it does raise a question: how do you get that CNI plugin set up in the first place? Remember I said that it's made up of files on the node filesystem. So how do they get there? Who puts them there?
Traditionally, when you're using CNI plugins, it's the responsibility of the cluster administrator to set that up. But it's a little bit involved. You have to take these files and put them on the node filesystem, and in order to know how to do that, you need some amount of CNI expertise. You could make a mistake doing it: where do you put the files? What format do they have to be in? And you probably want to integrate this with your automation in some way, so that you're not doing it manually for every node. So it's a non-trivial task to get this set up. One of the philosophies behind Linkerd is that we want to alleviate as much as possible the kind of manual administrative work that can be automated, and take that burden off of cluster administrators wherever it makes sense to. So we asked the question: what if a CNI plugin could install itself? Remember, a CNI plugin is basically just two files. So what we did is we created a command in the Linkerd CLI, linkerd install-cni, and you can pipe its output to kubectl apply, and that will install the CNI plugin for you. The way that works is it creates a daemon set that runs a pod on every node, and what that pod does is mount the node filesystem and copy those files into place. Very simple. But what it means is that once you run that, those files will be copied onto every node, and from that point forward, any pods created on that node will run with that CNI plugin installed. The container runtime will call into that plugin and say, hey, there's a pod being created; I need you to set up the network for me. It's going to create those iptables rules, and now that pod is all set up, and all of its traffic will go through the proxy as intended. So this works wonderfully most of the time, but we did run into some problems, especially for people who were running multiple CNI plugins.
We would get these error reports saying, hey, I'm running with Calico, or I'm running with Cilium, or I'm running with whatever, and things aren't working. So, like I said, CNI is designed so that you can have multiple CNI plugins installed at the same time, and that's fine, because they'll run in chained mode and get called sequentially and everything is great. But if you have two different CNI installers that are unaware of each other, what are they going to do? They're both going to run on the node, they're both going to mount the filesystem, and linkerd-cni is going to copy its files into place. If you have something else, like for example the Calico operator, doing a similar thing, it's also going to copy its files in, and those files can conflict or shadow each other. They won't necessarily be chained, because those two installers are not aware of each other. Since neither installer knows what the other one is doing, you can get into a situation where they're overriding each other. And remember that when you have just a single CNI plugin, you have what's called a .conf file, but if you have multiple CNI plugins, you need a .conflist file. Since neither of those installers is aware of the other, they don't know to create that .conflist file: they both think they're the only one there, and they step on each other's toes. The result is that you end up in these very broken network states where you think you have multiple plugins installed, but only one of them is running, and you don't know why, and anything that expects the other plugin to be there is going to get confused. What we really want is to be in a chained configuration, with a .conflist, but neither of these installers knows that. So the solution we implemented in Linkerd was basically to become a CNI plugin manager.
So whenever the linkerd-cni daemon set is installing the CNI plugin, the first thing it does is look at that directory and say: hey, is there any CNI plugin already installed here? If so, we need to rewrite its configuration, turn it into a .conflist, and chain the two plugins together. And then, furthermore, it needs to watch that directory, so that if any CNI plugins are installed after it has been installed, it notices, takes those configurations, rewrites them into the .conflist as well, and combines all the CNI plugins into a single chained list. This works, but it uses inotifywait, and there's a whole bunch of gross bash. In fact, the majority, maybe even the vast majority, of the code in the linkerd-cni plugin installer is dealing with parsing config files, watching directories, and rewriting them, doing all of that plugin management rather than doing its actual job, which is unfortunate. So this works, but how could it be better? These are just some half-baked ideas about how we could make this better; there's a lot here still to figure out. It would be cool if we had some kind of convention for how CNI plugins should interact with each other when they're being installed. If you're a CNI plugin installing yourself and you notice that there are already some plugins there, what should you do? What should the conventions be around chaining to those, or somehow integrating with them, rather than just blindly overriding them? But there are a lot of questions there, like: how do you decide what order to chain those in? Because, of course, CNI plugins do things to the network, and the order they run in is very important. So how do you determine that order? How do you deal with atomicity if you have two plugins installing themselves at the same time? How do you guarantee that that's going to work out?
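To give a flavor of the config surgery the installer has to do, here's a loose Go translation of the core merge step. The real installer does this in bash with a lot more bookkeeping, and the sample configs below are illustrative, not the real Calico or linkerd files.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergeIntoConflist folds a standalone .conf (a single plugin object)
// into a .conflist (a named chain of plugins), appending the new
// plugin to the end of the chain.
func mergeIntoConflist(conflist, conf []byte) ([]byte, error) {
	var list map[string]any
	if err := json.Unmarshal(conflist, &list); err != nil {
		return nil, err
	}
	var plugin map[string]any
	if err := json.Unmarshal(conf, &plugin); err != nil {
		return nil, err
	}
	// In a conflist, cniVersion and name live at the top level, not
	// on each plugin entry, so drop them from the merged-in config.
	delete(plugin, "cniVersion")
	delete(plugin, "name")
	plugins, _ := list["plugins"].([]any)
	list["plugins"] = append(plugins, plugin)
	return json.MarshalIndent(list, "", "  ")
}

// pluginTypes lists the plugin types in a conflist, in chain order.
func pluginTypes(conflist []byte) ([]string, error) {
	var list struct {
		Plugins []struct {
			Type string `json:"type"`
		} `json:"plugins"`
	}
	if err := json.Unmarshal(conflist, &list); err != nil {
		return nil, err
	}
	types := make([]string, 0, len(list.Plugins))
	for _, p := range list.Plugins {
		types = append(types, p.Type)
	}
	return types, nil
}

func main() {
	existing := []byte(`{"cniVersion":"0.4.0","name":"k8s-pod-network","plugins":[{"type":"calico"}]}`)
	incoming := []byte(`{"cniVersion":"0.4.0","name":"linkerd-cni","type":"linkerd-cni"}`)
	merged, err := mergeIntoConflist(existing, incoming)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(merged))
}
```

Even in this toy form you can see where the open questions bite: the append decides the chain order, and nothing here protects two installers from rewriting the same file at once.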
Another potential idea would be to have this handled somehow by the runtime, to basically move this down to the infrastructure rather than having linkerd-cni be a plugin manager. What if there were something analogous to a plugin manager that would take all of those configuration files and know how to combine them in an intelligent way? But again, you still have some of these questions, like the order they're combined in is very important, so you need some way to specify that. Okay. The other issue that I really want to talk about is CNI plugin race conditions. This is something that happens when a node joins the cluster, and that can happen for a variety of reasons. You can have a new node spin up because you're expanding capacity or something like that. It also happens when a node restarts: when a node restarts, it leaves the cluster, and then once it restarts, it joins again. And this is something we saw happening a lot; during node restarts is when we would run into these problems. So what happens when this occurs? Well, first of all, the node comes up. Once it's healthy, it announces to Kubernetes: hey, I'm available, you can schedule pods here now. And Kubernetes says, okay, well, I've got some pods. In particular, I've got the linkerd-cni pod, because it's a daemon set and we need to run one of these on every node, so here you go. And potentially I have some other pods that need to be scheduled, so I'm going to schedule those as well. And there's no ordering constraint on how these two pods come up; they're just both supposed to be scheduled there, so they are. And remember, the job of the CNI plugin pod here is to copy onto the host filesystem the files that make up the CNI plugin, the files responsible for creating the networking rules, the iptables rules.
So if the application pod happens to come up before those files are in place, the container runtime is going to say: hey, there's a new pod being created; let me check to see if there are any CNI plugins I need to run to set up the network for this pod. No, those files aren't here yet. Okay, there's no plugin, so the pod gets created; we're good to go. And so this pod is going to come up in a broken state, because it will have a proxy, but it won't have any rules directing traffic through that proxy. Things are going to be very broken, especially because other pods in the mesh will expect this pod to have a working proxy, since it does have one. That means, for example, that they can send it traffic that is encrypted with mTLS, or traffic that's had a protocol upgrade that only Linkerd is aware of. But because those iptables rules aren't set up on that pod, all that traffic is going to go directly to the application, and the application is not going to know what to do with it, and things will be very broken. So the solution we came up with in Linkerd, and this is something that's still in progress (I think the work is mostly done; it's just a matter of rolling it out), is that we want to validate the routing on pod startup. In other words, when the pod starts up, we want to validate: do the routing rules here make sense? Are they what they're supposed to be? Have things been set up correctly? And if they're not, we need to fail the pod, start over, and try again. This is very reminiscent of something that CNI itself does. CNI has a command called CHECK, I believe, where the container runtime can call into each CNI plugin and say: hey, I know you're already done setting up, but are things still the way you think they should be? Are you still set up correctly? And the CNI plugin can respond: yes, things are good.
Everything looks fine; or no, something's gone wrong and we need to fix things. So this is a very similar idea: having pods check their own routing and saying, is the routing on this pod set up correctly? Yes, it is, we're good to go; or no, it's not, we need to restart this pod and try again. Oh, that slide is very hard to read. Okay, so the way this works is that we have a validator container (sorry, not a validator pod), which is an init container; this box says "validator", it's just very hard to read. That runs in the pod first, before any of the main containers: before the proxy, before the application. And what it does is it sets up a server that listens on port 4140. 4140 is the outbound port of the proxy, which means that when things are running correctly, all of the traffic that leaves the pod is redirected to port 4140 on the proxy. But of course, this runs before the proxy exists, since it's an init container, so that port is available, and the validator can listen there as a test. Then the second thing it does is try to connect to some IP address outside the pod; it doesn't matter what it is. When it does this, if those iptables rules have been set up correctly, they're supposed to redirect that connection back to port 4140, which is normally the proxy, but in this case is the validator itself. So if things are set up correctly, that connection should be redirected right back, and we can detect that. We can say: hey, we connected to our own listener; that means the routing rules are set up correctly; we're good to go. The init container can terminate successfully, and then the pod can spin up as normal. If this doesn't happen, if that connection just goes through to its original destination and nothing connects to 4140, we know that something's gone wrong.
We know that the iptables rules have not been set up correctly for this pod, so we need to fail. And when that validator container fails, the whole pod fails startup; Kubernetes will kill it and try to create a new one. Hopefully, by the time that pod comes up a second time, the CNI plugin will have had time to be installed correctly, and everything will work. So you get into this kind of eventually consistent state where, after some number of restarts, things hopefully work as expected. Again, this is something that works; it solves the problem. But it's a little bit gross, a little bit messy. It has this eventually consistent nature where pods may have to restart several times if they're scheduled on a node that is in the process of coming up, and ideally there would be a cleaner solution. What we would really love to see is some way to add constraints on the ordering of these pods, so that we could know the CNI plugin has been installed and all the networking is set up before we try to schedule any other pods on that node. There could be some kind of affordance for that in Kubernetes, to say, hey, these are CNI setup jobs or pods, and we need to let those run first; or, more generally, some other way of specifying pod ordering constraints, so that we can guarantee that certain pods run to completion before other pods are scheduled. But as far as I know, there isn't anything quite like this. Okay, so to recap what we talked about in this presentation: I gave a little background on how service meshes use CNI plugins to set up their iptables rules so that they can intercept traffic going in and out of the pod. And we talked about the two big issues we encountered: CNI plugin conflicts, when there are multiple CNI plugins installed and they override each other's config.
And CNI plugin race conditions, where when a node first joins the cluster, or when it restarts, you get into a situation where pods can get scheduled before the CNI setup is done, and therefore those pods don't have the right routing rules. If you're interested in learning more about the nitty gritty of how linkerd-init works, how CNI plugins work, and really diving deep, there's a very good Service Mesh Academy workshop that you can sign up for. It's completely free; I think the next one runs from November 11th to 17th. It's very, very hands-on, so it'll help you get into some of the details that I glossed over in this talk. Buoyant also offers fully managed Linkerd. So if you're interested in running Linkerd in production and you want some of that administrative burden alleviated, this should be of interest to you: Buoyant Cloud can handle things like automatic upgrades and other administrative tasks. If that's interesting, you can find the Buoyant booth, or you can book a demo online. Other than that, I am going to be around to answer questions, and you can also find me at the Linkerd booth in the Project Pavilion. If you want to come talk about Linkerd or service meshes, I would be more than happy to. And I think we have some time now for questions. Yes, we do. We have plenty of time for questions. Hi, you mentioned the solution you've got is a little janky, so this is maybe just possible additional jank. That test that goes out and then checks that the local redirection is working: would that miss the case where the iptables inbound redirection is not working? Is that possible? Yeah, that's right. It only tests the outbound direction. But it's there as a proxy signal, right? It'll tell us whether the rules have run at all, and the rules are intended to set up both inbound and outbound.
So if we see that outbound is working, we can reasonably assume that inbound is also working. It seems like a safe assumption. Thanks; that was a great talk. Thank you. The issue of, I guess, all your critical daemon sets not being ready before user applications come up: I remember reading a lot about it when I was looking into the storage stuff, because they have a very similar problem there. There was this GitHub issue and a KEP floating around about tainting nodes and all that kind of stuff. And I was curious, did you think about that kind of idea? It seems like we just need a general feature to say: these things are really critical for the node; don't let me schedule stuff until they come up. Yeah, yeah, I mean, it's true that there are a lot of things this kind of applies to. This was the best thing that we could find that would solve this, but I agree, having this solved at the infrastructure layer would just be so much nicer. Going once. Going twice. Let's have a round of applause for Alex.