Thanks, everybody, for taking the time to come to our talk about multi-network, a new level of cluster multi-tenancy. Today we're going to look at, first of all, what is Kubernetes-native multi-networking? Then we're going to dig into some of the use cases for multi-networking. Maciej is going to take you through a user journey, where we get down to brass tacks and show you exactly what it looks like, from a user perspective, to set up multi-networking on some pods. We're going to talk a little bit about some of the ancillary pieces involved, as well as some of the history behind it, like what predates Kubernetes-native multi-networking. And then we're going to invite you to get involved.

I'm Doug Smith. I'm a Red Hatter and an OpenShift engineer. I've been involved in multi-networking with Kubernetes for some time, and I'm particularly interested in telco and performance networking use cases. And I'm joined today by Maciej.

Hi, my name is Maciej, and I'm a software engineer at Google working on the GKE Anthos networking team. So what exactly is multi-networking? Today, if you spin up Kubernetes, spin up a pod, exec into that pod, and run `ip a`, you're going to see a pod with a single network interface aside from the loopback: eth0. You see that up at the top, and then down at the bottom you see that the pod, via eth0, has connectivity to all of the other pods in that network. That's the Kubernetes promise you get, and then you would limit that connectivity using network policy. But you have to ask yourself: what if that single interface just isn't enough? What Kubernetes-native multi-networking does, at the most basic level, is allow you to spin up a pod with multiple network interfaces, so that you get connectivity to exactly the places you want it to go, in an isolated or performant fashion. We'll get into all of the different ways you might use that.

You might ask yourself: why are we working on this now? We think it's about time to talk about adding multi-networking to Kubernetes as a whole. As a project, Kubernetes is very stable and very well adopted. We've crossed the chasm of adoption, and we think it's the right time to start picking up more complex topics in the project as a whole. That's why we're committing time to this, with the very important caveat that any complexity introduced to the Kubernetes project is basically a tax paid by all of us. So we need to be careful about how far we push this, so that we can live with Kubernetes in the future.

Now we want to show you a few use cases that we came up with in our group for the sorts of situations we want to handle with multi-networking. The first one is the flagship that got all of you interested in this talk: multi-tenancy. This is multi-tenancy at the networking level, and depending on how you implement it, you can even achieve multi-tenancy at the physical layer, with your workloads connected to different physical networks. In this case we want to remove any notion of primary or secondary networks; all of those networks are equal. We envision it such that all of those networks have the same capabilities that today's pods have, so there is no differentiation between any of the networks.
Another one of my favorite use cases is network isolation. There are a number of reasons you might want network isolation; for me, the two that usually come up first are performance and security, and maybe regulatory compliance. I actually got my start in Kubernetes from an open source telephony standpoint. When I first started, prior to Kubernetes 1.0, I would mostly just use host networking for my pods, which totally eliminated the possibility of the kind of multi-tenant situation Maciej was talking about. Network isolation is really important from a telephony standpoint. Typically you would have a management network, which is how you interact with your infrastructure; a signaling network, which you would use, for example, to make your phone ring — you get the signal that says go ring this endpoint; and lastly a media network, which carries the actual sound of your phone call. There's one particularly important property of a media-type network: it carries UDP traffic, so you want to isolate it to avoid UDP congestion. So network isolation, for me, is a personal and important use case.

Another use case we would like to cover is the ability to connect to any of the existing legacy networks we have in our environments — and not only legacy networks, but also, say, a VPC in a cloud with some VMs running in it — connecting them directly to our workloads, with no indirection: no proxies, nothing in between.

And just like you might have existing networks you want to connect to, you might also have existing workloads that you want to run in your cloud native environment as you make the journey to cloud native. You might have workloads that are still virtual machines, and they might be virtual machines for a number of reasons. This might be vendor software you've already invested in that ships to you as a virtual machine, and you want to bring it into your cloud native environment. Or it may be that, because of resource constraints, you have legacy software that still runs this way, and it's going to take you time to refactor it and get it fully containerized. If you're a virtual machine user, hot-plugging an interface is something you would absolutely take for granted: you can go into your management software and say, give me another interface on this machine and point it at this network. We wanted to make sure that is a first-class consideration in multi-networking. One example I think of: say you have a BGP peering situation, your workload is up and running, and instead of shutting that machine or pod down, you would hot-plug an interface, because BGP peering can take a long time to establish. If you shut that workload down entirely, you lose your state. So in a more stateful scenario, you might want to be able to hot-plug an interface.

And then there's another use case where we would like, for example, to introduce some sort of QoS tiers into our cluster, where we connect different pods to different types of connections and configure those connections on a per-pod level.
But at the same time, being able to represent the whole network as one item is very important for us, and it might be useful going forward: what if we then want to be able to apply a network policy to that specific network? So we'd like to apply it once for the pod network, but then differentiate that network in various ways. In this case I'm showing an example of differentiating bandwidth limits on each of those connections for each pod, but it boils down to being able to parameterize any single connection to a pod the way you want it and the way your implementation handles it.

Another use case we are looking at is utilizing your performance hardware. In your on-premise situations, or maybe even in bare metal cloud instances, you may have performance hardware that you want to utilize. This is a particular challenge in this space, because something we need to account for is having workloads scheduled to the proper node that has the resources available. Something we of course love about Kubernetes is that it's an awesome scheduler: it does a good job of getting your workloads to where they need to run, just in time. But when you have resource constraints as well, you need to give that scheduler awareness of which of these hardware resources are available. Even today, during the keynote talking about GPUs, there was a mention of dynamic resource allocation, which comes from the history of something we call device plugins — also originally for GPUs. In the networking space we also have hardware considerations, so we wanted to look at this. This diagram demonstrates SR-IOV in particular, where you have a physical function, which is the physical element of a particular hardware device, and then your VF, the virtual function, which is a virtualized slice of that resource. We wanted to make sure this is accounted for, because with multi-networking you often want to be able to utilize that investment.

So now we want to walk you through a user journey that we envision. A bit of a disclaimer: this is still a work in progress and under discussion with the rest of the community, but this is what we came up with over the last few months in our multi-network project within SIG Network, and what we intend to propose to the rest of the community. The first step is to introduce an object. We need some sort of generic handle to represent a network — something we basically take for granted when we create a pod today, where connecting the pod to a network is just handled by our CNI. Here we want an explicit representation of that network and how it is handled. What we want to introduce is the PodNetwork object, which is very basic and has implementation-agnostic fields that implementations can then pick up and implement. The first field is a provider: who implements this specific PodNetwork. This gives us the ability to have multiple implementers in a single cluster — that's another capability we want to introduce. And lastly, we want a parameters reference. That's the gateway for every implementation out there to introduce their own parameters to the PodNetwork, and then do whatever they want, however they want to implement it.
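As a rough sketch — and this is purely illustrative, since the actual API shape is still being discussed in the community — a PodNetwork along these lines might look something like the following. The API group, version, field names, and the `CoolNetworkParams` object are all hypothetical:

```yaml
# A purely illustrative sketch of a PodNetwork object; the group, version,
# and field names are hypothetical, not the final API from the KEP.
apiVersion: networking.k8s.io/v1alpha1
kind: PodNetwork
metadata:
  name: datacenter-east          # cluster-scoped, so no namespace
spec:
  # Who implements this network; a provider field per object is what
  # allows multiple implementers to coexist in a single cluster.
  provider: cool-cni.example.com
  # The gateway for implementation-specific configuration: a reference to
  # an object defined by the provider, e.g. a hypothetical CRD.
  parametersRef:
    apiGroup: cool-cni.example.com
    kind: CoolNetworkParams
    name: datacenter-east-params
```

The point of the `parametersRef` shape is that the core object stays implementation agnostic: the provider named in the spec reads its own parameters object, so the core API never has to model implementation-specific configuration.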
The next stage is some slight changes to the pod. This is a bit controversial, but I don't think it's something we can get away without. We would like to add a pod-level field with a list of PodNetworks to attach to. Those are the objects I mentioned before, and here we would have an explicit list of the PodNetworks this specific pod has to attach to. One important thing: the list is optional. We want to make sure this is a capability that is completely backward compatible and doesn't require clusters that don't care about multi-networking to worry about it at all. So we want to introduce the notion of a default PodNetwork. You can think of the default PodNetwork like the default namespace in today's clusters: if you don't specify any networks, your pod attaches to the default PodNetwork. That's basically the same concept we have today — when you bring up a pod, it just gets some network, and that's the default PodNetwork. That's how we're trying to ensure backward compatibility. The default PodNetwork will be auto-populated in the pod, similar to how a node name is populated by the scheduler when you don't specify one.

Then we want to get some status out of this whole thing. What we want to do is leverage the podIPs field we have in the pod today. That field currently has a restriction: it allows only two IPs to be listed, and each has to be from a different IP family, IPv4 and IPv6. We'd like to add a new key to the entries in that list, podNetwork, so that each IP is assigned to a specific PodNetwork. We still want to preserve backward compatibility, so we would slightly relax the requirement: still at most two IPs, one per IP family, but now per PodNetwork. That gives us the same backward compatibility, with the ability to expand to multiple PodNetworks on a single pod.

Lastly, we want to give some status to the PodNetwork itself. We want to leverage the standard conditions on the object to represent its state. We want a Ready condition, which is a very basic signal of whether the object is cluster-wide ready for use. Additionally, we want to introduce a condition called ParamsReady. This is an optional, implementation-specific condition that gives the implementation the ability to control the readiness of the PodNetwork indirectly through that condition.
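Put together, a minimal sketch of what this could look like on a pod — again with hypothetical field names like `networks` and `podNetwork`, since none of this is finalized:

```yaml
# Hypothetical sketch: a pod attaching to two PodNetworks. The `networks`
# field and the `podNetwork` key under podIPs are illustrative, not final.
apiVersion: v1
kind: Pod
metadata:
  name: multi-net-pod
spec:
  networks:                      # optional; if omitted, the pod is attached
  - podNetworkName: default      # to the default PodNetwork automatically
  - podNetworkName: datacenter-east
  containers:
  - name: app
    image: nginx
status:
  podIPs:                        # at most one IP per family, per PodNetwork
  - ip: 10.0.1.5
    podNetwork: default
  - ip: 192.168.50.8
    podNetwork: datacenter-east
---
# Status on the PodNetwork itself, using standard conditions.
apiVersion: networking.k8s.io/v1alpha1
kind: PodNetwork
metadata:
  name: datacenter-east
status:
  conditions:
  - type: Ready                  # basic cluster-wide readiness
    status: "True"
  - type: ParamsReady            # optional, implementation-specific
    status: "True"
```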
Alright, and I'd like to talk a little bit about some implementation specifics. As Maciej mentioned, this specification is totally implementation agnostic, so you can take the structure we're talking about and wire it up any way you'd like, with any kind of implementation you'd like. In this particular case, CNI itself is an implementation detail, right? In some ways we might take it for granted in Kubernetes that CNI is how you're going to plug your network in, and I think in a number of implementations you will still see CNI involved, generally kicked off by your runtime.

Before I move on to the next slide, I want to give a very, very short history lesson, which is that the conversation about multi-networking in Kubernetes has been going on for a long time. Back at KubeCon 2017 in Austin, Texas, we got a good group of people together to talk about: hey, what are we going to do about having multi-homed pods in Kubernetes? For me, it's a really great example of how KubeCon brings people together and gets these conversations going. We decided we were going to form a working group, figure out a specification for this, and eventually look for a common home for this particular work. So, prior to the effort we're presenting today, there has been work by the Network Plumbing Working Group to create a custom resource definition called the NetworkAttachmentDefinition. You might be familiar with some of the implementations; I'm a maintainer of Multus CNI, so that's the one I think of first. It gives you a way to express how you're going to attach multiple interfaces to your pod. On the left-hand side here, what you're seeing is how this would be set up with the NetworkAttachmentDefinition, potentially with Multus CNI: you have a custom resource — that is, an additional extension to the API that you create as a user — and then you say you want to attach it using an annotation, as opposed to the right-hand side, where we're seeing a native object. The PodNetwork would be a native component in Kubernetes, like a CronJob or a NetworkPolicy, for example; you wouldn't create a custom resource for it. Now, something that's interesting to me here is that because the work on the NetworkAttachmentDefinition was done out of tree, it has, number one, a reliance on CNI — and CNI configurations are written in JSON. So something you're going to notice here is that you end up with a YAML file with JSON packed in it.
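To make the left-hand side concrete, here's a minimal sketch of the annotation-based approach as it exists today with the NetworkAttachmentDefinition; the macvlan configuration is just an arbitrary example, and the names are placeholders:

```yaml
# A NetworkAttachmentDefinition: note that spec.config is a JSON string
# (a CNI configuration) packed inside the YAML.
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-conf
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.1.0/24"
      }
    }'
---
# A pod requests the extra interface via an annotation rather than a
# first-class field in the pod spec.
apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: macvlan-conf
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
```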
And if you're a developer walking through an object like this, and then you have to parse a different format, that's going to rub you the wrong way for sure. So something that's going to be quite an advantage, from a developer perspective, of Kubernetes-native multi-networking is that you'll just use client-go and walk through the object like you would any other object. You won't have to pull in a library to parse it, or, God forbid, roll your own way to parse it. That's a major advantage, and it should make your life a lot better when you're writing something like a controller or an operator, to have a richer way to manage networking within your clusters.

So the question may be: what exactly will happen with Multus CNI? Well, something that's kind of great about this specification is that, one, if you don't care about it, you don't actually have to use it. It doesn't preclude you from doing everything in a totally backward compatible way, so it doesn't preclude an investment you might have in the NetworkAttachmentDefinition. That being said, as the previous slide illustrates, I think there are a lot of advantages we can get from having this be in-tree. One of the things we haven't touched on in a big way yet is services, right? That's a real challenge. So I'm working with the Network Plumbing Working Group to try to figure out: what's our North Star? How are we going to move forward as a group? And part of that North Star story is what happens with our container runtimes.

We've been lucky enough to have some contributions from Mike Zappa and Peter White from Microsoft, and one of the things being looked at is what needs to happen with our container runtimes. There's a lot going on in this slide — it shows a lot of the lifecycle of what happens in the CRI. Take a look at the left-hand side, with the red boxes: those are the new fields we need to pass both to and from the container runtime, which might also execute CNI. For example, in a Multus CNI scenario today, we have to deal with some inefficiencies because we don't have this: Multus has to query the Kubernetes API to figure things out, where this could happen natively, and we could get back the kind of information we need for things like the status Maciej was showing. This is just a proposal — disclaimer — but I'm really looking forward to seeing what happens next with the container runtimes.

Alright, and now I'm going to ask you all to get involved. Before we do that, I just want to show you where we are today. As you've seen, we showed you some use cases — those are ones we've defined, and we have a few more of them. Out of those use cases, we managed to define requirements, and there turned out to be quite a few of them; as a whole, this is becoming a big project to handle. So what we did is divide the requirements into phases. We currently have four phases, and we have defined the first one: the API we want to create, how it interacts with the pod, and some basic interactions with the scheduler. For the next Kubernetes release, we're trying to get that merged and create an implementation for it in upstream Kubernetes. The following phases will build on top of that. We want to add RBAC to this whole thing, and enhance the scheduler to handle selective availability of a PodNetwork across nodes. The next phase, which will be quite tricky, is where we finally tackle services and network policies and how we handle those with multi-networking. And lastly, we want to look at extended capabilities, like hot-plug ability.

So please help us define all of this. We have a Slack channel in SIG Network for multi-network — all are welcome, so please help us define this whole big project, make it more robust, and make it work for every one of us. We meet weekly on Wednesdays, in the Pacific time zone. There is a meeting doc; this QR code will get you to it, and it has all the other links related to multi-networking. And lastly, this is our PR for the KEP. There you'll see everything we talked about in this talk, and more. We need all of your feedback to make this much better. Thank you very much — I think that's all I wanted to talk about. I think we have a few more minutes, so if there are any questions, there's a mic out there.

Is it on? It's not on. Can you...? Yes, hello. Thank you for this presentation. I'm particularly interested in the use case of connecting to other applications that are running on-premise from within pods. Our application has a distributed nature, with many components running on different platforms and different architectures. They need the ability to cross-connect freely — basically to operate on an open network.
Does this have any provisions for such connectivity — basically from outside the cluster to the pods inside the cluster, without needing to define services or having some separate agent process to facilitate that?

Yeah, definitely. Keep in mind this is an API definition, so it all boils down to how your implementation handles the connectivity. The API we presented here does not prescribe the implementation itself — that's something to keep in mind, right? So your use case can definitely be handled, probably by the implementation you currently use, just adapted to the new APIs. So basically you're talking about a custom network plugin that implements these new APIs? Yeah, exactly. Okay, I see. Thank you. Sure. Can you pass the mic to the next person? Thank you.

So I know you said that the future of what this looks like in container runtimes is kind of TBD. What do you think it looks like as far as looking at a host and the containers and processes running on it? Today, when I look at that, I have a network namespace for a process in the container. What does that look like in the future with multi-network?

Sure. I think that will largely remain the same. In terms of the container architecture of the pod, you're still going to see that infra sandbox, and you're still going to see your network namespace — I think that should stay the same. But what you will likely see is that if you look at that sandbox container, it will have multiple networks in it, shared across the containers within that pod. It's like — well, actually, now I'm questioning myself — but I think it's similar to storage: volume mounts can be shared across the containers in a pod, though they can also be isolated. So you'd have a pod with two processes running in two containers, and both of those containers should see the network interfaces at that layer. What I'm hopeful we wind up seeing is the container runtimes themselves getting more capability to do some of the work that, say, Multus CNI does today — the ability to make multiple invocations of CNI. I'm kind of hoping to see something like that. Is that helpful?

Yeah. So it sounds like, in terms of what I'm actually seeing on the host, I'll still have one process — well, maybe multiple processes — with one namespace, but there will be multiple interfaces in that namespace. You got it, yeah, absolutely. You're going to have that one process per container, maybe multiple containers per pod, and then that infra sandbox — you got it. Thank you.

Great presentation. How does this interact with the Gateway API? That's definitely one of the integration points we have. Similar to what we're thinking for services and network policies, Gateway will be a next step. There are some ideas, in my head at least, on how that could look. One idea — just throwing it out there — is that a Gateway or Service could be assigned per network, which keeps it very simple: a single Gateway or Service is assigned to one network, and that's it. Today, everything is effectively assigned to the default pod network; in the future, we could have those assigned to various different PodNetworks.

Is PodNetwork a cluster-level resource or namespaced? It's a cluster-level resource.
Yeah, that's what we're thinking. Yes. Would it be possible to have per-namespace PodNetworks? Or is that beyond the use cases being considered? I think we would have to talk in detail about why, because we consider this a core object, like Node, right? Why would you want a PodNetwork at the namespace level? There would need to be some use case and discussion about why you would want that, I would assume. But right now we're thinking it's a cluster-wide object. Thank you.

Would there be a limitation on the number of interfaces each pod can have? Could you repeat the question, sorry? Would there be a limitation — how many interfaces, maximum, can we have? Okay, I see. I don't think we would impose any limitation. It will be up to the implementation, and probably the node, what they can handle, right? And again, keep in mind that depending on what's in your path when configuring this, maybe the CRI will be in your path and will have some limitations. But from the point of view of our API, there is no limit. As of today, we don't impose any limits, no. Thank you.

This is actually somewhat related to the network namespace question, from a different perspective. I've used the host-device CNI plugin to move a device into the namespace of a pod. This has one interesting limitation, which is that a device on Linux can be in exactly one namespace at a time — it moves out of the host. This is a minor issue unless you actually truly want that device in multiple namespaces, the way host networking gives you, because multiple pods are truly in the host network namespace. And the other issue is that pods correspond to network namespaces — you get exactly one. So is there any idea that you could join an existing network namespace, or anything in that space? Or is this truly separate from that? This is with the CNI plugin — with Multus — today.

Great question. Host-device CNI is near and dear to me; I know it well and I have used it. That's an interesting challenge, and one of my own challenges with it is also scheduler-wise, right? The example I would give is: okay, a telco RAN lab, and you have a couple of lab machines that have a USB device to emulate your radio network, and you want to move that device into your pod, so you use host-device CNI for it. But how do you get that pod onto the right box? That has been my challenge. Generally, the answer most people give for that is that you would want to use a resource that is divisible, like SR-IOV. The problem is, for what I was trying to build, SR-IOV in theory might still work, but that's why I was trying to actually use host-device, I suppose. Do it with a VF, yeah. And I totally feel your pain there, because SR-IOV is a pain, and host-device can be an easy answer for "give me this device." It might still work, and that's what I'm going to try to do. So unfortunately I don't have an awesome answer for you, but I love the question for sure.

Last question. Great talk, guys. Two questions. First, you showed the PodNetwork object as a v1 API — would it start as v1alpha? Yeah, definitely. We just showed v1. Okay. Yeah. We showed what we imagine the future v1 would be, but it will definitely have to start out as a v1alpha. Definitely. Yeah.
The second question is: would this introduce a v2 pod spec, or would it stay compliant within the existing one? There was some discussion yesterday about that — someone raised it. I don't think that's possible. We don't know yet. I don't think we would want to do it; to be honest, I don't know. This is something that, as I mentioned, has to be discussed with the community, and we'll see where we go from there. Yeah, I think it's a great path and a great discussion. Thank you. All right, thanks, everyone. If you want to talk more, just grab us after this talk, or anytime. Yep. Thank you all for coming.