Hello, everyone, thank you for coming. This is the last session of the day, so I appreciate it even more that you are still here. The presentation asks whether this can only be done in the cloud and promises networking made easy, so we will be talking about container networking. I'm Mateusz Kowalski. I also put Ben here in the credits because he's my teammate, and most of the stuff I'm talking about here is not only my work; it's never one person's work, the whole team helps. Everything in this talk is built around bare-metal networking, and around why people still run things on bare metal at all. At Red Hat I have been working on different projects with different kinds of customers: academia, banking, telco. And benchmarking, because when you have a server and you want to know its performance in numbers, you need to run on the metal, without any overhead; not even to mention putting it in the cloud and then benchmarking, that would make no sense.

This is a very generic slide. It shows a lot of stuff, but the key point I want you to take from it is that when we talk about container networking, we may mean a lot of different things. You may have a Kubernetes cluster, this cluster has nodes, it has some workload and so on, and then you may have many Kubernetes clusters. So when we talk about networking, we can talk about cluster-to-cluster communication, or about your end customer somewhere out there on the internet talking to your cluster. But you can also go inside the cluster, and then you have use cases like nodes in the same cluster talking to each other, pods talking to other pods, or, to make it even more complicated, pods talking to nodes or pods talking to services. All kinds of combinations. From now on, after this slide, we are not talking about anything that is external to one single cluster. We will be talking about how you communicate with your cluster and how you communicate within it, but we don't care whether you have many clusters that you want to talk to each other. So any federation, that kind of stuff, is not here; we are focusing on one cluster. We will also not go as deep as OVS and that level; it's on this slide just to show you that there are multiple layers, and we could be talking about one particular NIC, but we won't. So it's not a kernel-level networking presentation either. And, of course, this slide can also be read as: depending on where you run your cluster, this may be more or less interesting. If you just buy, you know, AWS container-as-a-service or whatever they call it, it's super boring, there is nothing interesting there. If you run stuff on your own metal in a data center, then the fun begins. And the fun begins when we start looking at your cluster from a very high level. You have a cluster, let's say it consists of only three nodes. It's a very small cluster, but, you know, one node is not a cluster; one node is my laptop. Let's assume that to call something a cluster it must be three nodes, so that it's really distributed.
And the very first question is: we install Kubernetes, so we get an API, we can run pods, we can communicate with those pods, all that kind of stuff. But at some point you hit this very basic question. I have three nodes, and they all run the Kubernetes API, because this is how Kubernetes works. So how do I talk to this API? I have three servers, so they have at least three different IPs. What do I do? Do I need to pick one server to talk to? That doesn't scale, because you will add servers, you will remove servers; as an admin you won't be keeping a list of servers you need to talk to. So the very first thing to make your life as an API user easier is to introduce the concept of an API virtual IP: one single IP, with some magic behind it that makes the traffic always land on some node. You always talk to one single IP, no matter how many nodes you have, no matter which nodes are up or down; you want to be able to always talk to one address, later even to a DNS name, and always land in your cluster. So we get the API VIP for this.

But that is only the admin persona. What happens from the perspective of the user? Because in the end you run some application, you want to make money, so probably you run a web store or whatever, and your customers want to talk to this application, and they have the same problem. You run a pod, this pod gets an IP address; probably you know that you create a Service and then expose the pod via that Service, all this kind of stuff. But still, where does your service run? Your Service gets an IP from a private pool, the service network, and so on. This is still not an IP address you give to your customers, because they couldn't even connect to it. So again, we introduce one global IP address that you give to your customers, with the guarantee that whenever they talk to this IP address, they land in your cluster. Of course, someone could ask here: but we could just set the load balancer IP as a property of the Service, and then what? And I will ask you back: but you are not in AWS, you are in your basement, you know, in this building. So what is your load balancer there? This is something that we introduce here, and the technologies for it are keepalived and HAProxy, which I will tell you a bit more about in a moment, including why we use both. Why am I talking separately about the API VIP and the Ingress VIP? This is a bit of an implementation detail, but I think it's worth mentioning, and it's because Kubernetes natively, and OpenShift, because we are Red Hatters, so of course I would like to always say OpenShift, but sometimes they force me to say Kubernetes, it's the same stuff anyway, Kubernetes, from the perspective of the API, doesn't have a concept of load-balancing the API. All that vanilla Kubernetes tells you is that you have the API running on every server on a specific port, thank you very much; the rest is up to you, or you can buy a service from someone who manages that for you. So we need to do something there. With Ingress we are a bit smarter, because as Red Hat we already have the Ingress Operator and concepts like that, I'm not sure who is familiar with it and who isn't, but in general this is the first load balancer, the first entry point that you get, so we have one less problem to solve there.
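Just to make the admin-facing side concrete, here is a hedged sketch of what this ends up meaning in practice; the cluster name, the domain and the address are invented for illustration. The point is that the kubeconfig points at one stable name that resolves to the API VIP, never at an individual control-plane node:

```yaml
# Illustrative kubeconfig fragment (names and addresses are made up):
# the admin always points at one stable name/VIP, not at a particular node.
apiVersion: v1
kind: Config
clusters:
- name: my-cluster
  cluster:
    # api.cluster.example.com resolves to the API VIP, e.g. 192.0.2.100,
    # regardless of which control-plane node currently holds it.
    server: https://api.cluster.example.com:6443
contexts:
- name: admin
  context:
    cluster: my-cluster
    user: admin
current-context: admin
users:
- name: admin
  user: {}
```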
So with the API we need to solve two problems: how to expose the API, and then how to load-balance it. With Ingress we only need to solve the problem of how to expose it, because the load balancing is already implemented as pods, and that's why we plug it in like this. Also, and this is purely about traffic distribution: you don't run the API on every node in your cluster. If you are doing a bigger architecture, you have separated the control plane from the workers and so on, and then you suddenly realize that when you talk to the API you talk to one set of nodes, and when you talk to your application you talk to a different set of nodes. So this has implications on that front too.

And there is something else we started to realize at the moment we went from having the API running on those three nodes to saying: okay, let's introduce a load balancer, and this load balancer will just send traffic to one of those three IPs, depending on what's up and what's down. It sounds as easy as it could be, but then you start to realize there are some problems, and you get this one particular customer who tells you, for example, that they don't want this IP address to float around like crazy. In fact, they would like this IP address to never move to a different node unless the data center burns down, that kind of thing. So you need to start thinking about the scenarios in which the IP address can float. I won't go deep into the keepalived architecture, but in general how it works is that it exposes an IP address which is then announced, in our scenario, within one layer-2 broadcast domain; for people with a networking background, VRRP is the protocol. In general it's a protocol with which nodes in one subnet can agree on who holds the crown, the IP address. It's a well-defined protocol, not something someone just invented; it has been there for, like, 50 years probably, maybe even 100 or 200 depending on whom you ask. But the deal is that this protocol is very simple: as soon as someone stops responding, you assume they died. And in general that sounds okay: if your server died, it won't reply to this protocol, it won't participate anymore, which means we should remove this server from the pool. But this is Kubernetes, and there is so much more going on, because what happens in the scenario where your server didn't die, it's just that for some reason kube-apiserver stopped responding for a few seconds? It didn't even die, the process is still there, but it got hiccups and it's not replying to a few packets. Do we want a scenario in which such a hiccup causes the IP to float? Because that has a consequence for the customer, and it's probably not written on the slide, so I should explain it. Whenever we float the IP address from one node to another, we are terminating TCP sessions, and that may sound easy, but it has consequences. Most of humanity doesn't care about establishing a TCP session once again, but there are people who really care about this, and for them, once they establish a session, they want this session to last forever.

And now we get to the part about why we put HAProxy in. Our topology, if I go back once again, is that we have three nodes which run the API. We put HAProxy in front, and then we plug keepalived into HAProxy. We don't point keepalived directly at the API server running there; instead we benefit from HAProxy distributing the traffic. Why? Because HAProxy dies less often than kube-apiserver, and at the same time it can distribute traffic to some other node.
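To make that layering concrete, here is a rough sketch of the two pieces of configuration. This is not our exact shipped config; interface names, addresses, ports and timings are invented, but the shape is the point: keepalived only watches HAProxy, and HAProxy is the one that health-checks every kube-apiserver.

```
# keepalived.conf (sketch) -- owns the API VIP, but only tracks HAProxy
vrrp_script chk_haproxy {
    # Deliberately simplistic check: "is HAProxy alive on this node?"
    # The real checks are smarter, but the idea is the same.
    script "/usr/bin/pidof haproxy"
    interval 2
    fall 2
    rise 2
}

vrrp_instance API {
    state BACKUP
    interface ens3
    virtual_router_id 51
    # Higher priority wins the VIP; this is also the knob that later lets the
    # bootstrap machine hold the VIP for as long as it is alive.
    priority 40
    advert_int 1
    virtual_ipaddress {
        192.0.2.100/24        # the API VIP
    }
    track_script {
        chk_haproxy
    }
}

# haproxy.cfg (sketch) -- runs next to keepalived, checks every kube-apiserver
frontend kube-api
    bind *:9445
    mode tcp
    default_backend kube-apiservers

backend kube-apiservers
    mode tcp
    # Aggressive health checks: a node whose kube-apiserver hiccups is simply
    # skipped for a moment; the VIP itself never has to move.
    option httpchk GET /readyz
    server master-0 192.0.2.10:6443 check check-ssl verify none inter 1s fall 2 rise 3
    server master-1 192.0.2.11:6443 check check-ssl verify none inter 1s fall 2 rise 3
    server master-2 192.0.2.12:6443 check check-ssl verify none inter 1s fall 2 rise 3
```

The design choice is that keepalived never looks at kube-apiserver at all: whether this node keeps the VIP depends only on HAProxy, and which apiserver actually answers a given request is HAProxy's problem.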
So now, if I give you again the scenario in which my kube-apiserver gets hiccups: keepalived is not going to notice, because keepalived only checks HAProxy. HAProxy will be up all the time, and if it goes down, then the world is really going down. But HAProxy will notice that kube-apiserver went down, and at that moment it's not going to do any floating; for the packets that arrive, it will simply send them to a different node, and from the perspective of who holds the IP, you will never notice. So in this scenario, for your IP to float, you need kube-apiserver down and HAProxy on that node down at the same time, which, from our analysis, means that your node is really dying. If those two things go down at the same time, you had better take this node out of your cluster, because nothing good is happening there. And yes, of course, there is the detail that you need to tune your health checks and timeouts smartly; you cannot just run the defaults. You need HAProxy to notice a kube-apiserver death faster than keepalived would, but you can tune it in a very nice way so that you really get zero floating of the IP addresses unless a node is dying.

I also have something about the installation, but I'm not sure it's something we should discuss in depth. How many people are aware of something like the bootstrap-in-place concept? Yeah, you are the person who implemented it, so you don't count. Okay, so let me tell you in two sentences the story of how you install a Kubernetes cluster. If you want to install Kubernetes on three servers, you need to have four machines, because you first bootstrap a fake Kubernetes cluster, consisting of one node, on the machine which is called the bootstrap, and that then spawns the real cluster on your three machines. After those three machines are confirmed to have Kubernetes running successfully, you can do whatever you want with that one extra machine, but the point is that in order to start the installation on three nodes, you need four. It can be your laptop, it can be something like that, it doesn't matter, but you cannot just take three nodes and say "I would like to have Kubernetes on these three nodes"; you need someone to kick it off from the outside. The bootstrap-in-place concept addresses this issue, but this presentation is not really about that. I only mention it because if someone here uses bootstrap-in-place, this slide wouldn't be valid for them, since it assumes you have this separate bootstrap VM; if you don't do bootstrap-in-place, there is no problem and we can continue with this story.

So, I was telling you about this API VIP, that you have a number of nodes and you need to somehow manage how this works, and now I'm bringing in a fourth node. So you could ask me: how do we solve the issue that during installation, when the nodes are coming up and down and restarting and so on, this API virtual IP always targets the correct machine? Which would mean that during the installation it targets the bootstrap machine the whole time, and after the installation it targets only those three nodes. And the answer is as simple as priorities. Keepalived is smart enough, in the way we configure it, that we can state it almost in plain English: as long as the bootstrap VM is up, it should hold the virtual IP. It's as simple as that, and in keepalived it's literally two lines of code. You could now ask, if we go deeper into the installation process: but kube-apiserver on this bootstrap VM also sometimes restarts, and whatnot.
So, don't you have a problem that your API IP escapes to those three nodes while they are still in the process of installing? The answer is: yes, it happens, I think once or twice during the installation from what we benchmarked. But it's not a problem, because it's only transient, for a few seconds; as soon as the bootstrap API is back up, the IP goes back again. So it's a problem that exists on paper; in reality it's not a problem. On this slide, and I don't have a pointer so you'll have to bear with me, you see that the API VIP points only to the bootstrap VM, and post-installation it moves to the control plane nodes, times three; exactly what I explained. And at the same time, once the installation is over, we add the Ingress VIP, the one I told you about for your application: the API VIP is for admins, the Ingress VIP is for users. Once you have your cluster up and running, you get workers and you start introducing the Ingress VIP. I won't say much more about this, because it's exactly the same thing, only easier, since there is no HAProxy in between. Same concept: you have worker nodes, however many of them you want, and you do the same thing.

Now, some limitations of this. Scaling is of course the first one, because you've noticed that the virtual IP always sits on a single node at a time. So if you start getting a lot of HTTP requests to your kube-apiserver, they are all going to go through one single node, and this is something that can become a potential issue. For this, our current solution is that you can plug in your external load balancer. So, not to name vendors, if you have some fancy, very expensive load-balancing appliance, you can plug that in, and then people will not be going through keepalived but through your own balancer. This solution doesn't work today for internal cluster traffic: whenever your cluster needs to talk to kube-apiserver, and it does, it will always go via ours. But the scale is an order of magnitude different; if you have so much internal cluster traffic to the API that a single node cannot handle it, you probably did something wrong. The other limitation is that keepalived, in fact VRRP itself, is confined to a single subnet. This usually isn't a problem if you deploy everything in one place, but then we have customers who, for example, keep nodes separated from each other. Imagine a cluster of three control plane nodes and 100 workers, and they come to us and say: but you know, not every node can talk to every node, because those ten workers are so isolated that they can only talk with each other, not with any other worker. And at that moment you see that the Ingress VIP is not doing the job anymore. For this we have another solution on the Ingress side that I'm not covering at all; if someone is taking notes, which I doubt, it's called sharding. It solves this issue, but it's a different thing.

A bit about DNS, because until now it has all been about IP addresses, and that's ugly, and people don't want to remember IP addresses. In general we have this concept that you have your cluster, so you will have api.cluster.com, then you will have webapp.cluster.com, and you will also have api-int, which is the internal API, and that's for what I told you about, the cluster needing to talk to itself. But now, chicken-and-egg problem: we are bootstrapping a cluster. I just got three servers into my data center, in fact four, because I need the bootstrap. So, now what?
Seriously, am I trying to tell you that in order to install the cluster, first I need to go to my DNS, which I don't have because it's a brand new data center, so I need to install a DNS server somewhere first, create those records, and only then start the installation? No, that's bad. If someone told me to do it like that, I would go to some other vendor. So we solved this problem by running CoreDNS locally, as a pod, though that's an implementation detail, on every node from the moment it boots, creating the configuration for those well-known names that you give us in the install configuration. So from the very beginning, as soon as the servers are booted up, they have those records. We tell you in the documentation: please go to your upstream DNS and configure them; but at the same time we configure them ourselves, so that if you didn't do your homework well, we do it for you and you don't notice. Of course, if you never do it in your upstream DNS, other things will start failing at some point and you will not get external traffic, but at least the installation succeeds. The idea is that we don't want you to do a lot of housekeeping before you even start the installation.

A bit more implementation detail for anyone who has ever opened a resolv.conf file. You always see, you know, "Generated by NetworkManager, please don't modify", then a lot of stuff; and if you use systemd-resolved you just see "nameserver localhost" and then you keep wondering how the hell it is working. One important thing is that when you run containers on a host, they, I would like to say always, but I cannot say that because someone will give me a counterexample, very often just blindly consume the resolv.conf of the host. So what happens? If I run DNS as a pod, it runs on localhost; I run some other container and I tell this container "your DNS is on 127.0.0.1", and I think it will reach CoreDNS. But this is not true, because they are namespaced. The trick is to tell this other pod that its DNS is available on the external IP of the server where both pods live. And if you ask, "but then this traffic goes out onto the network, that's nonsense": well, thank you, kernel networking folks, traffic like this will never hit the wire. Even if you don't have any network interface up, this traffic will still flow, because of routing magic. NetworkManager comes into the picture because NetworkManager tends to modify resolv.conf every time you do anything with the network. If you unplug a cable, it will mess with your resolv.conf, and so on. So we use something called dispatcher scripts, hooks, to make sure that whatever NetworkManager does, our config always contains the CoreDNS pod. So we effectively run our own DNS, because we know that customers often don't do it well.

Okay, now we change the story completely and go deep into Kubernetes and how to run Kubernetes. I will be talking about a concept called node IP, and, long story short: you run a Linux binary, not even Kubernetes yet, and this binary is going to receive traffic. So what happens at the Linux level? You need to bind this application to some address, right? You can be lazy and tell it to bind to "::", and it binds everywhere; but if you want to do things in a bit more secure way, or however you want to put it, you would bind it to a specific IP address and a specific port.
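As a tiny, purely illustrative Go sketch of that last point (the pinned address is made up and would have to exist on your machine): binding to the wildcard address accepts traffic on every interface, binding to one specific address does not.

```go
// listen_sketch.go -- the difference between binding to the wildcard address
// and binding to one specific address of a multi-homed host.
package main

import (
	"fmt"
	"net"
)

func main() {
	// Lazy option: listen on "::", i.e. every interface.
	wildcard, err := net.Listen("tcp", "[::]:6443")
	if err != nil {
		fmt.Println("wildcard bind failed:", err)
	} else {
		fmt.Println("listening on every interface:", wildcard.Addr())
		wildcard.Close()
	}

	// Stricter option: listen only on one chosen address of this host
	// (192.0.2.10 is just an example; it must actually exist on the machine).
	pinned, err := net.Listen("tcp", "192.0.2.10:6443")
	if err != nil {
		fmt.Println("pinned bind failed:", err)
		return
	}
	fmt.Println("listening only on:", pinned.Addr())
	pinned.Close()
}
```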
So, let's go to Kubernetes. If we are in a cloud scenario, you go to AWS EC2, you buy a VM, this VM has one network interface with one IP address, and you have no problems. But now the question: if I have a server which has 10 network interfaces and 20 IP addresses, and I want to run the Kubelet and the API server, what do I do? Like, seriously, what is my topology and where do I bind? Because I don't want to bind everywhere; I don't want all the interfaces suddenly able to receive traffic to the Kubelet, and all this kind of stuff. Of course, the Kubelet is not stupid and it lets you control this, and this is exactly it: you see --node-ip. This is the smart parameter that means: Kubelet, please bind to this IP address, and that would be very, very easy. But now we go back to the scenario of a server with 10 network interfaces and so on. If I don't tell the Kubelet which particular IP I want, it will do something naive and just pick some IP. But then how is it going to know whether that IP is really the IP that you, as an admin, use to talk to the server? It has no such knowledge. I have this knowledge, because I defined the API virtual IP and all the rest; but the Kubelet on its own is, you know, as clueless as any single binary running on a single server. So we need something additional, and this is where the OpenShift-specific stuff kicks in: the component we call baremetal-runtimecfg, basically some runtime config, and this is the component that is in fact responsible for configuring the Kubelet to always use the proper IP address, because this component knows your network topology. It knows your virtual IPs, it knows how many network interfaces you have, and so on, so we work around the fact that the Kubelet's own logic is not smart enough. And it's not something we can always push upstream, because we don't put this concept of API virtual IPs upstream; it doesn't belong in Kubernetes upstream, it's specific to running things on-prem, or sometimes in a cloud.

So, yes, it looks complicated, and it is complicated. I will not read you all the design details. I just want you to know that you have the --node-ip param. Then you have the --address param, which I will not explain at all; bear with it, it's confusing and you have to handle it. And you also have --cloud-provider, to pour even more fuel on this fire. And you have this table that tells you what's going to happen if you put some value or no value: it tells you that if you don't put any node IP, it binds everywhere; if you put one IP, it binds to this IP; if you put two IPs, you think it will bind to those two IPs, but this is a lie, and so on. Then we started working on a KEP that would even let you say: bind to any IP from a given stack. So, bind to any IP, but only v4, and, you know, it's really a mess. It doesn't touch the address field, and it doesn't do anything with the cloud provider. But the cloud provider is a mess, because this table does not cover everything: it doesn't tell you whether the behaviour is different depending on whether you use a cloud provider or not. So I want you to look at this one line here: node IP has two IP addresses, 1.2.3.4 and a:b:c:d. So, you know, you assume that it will bind to this IP and to this IP, which sounds reasonable.
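To make the flag itself concrete, here is roughly what pinning the Kubelet looks like. This is an illustrative sketch only: the file path and addresses are invented, and in OpenShift the value is produced by the runtime-cfg machinery rather than written by hand.

```ini
# /etc/systemd/system/kubelet.service.d/20-node-ip.conf   (illustrative path)
# Pin the kubelet to the addresses we actually want, instead of letting it
# guess on a host with ten interfaces. Two comma-separated addresses is the
# dual-stack case from the table: one IPv4, one IPv6.
[Service]
Environment="KUBELET_NODE_IP=192.0.2.10,2001:db8::10"
ExecStart=
ExecStart=/usr/bin/kubelet --node-ip=${KUBELET_NODE_IP} --config=/etc/kubernetes/kubelet.conf
```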
It's a very simple dual-stack setup that uses one IPv4 and one IPv6 address. Remember this case, because it's going to fire in a moment. Now we diverge for a moment from the VIPs, sorry, from the Kubelet itself, and go into the heuristics of our own stuff. If a VIP is used, choosing the node IP is easy, because we know where the API lives. But you don't always have the VIP, and then you have a problem. We solved this problem by saying: fine, you don't have virtual IPs, maybe you have some external load balancing instead, but you do have a default route, and the default route will always be there. If you don't have a default route, you won't be running in the long term anyway. So this is how we solve it. It's a small niche, but some people need it. Of course, there are a lot of corner cases: what if you have a lot of IP addresses in the same subnet? What if you don't have any default gateway? What if you want the Kubelet to bind to multiple IP addresses? Or, more interesting: what if your IPv6 address is not really IPv6? And that, again, is the second thing to remember. The hint I will skip, because it's purely an override mechanism.

Okay, back to the first example I showed you: two IP addresses. I want the Kubelet to bind to two IP addresses, so I do it: --node-ip=129.something,fe80::something, I start the Kubelet, and it tells me: sorry, failed to run Kubelet, dual-stack node IP not supported when using a cloud provider. Oh, thank you, Kubelet. Very nice that you told me, in this huge table about supported configurations. So the question is: how do you do it? How can you run the Kubelet with a cloud provider, dual-stack? The answer is: today, you cannot. Thank you very much, the stage is yours. No, no, like, seriously: before 1.27 you couldn't, and that's it. We hack around it with some tricks in OpenShift, but in vanilla Kubernetes you just cannot.

The second thing on our stack: IPv6 addresses. There is an RFC; I'm not going to read the RFC, but I will show you in a moment an IP address which is simple and stupid. This RFC gives us two classes of addresses: the first one is "::" followed by a literal IPv4 address, and the second is "::ffff:" followed by an IPv4 address again, the ::ffff:0:0/96 range. It looks like a perfectly valid IPv6 address. Now we go into programming class 101. The standard libraries of most common languages have a function like isIPv6, to which you give a string containing an IP address, and it is supposed to tell you true or false depending on whether the address is IPv6 or not. Sounds easy. You go into the implementation of this function, and it is almost always as simple as: if the string contains a colon, return true. Well, those IP addresses match that perfectly, right? Well, they don't, really. Okay, is it visible in the back? What I'm going to do: netcat -l, listen on this IP address, so ::ffff: and then an IPv4 address, exactly like the RFC says, and some port. In plain English it means: netcat, please open a listening socket bound to this IP address. And netcat tells me: sorry, no. Okay, but it should work, like, seriously, it should work. So what do I do? I run strace, and I see something interesting. We won't do kernel-level stuff here; I just want you to notice that for this command, listening on this IP address, it tries to open a socket in AF_INET6, so an IPv6 socket, and it fails.
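The same confusion is easy to reproduce in a few lines of Go (nothing OpenShift-specific here, and the address is invented): the naive "contains a colon" check calls it IPv6, while the standard library treats the very same string as a plain IPv4 address.

```go
// mapped_sketch.go -- IPv4-mapped IPv6 addresses fool naive "is this IPv6?" checks.
package main

import (
	"fmt"
	"net"
	"strings"
)

func main() {
	addr := "::ffff:192.0.2.1" // IPv4-mapped IPv6 form; the address is made up

	// The naive check found in many codebases: "it contains a colon, so IPv6".
	fmt.Println("naive isIPv6:", strings.Contains(addr, ":")) // true

	// What the standard library actually thinks: it parses, and To4() is
	// non-nil, i.e. this is really an IPv4 address in an IPv6 costume.
	ip := net.ParseIP(addr)
	fmt.Println("parses as:", ip)                     // 192.0.2.1
	fmt.Println("really just IPv4:", ip.To4() != nil) // true
}
```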
Back in the demo: in the second command I'm telling netcat explicitly to open an IPv4 listening socket on this IP address, ::ffff: and the IPv4 part, just like the RFC, and it succeeds. I'll leave it as homework; the name, if someone wants to Google it, is "IPv4-mapped IPv6 address". Thank you, whoever invented this; I don't want to meet that person, because it's ugly as fuck. But, like, seriously, this IP address is what cost me probably three full working days recently, because someone fed my configuration, which was an IPv6 configuration, with this kind of address, and that was why it wasn't working. And you see how far you need to dig, because netcat doesn't even give you a reasonable error; it just tells you "bind to address: invalid argument". Like, what's invalid there? So life is sad.

So yeah, this is the last slide, in fact, kind of a summary. I joined the bare metal team, yes, out of time, so one minute. I joined the bare metal team because I thought I would be doing physical hardware, bare metal, and then what do I have to deal with? Cloud providers, because people install stuff on vSphere, people install stuff on OpenStack, which to me is a cloud, to everyone OpenStack is a cloud. But then you come, and all those concepts that I told you about are equally valid for any vSphere deployment, for any OpenStack deployment, for any private cloud. So in the past half a year I really needed to revise my understanding of what bare metal and what on-premise mean: from thinking that the only case of bare metal is the server standing over there, to the view that the only non-bare-metal environment is AWS. It really breaks some concepts in your mind that you've been learning your whole life, but it is what it is. So this is the end, one minute over time, so we can take questions offline, or just go and grab a beer or something, because it's so late. Thank you.