My name is Jared Hickey. I'm a Principal Systems Engineer at Smartsheet; we're up in Seattle. We do a lot of interesting, crazy things and have a lot of fun doing it. Today I want to talk to you about Kubernetes networking, walk through some of the pitfalls I've encountered, and hopefully keep you from hitting them yourself.

First, I want to thank lots and lots of people; there are probably far too many to list. One person I do want to thank is Eric Sidham from Tigera. He's the one who really helped me get my first network stack up and running under Kubernetes, and he was patient enough to guide me through and show me the errors I was making. Definitely a good resource out there.

Also, if you have questions during the presentation, please hold them until the end. There's a lot of information here, especially with a 30-minute slot, and it's going to be awfully hard to fit it all in. If there's time at the end, we'll take questions; if not, I'll be happy to sit out in the hallway and answer them there. So with that, let's get started.

This is a fairly typical Kubernetes network setup. Principally, there are three networks involved. There's your physical network, the one everybody's aware of, with the switch that the computers are plugged into. But there are also two virtual networks: the pod network and the service network. On the pod network, each pod receives a virtual ethernet interface on the host, and as a result it sits on the pod network with its own address. We'll see more of that in a moment. I've annotated a few other things going on here, such as the kubelet doing health checks on each pod; the kubelet actually sits in the host OS and talks to the pod through the virtual ethernet interface. We also have services, which are the abstraction sitting on the service network, with endpoints pointing to the actual pods. We'll see a bit more of that shortly, too. But that's your typical setup right there.

Real quick, I wanted to touch on what network ranges you can use. I personally like to use the private RFC 1918 addresses for the pod and service networks; it solves a lot of problems. A lot of corporate networks tend to use the 10.0.0.0/8 network for their backbone, and the standard Kubernetes install tends to use those address spaces too. That's fine if you're not conflicting, but in a lot of cases it does conflict or causes other problems, which is why I often move them. That does create issues when you do your install; we'll talk about that in a bit. Whichever way you go, keep your networking simple. Don't try to be creative. Don't take, say, 192.168.0.0/16, break it up, and make part of it the service range and part of it the pod range. You'll drive yourself crazy, and it will cause all sorts of problems down the road. Just keep those address ranges separated and you'll be much better off.

So I have a couple of key understandings that I've tried to carry through this talk. The big one, key understanding number one: every pod can communicate with every other pod on the pod network. There's nothing funny you have to do; you just open a socket and talk, exactly like one computer talking to another over a typical TCP socket.
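To make that concrete, here's a minimal sketch of pod-to-pod traffic. The pod names and images here are hypothetical examples, not part of the talk's demo:

```sh
# Start a web pod and a client pod (hypothetical names and images).
kubectl run web --image=nginx --restart=Never
kubectl run client --image=busybox --restart=Never -- sleep 3600

# Grab the web pod's address on the pod network.
WEB_IP=$(kubectl get pod web -o jsonpath='{.status.podIP}')

# Talk to it directly from the other pod: no port mapping, just a socket.
kubectl exec client -- wget -qO- "http://$WEB_IP"
```

This works whether the two pods land on the same node or on different ones, which is exactly the point.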
So there are no port-mapping schemes. If you're dealing with Docker on two separate hosts, you have to do port mapping to get a container on one host to talk to a container on the other. That's not so in Kubernetes, and that's actually a benefit; it makes things a lot easier. Again, the pods are connected through a virtual ethernet interface that connects into a bridge, which eventually gets out through the network stack onto the physical ethernet. We also have kube-proxy running; we'll talk more about that when we get to services, but kube-proxy ends up controlling a lot of the iptables rules for moving traffic around.

Looking a little closer at the network substrate: the network stack itself is responsible for maintaining all the IP information, meaning the subnet information for all the nodes in the Kubernetes cluster, along with which IPs are in use. Whether you're using Flannel, Calico, or anything else, it's the responsibility of that stack to maintain those IPs. Every pod gets a unique IP, so even though it's a virtual network, it behaves just like a real one: each pod can communicate directly by plain IP address.

Now, here's one of the critical things, and probably my biggest failure when I installed Kubernetes the first time. I went through many installs, doing a lot by hand, but eventually Joe Beda at Heptio convinced me to use kubeadm, and I will tell you kubeadm is like magic; I love it to death now. My one failure was not understanding that when I ran kubeadm, I had to tell it my pod and service network ranges. I would just run kubeadm, start installing my stacks, and nothing would work. So that's probably my biggest finding here: whatever you use for your Kubernetes setup, whether it's kops, kubeadm, or any of the other tooling out there, almost all of them require you to say which network ranges you're going to use. They're just command-line switches, or if you use a config file, you list them in the config file, and that lets you move forward (I'll show a quick sketch of that in a moment). Once you specify them, things get a whole lot easier.

Another thing to keep in mind is that your pod traffic, all the communication between your pods, is not going to appear on your physical network the way you think it will. Of course it has to traverse that physical network, but it's encapsulated, so you won't be able to do a tcpdump on your physical interface and see the traffic like you normally would. How you look at that traffic depends on which network stack you install, but with most of them you just have a separate interface (we'll see that in a minute or two), and you can do a tcpdump on that interface. Understanding where your network traffic is going to be will help you in troubleshooting, and it will also keep you from getting frustrated, because you will get frustrated once in a while.
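Here's that sketch. This is a minimal, hypothetical kubeadm invocation; the two flags are kubeadm's standard options, but the CIDRs are just example RFC 1918 ranges, not a recommendation:

```sh
# Tell the installer your ranges up front, or nothing will work later.
# --pod-network-cidr: where pod IPs come from (must agree with your network stack's config).
# --service-cidr:     where service virtual IPs come from.
sudo kubeadm init \
  --pod-network-cidr=192.168.0.0/16 \
  --service-cidr=172.16.0.0/16
```

Whatever you pick here also has to agree with your network stack's own manifest (Flannel and Calico each ship with a pod CIDR baked in) and with the kubelet's DNS setting, which will come up later in the demo.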
I sort of added this next part in; I originally had it, pulled it out during my many rewrites, and added it back because it caused me some grief earlier this week while preparing some of the demos for you. So I want to give you a little heads-up about CNI, the Container Network Interface. CNI is a wonderful standard. It unifies things and makes it a lot easier to bring up your stack, so you don't have to configure each stack separately. It used to be a pain in the rear end to get a stack up and running; if anybody has ever done OpenStack networking, you'll know what I'm talking about. Every stack was unique in how you brought it up, and CNI solves a lot of that.

The interesting thing is that CNI will also accept multiple backends, so you could have two or three different ways of talking over the network through CNI, but that's pretty rare. Usually you're only going to have a single one, and that was my failure earlier this week: I had multiple configuration files in there, because I had been installing a couple of different stacks and some of their configs were left behind, so my Kubernetes setup was confused about how to talk over the network. That's something to be aware of if you're playing around. I also found out that some of the latest versions of kubeadm install some of the CNI configuration for you, which was also conflicting for me. So really keep an eye on that; you may have to clear out all your CNI configuration files as you bring your stack up, just to keep things clean. They're located under /etc/cni, usually in a net.d directory, so go down in there, clear them out, re-set up your stack, and things should hopefully be much better.

This is a typical CNI config. It's actually an older one, for the old host-local setup, where you had to maintain the subnet yourself, so every machine would have a different CNI configuration. With most network stacks today, your CNI configuration will be the same on all machines, which helps a lot and lends itself to configuration management and things like that.

So I've got a little demo here. Let's see how I get this started... there we go. In a moment I'll bring up a busybox pod and we'll look at the ethernet interface inside the container. It paused on me; that's why it stopped. Let's continue. There we go: we get into busybox, and now we can look at the interfaces, and we see just a single eth0. These are the sorts of things you'll see, and this is one of the things that always drove me crazy trying to bring stuff up: everybody says "run these commands and everything's magical," and you never know what to expect. I'm hoping this gives you an idea of what to expect. Backspace is my friend most of the time; it's the one key that wears out on my keyboard. All right, so again we see similar-looking interfaces, and the routing table looks about the same on both machines; the only difference is the route out on each one. Since we just exited, the pod hasn't terminated yet, so we'll keep pinging, and then we'll run traceroute and watch the traffic flowing, or not flowing, in some cases. First I have to go over to the machine the pod is running on, which is kube01, and then I'll run... oh, I looked at the interfaces first. So here you can see each of them, and there we go.
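If you want to poke at the same things yourself, the demo boils down to roughly this. It's a sketch: the pod name and the pinged IP are hypothetical, and the node name and flannel.1 interface match my Flannel-based lab rather than anything universal:

```sh
# From a workstation: start a throwaway busybox pod and look around inside it.
kubectl run -it bb --image=busybox --restart=Never -- sh
# Inside the pod:
#   ifconfig           -> a single eth0 with a pod-network address
#   route -n           -> default route out through the node's bridge
#   ping 10.244.2.5    -> hypothetical IP of a pod on another node

# Then, on the node the pod landed on (kube01 in my lab), look at the host side:
ip addr                          # the veth pairs, the bridge, and the overlay interface
sudo tcpdump -ni flannel.1 icmp  # with Flannel's vxlan backend, pod traffic shows up here
```

Remember, a tcpdump on the physical interface will only show you the encapsulated packets, which is why you watch the overlay interface instead.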
So hopefully by now you're starting to get a feel for what the network looks like in a pod and how traffic traverses back and forth. Let's move on and talk about services.

Key understanding number three: services are crucial not only for service discovery, but also for distributing traffic into your pods. If you have a service with, say, five pods running behind it, the service acts as a sort of simple load balancer across all five. The one thing it doesn't do, unlike a typical load balancer, is give you access control. It won't say "I'll accept traffic from this pod but not that pod," or from this address range and not that one. There are no real traffic controls; anything the service receives, it will pass along to one of the pods.

One of the secrets here is that the service VIP, the virtual IP, is actually iptables magic. You'll never see that IP on an interface, not on a machine, not in a pod, not anywhere else, and you will never see a route to the service network. Anywhere. It's all iptables magic.

The magic of the service discovery piece is kube-dns. Kube-dns creates an A record for every one of your services. The example about midway down on the third bullet is nginx.default.svc.cluster.local. Here nginx is the service name; default is your namespace, so if you put something in another namespace, expect that part to differ; svc is always going to be there; and cluster.local is the domain you've given your cluster. cluster.local is what Kubernetes expects by default, but it can be changed if you want to tie the cluster into a full DNS architecture.

I mentioned kube-proxy earlier; let me touch on it for a moment. A couple of things to know: iptables mode should be the default for kube-proxy on any installation now. I don't know exactly when that changed; I know it was iptables by at least 1.6, and someone may know if it was earlier. But nobody should ever use userspace mode unless you need to do some real debugging.

Principally, kube-proxy has to set up NodePort. NodePort lets you distribute traffic to your pods from the outside interface; it's a way of ingesting traffic into your cluster. When you set up a service with NodePort, it randomly chooses a port from the default range of 30000 to 32767, and that port gets opened on every single one of your cluster nodes. Any traffic received on that port is directed to your service, and not even in round-robin fashion: it does this randomly.

I've got a little diagram here. I actually had a better diagram at one point, but unfortunately I lost that image. The way it works is with iptables. A new connection comes in and hits the first iptables rule, which has a one-in-three chance of matching. If it matches, we're done. If not, the connection moves to the second rule, which has a 50% chance. If that one doesn't match either, the catch-all rule at the end takes all the remaining traffic. So if you haven't figured it out, it's one over three, then one over two, then everything: however many backends you've got, each rule fires with probability one over the number of backends remaining at that point.
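You can see those probabilities for yourself in the rules kube-proxy writes. Here's a sketch of roughly what that looks like for a service with three endpoints; in reality the chain names end in random-looking hashes, and the grep pattern here is hypothetical:

```sh
# On any node, dump the nat table and find the service's chain:
sudo iptables-save -t nat | grep KUBE-SVC-NGINX
# -A KUBE-SVC-NGINX -m statistic --mode random --probability 0.33333 -j KUBE-SEP-AAA
# -A KUBE-SVC-NGINX -m statistic --mode random --probability 0.50000 -j KUBE-SEP-BBB
# -A KUBE-SVC-NGINX -j KUBE-SEP-CCC
```

Each KUBE-SEP chain then DNATs to one pod's IP, which is where the one-third, one-half, catch-all pattern comes from.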
That sounds great, and it's a simple approach, but the simple approach is not always the best. There's a fallacy here: because the traffic is distributed randomly rather than evenly, you can get hot spots. If you get high traffic loads, lots and lots of traffic coming in, you'll start to see errors appear and then just disappear, instantaneously, and you won't easily be able to track them down. So be aware: if you're using NodePort for ingesting traffic and you have heavy traffic coming in, this can be an issue.

With that, I'd like to do the next demonstration, looking at services. We start by bringing up a three-pod front end. And here's a little thing I found, which is fairly common; a lot of people don't really think about it initially. It's actually a DNS problem. If we look at resolv.conf inside the pod, we see that the nameserver IP is out of spec: it's not in our service range. I'm going to go ahead and correct it here, just to continue the demo. Now I'll hit it like a web server, and I can't even spell, so that's a typical response. If you hit something that's not there, you'll get a 307, and when I hit it correctly on the correct port, I get a response from the nginx that's running there. So I exit back out. If you go look at how your kubelet gets configured (in most cases kubelets run under systemd, so the config is under the /etc/systemd directory), there's a parameter that sets the DNS server. So if you change the address range for the service network, make sure you set your kubelets up correctly too. And finally, I'm going in here and doing the same for the load balancers, setting them up manually and so on, to get things back on regular ports. All right, moving on.

So the way we normally get traffic in is through ingresses. An ingress resource lets you take your service and expose it to the world, on port 80, or 443, or whichever port you want to serve it on. While NodePort was great just to get started, ingresses are really what you should be targeting for any real work. Mostly everything out there today is layer seven, meaning HTTP and HTTPS, and most ingresses will terminate your SSL; that's about all that's really out there. There is a bit of a hack to do layer four with nginx, and, though I didn't put it on the slide, the F5 container connector also lets you do layer four through some special magic they do; it's not a standard ingress resource.

The ingress itself sits out on your physical network; that's how you see eth0 at the top of the diagram. The ingress then talks to the endpoints directly. A lot of people ask why the ingress doesn't talk to the service and let the service distribute the traffic. The reason is that it can take a moment or two for the service to update, so the ingress listens to the API server for updates on endpoints coming up and down and just talks to them directly. It's a lot easier and a little more robust that way.
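For reference, here's a minimal sketch of an ingress resource. The hostname and service name are hypothetical, and the apiVersion matches the era of this talk (newer clusters use networking.k8s.io/v1 with a slightly different schema):

```sh
# Expose the nginx service on port 80 through an ingress controller.
cat <<EOF | kubectl apply -f -
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: frontend
spec:
  rules:
  - host: demo.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: nginx   # the service whose endpoints the controller watches
          servicePort: 80
EOF
```

You still need an ingress controller (nginx or similar) running in the cluster for this resource to actually do anything.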
One thing that should be noted, too: of the three types of services out there, ClusterIP, LoadBalancer, and NodePort, ClusterIP is the type that cannot be sent to an ingress. ClusterIP is really geared toward talking within the cluster. If you have two microservices, one depending on the other, that's where you'd use something like ClusterIP to route traffic from one microservice to the other. And one more note on layer four: if you don't do it through the nginx ingress hack, the other way is to expose the service as a NodePort and then manually configure your external load balancer to forward traffic to that NodePort.

All right, to sum up a little, here are some of the network stack choices you have. This is by no means exhaustive; these are just four of the most common or popular ones, and there are a whole lot more out there.

Flannel, which is what I've been using for the demonstrations, is a very simple stack to bring up. It manages itself: you turn it on and it pretty much goes.

Weave, or WeaveNet, I've actually never used, but from what I understand it's pretty straightforward to set up, it scales better than Flannel, and it's easier to manage than Flannel when you go to scale.

Project Calico is actually my favorite, because of some of the other features it has, but it's very similar to Weave in scaling. There are indications that, specifically when you go to BGP mode, it will scale much larger than Weave, though I don't have anything to fully document that. One advantage it does have, and I've only found two or three network stacks that do this, is letting you specify egress rules. All the stacks adhere to the network policy API, so you can have ingress rules, but only a few also give you egress rules for controlling what traffic can leave, to make yourself more secure. (There's a tiny sketch of what an egress rule looks like at the very end, after I wrap up.)

And then Romana, which I looked at for a while because of the security aspect. I had to rule it out, but it does have some interesting aspects. It's much like the others, but one interesting thing is that it will let you take a service and expose it as a real-world VIP. So instead of doing a straight NodePort and getting a random port, you can expose your service on port 80, or any other port, on a specific real-world IP, so external services can talk to it.

So there are all the key understandings, summarized on one slide for you, and I'm pretty much done. I actually finished slightly faster than my run-throughs this morning, so I can take a couple of questions if there are any.
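As promised, here's that egress sketch: a minimal NetworkPolicy that only lets one set of pods talk to one backend. It assumes a stack that actually enforces egress (Calico, for example) and a cluster new enough to have egress in the network policy API; all the names and labels are hypothetical:

```sh
# Restrict the frontend pods so their only allowed egress is to the backend pods.
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-egress
spec:
  podSelector:
    matchLabels:
      app: frontend
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: backend
EOF
```

In practice you'd usually also allow DNS egress, or the frontend pods won't be able to resolve anything.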