So, this is a beginner-level class. It's bits and pieces of things I've done in training sessions over the last couple of years. What we're going to do today is deep dive into how to get logs and other diagnostic info out of Kubernetes when everything has gone wrong and you don't know where to turn.

About me: I've been in IT for twenty-odd years now, depending on how you count. I worked for CoreOS, and before that Red Hat and Electronic Arts, although nothing especially thrilling at EA. Currently I'm a senior consultant for Otimo; we are hiring. This is my team lead, Sam, right at the front of the room. My blood type is caffeine positive, and there's my contact info down there, should you care to contact me, if only to say, "God, that presentation sucked. It was a waste of my time."

First thoughts. Brad Geesaman, whose presentation some of you may have seen (I know Sam has), makes the point in his slides that you have to get into the right frame of mind for some things. So I wanted to give you some first thoughts for when you dive in to debug Kubernetes.

First: it's not rocket science. Yes, it's a distributed system, but that doesn't mean it's actually hard to get the information you need out of it. Nothing we're doing here is really new; we're just poking at the system and seeing what falls out. Some of the tools are new, some of the pieces you probe are new, and that's about it. The general idea is the same as it's always been, right back to, if anybody's seen this image before, the quote-unquote "first actual computer bug," the moth taped into the Harvard Mark II logbook. The machine was going wrong, they interrogated the system, examined its outputs, and successfully debugged it.

This is how it feels when your application deployment fails: lights are lighting up red, people are getting woken up by phone calls, sirens are going off, and it's tempting to just throw your hands up and go, "whoa!" Take a deep breath. Don't panic. Do we have any IT Crowd fans here? Yep, then you know about the Little Book of Calm. Find the Little Book of Calm, take whatever other advice from classic works of fiction and film you care to take, and then let's fix it. The first thing you have to do is gather information, and that's what this talk is about. Then you form a plan, and then you test and execute your plan.

So how do you gather the info? If you've done anything with kubectl (we've now been told that "kube control" is the official pronunciation), you've probably done a kubectl get or a kubectl describe. Those are nothing new, and sometimes they're enough: you do a kubectl describe, you see an image pull error, you go, "oh, I forgot the image pull secret, of course it's not pulling the image," and you go off and fix it. If not, let's talk about logs. I know my humor is terrible.

The first thing a lot of people don't know about is kubectl get events. An Event is actually a resource type in Kubernetes, so kubectl get events lists events just like any other resource and gives you a summarized view of what the cluster has seen, how often, and when it last saw it. This looks like terrible formatting, and that's because when you see it, it usually is terrible formatting like that; this example is only slightly compressed in line length. If you have the ability to widen your terminal way out, do it. In this case, I've got something failing to bind to a PersistentVolume because there are no volumes available. If you do a get events and see that, maybe it tells you, "oh, I forgot to deploy whatever configuration I need to make that binding work."
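A quick sketch of what I mean, as commands; the flags and the namespace name here are just illustrative, not anything from my demo environment:

```bash
# List events, oldest first, so the newest failures land at the
# bottom of the output.
kubectl get events --sort-by=.metadata.creationTimestamp

# Scope it to one namespace and widen the output a bit.
kubectl get events -n my-namespace -o wide
```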
kubectl logs; note there's no "get" on logs. I still type "get logs" all the time, but it's just kubectl logs and then a pod name. You'll see things like this. This one is fairly innocuous: it's just a pod starting up. It's a Weave Scope pod; you're going to see a lot of Weave Scope logs as we go through this, because that's what I used for all my examples, and if I have time for demos, that's what the demos will use too. If the pod has multiple containers, you have to specify which container you want; this one only has one.

Container logs. Most of the time your container logs are your pod logs, especially if your pod only has one container in it. But if something has gone really wrong on your cluster and you can't get the logs from the pod with kubectl, you may have to SSH into your container host and get the logs that way. How you do that varies by container runtime: one way with Docker, another way with rkt, another way if you're using something else. I honestly haven't had time to look in depth at how CRI handles it, but I know there's crictl, which you can use to get logs and things like that. Just keep in mind that you may have to go back to good old-fashioned SSH.

You can also run what I think of as a debugging container, and I think somebody mentioned this in one of the keynotes today. You can SSH to a host, figure out where the process inside your troublesome container is, and attach another process to that container's namespaces. If you're running Docker, you can attach a whole container to that container's network namespace, PID namespace, and so on. What that lets you do is, say you're on a RHEL host but you feel more comfortable debugging in a Debian environment: you can SSH to the host, run a Debian container, attach it to the namespaces you need, and use the tools you're familiar with. Or, in some cases, there may be a specific debug container for your particular application, with diagnostic tools you can use to interrogate what's going on. Either way, it's also a bit of a Hail Mary: if your application doesn't provide good logs, or it's logging but doesn't know what it's having trouble with, at least you can fire up a debug container and look at the environment, and maybe that gives you an idea of what's going wrong. Also, if your host is having issues, do you really want to be installing packages on that host right now? A container is a good way to get something running so you can debug without making permanent system configuration changes.
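To make the namespace-attach trick concrete, here's a minimal sketch assuming a Docker runtime; the container ID placeholder is something you'd look up with docker ps first:

```bash
# On the node, find the container backing the troublesome pod.
docker ps | grep my-troublesome-app

# Run a throwaway Debian container inside that container's network
# and PID namespaces, so familiar tools see what the app sees.
docker run -it --rm \
  --net=container:<container-id> \
  --pid=container:<container-id> \
  debian /bin/bash
```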
What are we looking for? The first thing is your normal behavior, and I call this slide "normal abnormal," because if you looked at this output and didn't know it's perfectly fine when you start Weave Scope in an environment without Weave networking, you would start trying to debug it, and you might waste a lot of time debugging something that isn't actually your problem. Whatever your problem is, it's not that. So know what your normal, and your normal-abnormal, behavior looks like.

Networking issues. I don't really have an example for this one because the space is so vast; I don't want to give you the idea that there are eight bullet points and if you check them all you don't have a networking issue. No. But the advantage is that networking issues are the same as they've always been. If you're on the host debugging the network, you're debugging it the same way you always have: tcpdump, or whatever your packet capture method of choice is. Do you see the interfaces you should see? Did you deploy the cluster network? If you did, are the cluster network pods, flannel or Weave, starting up? Those are the things you're looking for. And sometimes you go back and discover a firewall you didn't know about, so debugging your network flows is something you may occasionally need to do. The annoying thing is that when network flows are the problem, that tends to be when your applications have no idea what's going on, because all they know is they can't communicate with each other.

Pod startup issues. Are your pods getting scheduled? Do you see them in the list when you do a get pods? If they're getting scheduled, are they starting up, or do they just stay in Pending forever? We'll look at examples of what that might mean later. Are they starting up and then crashing? There's a CrashLoopBackOff, an image pull error, maybe a failing init container. Or it may take a while: a pod starts fine, runs fine for a while, but it's leaking memory, and eventually that memory gets to be too much and it gets killed by the host because it ran out of memory.

Service discovery issues. Now, if anybody's paying attention here, you'll notice that we aren't in 2018 yet. This is my example of typoing a configuration: the Weave Scope agent is trying to contact the Weave Scope front end using a service name that doesn't exist, because I put the wrong hostname in there, deliberately, for demonstration purposes. This is what it looks like. Note that this message comes from the application again; we'll keep coming back to the theme that you depend a lot on your application logs. Beyond that, are you having DNS issues that aren't something you did wrong, like your cluster DNS not starting up or running properly? Did you deploy your app in the wrong namespace? Namespace matters to service hostname resolution. Do your services have endpoints? If the service is created correctly and the app is contacting it correctly, is what's supposed to be behind that service actually there? Maybe there's nothing there. So do a kubectl get endpoints on the service and see whether there are any.
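A sketch of those checks; the service and namespace names are made up for illustration:

```bash
# Is anything actually behind the service?
kubectl get endpoints my-service -n my-namespace

# Test name resolution from inside the cluster with a throwaway pod.
kubectl run -it --rm dns-test --image=busybox --restart=Never -- \
  nslookup my-service.my-namespace.svc.cluster.local
```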
Access control. This one has been biting a lot of people lately, especially in the Helm community. When Helm charts started doing RBAC things, the advice was initially to default to rbac false, meaning don't create the RBAC resources. Then everything flipped, and for about the last couple of months all the cluster installers have been defaulting to turning RBAC on. So now all these old Helm charts that worked fine when they were first created and uploaded to GitHub are failing, because they don't create the RBAC resources they now need.

This is what an RBAC failure looks like. Again, it comes from the application. You don't really get to see this in the API server logs or anything like that; I suppose if you sat down and did a tcpdump on the API server's IP you could see the requests coming in, but you don't get a good, easy view of RBAC issues in Kubernetes. So, again, the application has to be smart enough to tell you what's going on. Here we see it's failing to list things because it doesn't have permission to list anything; it can't read anything out of the API server at all.

The other thing people are running into now that it's gone stable is network policy blocking traffic, just like people used to bite themselves with iptables rules. I saw somebody do it the other day in a production environment. If your policies aren't written exactly the way they should be, you can cut off traffic that shouldn't be cut off. The other one I see, although not so much anymore because most cluster installers largely take care of it for you, is TLS issues: using the wrong TLS credentials, not having TLS credentials for something that requires them, expired certificates, or whatever the case may be.
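Before moving on from access control: if you suspect RBAC specifically, one way to check from the outside is kubectl auth can-i with impersonation, and then to grant access with a binding. This is a sketch with made-up namespace and service account names:

```bash
# Can the agent's service account list services cluster-wide?
kubectl auth can-i list services \
  --as=system:serviceaccount:monitoring:scope-agent

# If not, bind it to a role that can. The built-in "view" cluster
# role is a reasonable starting point for read-only access.
kubectl create clusterrolebinding scope-agent-view \
  --clusterrole=view \
  --serviceaccount=monitoring:scope-agent
```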
How do you fix it? It used to be that you had to go into the host and restart daemons and change config files and all that. You still change a config file, but Kubernetes does a lot of the cleanup for you. You fix your container tag or your network policy rule or whatever, you do a kubectl apply, and Kubernetes handles a large part of the cleanup. So don't make work for yourself. Don't go in and delete a bunch of pods in a failing deployment and then fix the deployment, because if you delete those failing pods, they're just going to be respawned as new failing pods. Let Kubernetes fix it for you.

Preventive measures. This is where I get to rant for a little bit. Application design: how many developers have we got in here? Yep, lots. How many of you developers are deploying applications on Kubernetes? Okay, not as many, but still quite a few. Do your applications log diagnostic info, and do they do so correctly? I call out here, just because I'm always amused by it, that "lp0 on fire" is still in the Linux source code. If you get that error today, and you can get it, it no longer means your printer is on fire; it means a particular combination of error conditions has happened. But that message is still in the source, and it still comes up for a combination of error conditions it's no longer relevant to. You should be giving people much more informative error messages about what's going wrong. Don't write "lp0 on fire" error messages. The reason I call that one out is that it actually did approximately mean your printer was on fire back when it was written, but it's been superseded by events. So always go back and review the failure circumstances of your application and how it behaves when they happen.

Are you logging enough? Maybe you're logging the correct things, but not enough of them, or not in enough detail. I love applications that let me specify the level of verbosity. You do a kubectl -v 99 and you get everything; you probably don't want everything, but it'll give it to you anyway. Even if you're sure you're writing detailed enough logs, go ahead and make them more detailed.

Do you have diagnostic tools, and do you document them? I talked earlier about using a debug container. What good is a debug container if it doesn't have the tools you need to debug anything? If your application needs specific debug tooling, include it; if your application can be debugged with a regular old tcpdump session, fine, but document that, because even if you have the tools, undocumented tools don't do anybody any good.

And the last thing, and I know I'm preaching to the choir on this and will be again on the next slide: how safe are your data and state in case of application failure? We're working in cloud environments now; this is something we have to worry about. You can't just say, "well, if the process gets killed, everything is resident on the disk." What disk? You may not even be on the same node when you come back up. Can you roll back from migrations? Have you tested rolling back from migrations? If you haven't tested your recovery plan, you don't have one.

Pre-deployment. There's been a lot of talk, especially in the keynotes, about automation tools. My favorite is Helm; I participate in the Helm community, and my demos are actually generated by templating a Helm chart I've been working on and have a PR in for. So you'll see Helm labels that aren't necessarily relevant to the demo itself; it was just a quick, easy way to get a demo going. One thing the Helm community preaches, for good reason, is to factor out redundancy, because every time you repeat yourself, you give yourself a chance to repeat yourself wrong. And your environment shouldn't just make those practices possible; it should make them the easiest way to operate in your environment. You should be telling your developers, "don't do all that work to deploy manifests manually; we have a Helm chart and a chart repo, just helm install." That's easy, and if it's easy and it works correctly, that's what people will do. So: Helm, Draft, and the other one, Brigade, a scripting system that helps automate a lot of things. Also consider the cluster autoscaler; it can help you avoid out-of-resources problems.

And the last thing: test environments are not optional. I don't care how small your budget is, you need a test environment. If you don't have the budget for a test environment, you don't have the budget to run. And I know somebody has a reason why they can't afford a test environment. You're wrong. You need a test environment.

Post-deployment application validation tests. This is another thing the Helm community has started doing: they want people to write a test into their charts, so that after the chart is deployed you can do a helm test. You should do the same thing if you deploy an application via Jenkins: after it's deployed, make sure it's actually running properly.
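As a sketch of that workflow with Helm v2-era commands (the chart, release, and value names are illustrative, and whether a chart exposes rbac.create depends on the chart):

```bash
# Install from a chart repo instead of hand-applying manifests.
helm install stable/my-chart --name my-release --set rbac.create=true

# Run the chart's built-in validation tests after deployment.
helm test my-release
```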
Ongoing monitoring. I don't have to tell anybody in here about monitoring your applications, right? But I think some people, and probably not anybody in this room, but maybe the management of people in this room, haven't gotten the message that you don't need to alert on every little thing. In a properly designed Kubernetes environment, a node going down at 3 a.m. is not a reason to page anybody and wake them up out of bed. It's a reason to log something and maybe send an email, but unless end users actually lose access to the application, don't wake people up. I saw a Prometheus presentation when I worked at CoreOS, and the presenter's main idea was that what matters is what the end user can or can't do. If the end user can do everything they should be able to do, you shouldn't be waking people up.

Adopt the chaos monkey. Every now and then, go into your data center and just yank a plug out, whether in a literal sense or a virtual one; somebody has a chaos monkey agent you can deploy into a Kubernetes cluster that will randomly kill pods and things like that. The reason I say that is: show me a server with high uptime, and I'll show you a server with unsaved changes just waiting to bite somebody as soon as it reboots.

Where you can get more help. The Kubernetes docs; I go back to the API docs all the time, and I was all over them while researching this talk. There's also other good info on the main Tasks page. There's a sidebar there, and it doesn't seem to work to link directly to specific entries under it, but if you go to the main page and look under "Monitor, Log, and Debug," there are troubleshooting-clusters and troubleshooting-applications pages that recap a lot of this info. There are the Kubernetes Slack channels, kubernetes-novice and kubernetes-users, depending on how confident you feel. A lot of the time people come into kubernetes-novice, and I actually saw this as the channel topic the other day: "if you don't get an answer, it may not be a novice question; go ahead and ask it on the other channel."

Look within yourself. If you've been working in IT for 20, 30, 40 years and you're thinking, "well, now I'm in a cloud environment, all that knowledge I have just isn't relevant anymore": no, it still is. None of this is new. Chen Goldberg said that yesterday, right here on stage: none of what we're doing is new, we're just putting it in a different environment. My favorite networking document is RFC 1925. It was submitted as a joke, an April 1st RFC from 1996, but it's one of those ha-ha-only-serious jokes. Go read it, and it will give you advice you can apply to your Kubernetes clouds today, I guarantee it.

The other thing I personally find useful is to take the time to sit down and fully describe the problem you're trying to solve, because in the act of describing it you often go, "oh, I know how to fix that." They call it rubber duck debugging, or teddy bear debugging, or whatever. Some people go describe it to a co-worker who doesn't know anything about it, which forces them to describe it at a high level of detail. Either way, take the time, take a breath, sit down, and think about it.

Demos: let's see, we have about 12 minutes left. I'm going to skip over the demos for now, because I want to give people a chance to ask questions, and I have a couple of final thoughts.

We just scratched the surface. Like I said, I just ran right over all of this, and I don't expect anybody to walk out of here and retain everything I said. That's why I posted the slides online, and why there's a link to them.
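For reference, since nobody retains a whole talk, here's the handful of commands this session keeps coming back to:

```bash
kubectl get events                    # the cluster's summary of what went wrong
kubectl describe pod <pod>            # scheduling, image pulls, mounts
kubectl logs <pod> [-c <container>]   # the application's own story
kubectl get endpoints <service>       # is anything behind the service?
```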
Aren't there tools for this kind of thing? Am I actually going to have to go in there and do kubectl logs and kubectl get events? There are tools, but what happens if deploying the tools goes wrong? What happens if the tools get deployed and then they fail? What if something happens that the tools don't know how to deal with? It's still good to have the base knowledge of how to do this manually, under the hood. But there are a lot of wonderful tools out there that will help you find and fix these kinds of things, and you should definitely go try out tons of them. Find the ones you love, that work well for you. Come back to KubeCon next year and tell us all about the awesome tool you found, how it helped you recover from a production sev-zero outage, and how much you love it, because that's what everybody's been talking about for the last three days: community. That's kind of what the community is all about. Go out, find things that work, and come back and tell everybody else about them.

Don't fall for imposter syndrome. This is just more "believe in yourself." You know more than you think you know, and once you start digging into it, you're going to realize, "oh, I know what we need to do here." If you feel like you're drinking from the fire hose, it just means you actually know what's going on. And as Ilya (I'm going to butcher his last name: Chekrygin) said on stage yesterday: trust yourself. That's what this is all about: giving you enough knowledge to get in there, get on that command line, and know that you can trust yourself to at least find out some of the information about what the problem is.

Does anybody have any questions? I have a microphone if anybody wants to borrow it, or you can just yell. No questions? Oh, one question. [Audience asks about getting logs from a previous run of a pod.] Yes, there is a way to do that. I don't remember it off the top of my head, but I know you can get the logs from a previous execution of a pod through kubectl, and if it's a single-container pod, those are the logs for that container. What's that? There you go, see, we're already sharing knowledge: kubectl logs -p. Yeah.

Anybody else? [Audience asks where to find the chaos monkey agent.] Oh, I should have put that in there, shouldn't I? I don't remember it off the top of my head, but if you look up "chaos monkey Kubernetes agent" in Google, I think you'll find it. I will make a point, though: when I post the updated version of these slides, I'll link the one I was thinking of.

You'll get to that in just a second. [Audience asks about debugging the cluster itself.] Yes, if it's the cluster itself, then depending on how that cluster was built, you'll almost always be connecting directly to a node and then either doing a journalctl, or looking at syslog if it's a syslog system, or looking directly at the Docker logs of the containers running the control plane components, if that's how it's set up. There are different ways, so there isn't really one answer, but yes, if you have really deep issues with your cluster, you probably are going to have to go really deep into the host to debug and resolve them. Generally, if it's something to do with the cluster itself, you'll see really severe problems, like you can't kubectl anything, right? And it won't be a bad-credentials, "oh, I have the wrong context file" kind of problem; it'll be "timed out" or "connection refused" or whatever the case may be. If you do a kubectl get nodes, you may see nodes in NotReady. That's often because their networking isn't correct: they're not starting up the pod network like they should, so they're not ready and can't be used for scheduling.
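A sketch of those cluster-level checks, assuming a systemd host where the kubelet runs as a systemd unit, which varies by installer:

```bash
# Start with the nodes themselves.
kubectl get nodes
kubectl describe node <node-name>

# On the host, the kubelet's logs on a systemd system.
journalctl -u kubelet --since "1 hour ago"

# And the command from the audience question: logs from the
# previous run of a crashed container.
kubectl logs <pod> -p
```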
Once you start getting down into that, the possibilities are vast, but again, a lot of the knowledge you already have from debugging host issues still applies. Anybody else? [Audience asks how the control plane components are typically run.] The most common modes I see, with the most common install methods, are running them directly out of a container on the host, or actually running them as self-hosted pods. The system I use for spinning up little demo clusters, bootkube, and also kubeadm, tend to run them as actual pods from manifests, so a lot of the time you can go into, for example, the kubelet logs; if the kubelet can't run the manifest for the API server, it will have some kind of log entry as to why that manifest wouldn't run or what the error might be.

Anything else? I think I have time for one demo, definitely, maybe two, so let me switch to my demo window. Make sure this is still up. Yes, I will make that much larger for everyone; there's the zoom on this guy, zoom and Ctrl-plus. Okay, I'm just going to shift it down one. These files, by the way, are also linked off of the slides; there's a QR code and a slide link that I'll put up at the very end of this.

So the demos I can do: an app communication error, a crash loop back-off error, a discovery failure, a failed mount error (that's a fun one, because it just never comes up), an image pull error, a pod stuck in Pending, or an RBAC issue. Show of hands for app communication error. Show of hands for crash loop back-off; oh, I think that one might win. Discovery failure, failed mount, image pull error, pending pod, oh, an RBAC issue. Okay, I'll do crash loop back-off, and then, let me see Pending again, and RBAC again; I think that's a rough tie, but I'm going to do RBAC if I have time.

So let me get my kube context here, cd into the crash-loop-back-off directory, ls broken. These are divided into "broken" and "fixed," so I have a broken deployment here, and we'll see it go into CrashLoopBackOff when I deploy it: kubectl apply -f broken-deployment. Okay. Now if I do a kubectl get, oh, see, I already have an error. If I watch, oops, I need to pass an argument to that. Oh, see, it's already gone into CrashLoopBackOff.

Now, what am I looking at there? I need to get the logs from this pod; not "get logs," I have to keep reminding myself of that. Ah, now this looks cryptic, but if you look at the manifest, this pod just runs a busybox container that runs a sleep command, and down here in the broken deployment I gave that sleep an argument of "R." Well, "R" is not a number of seconds to sleep, so the container quite reasonably terminates with an error. Kubernetes says, "that pod terminated with an error; I need to restart that container, because that pod needs to stay up, it's part of a deployment." But it's just going to keep doing that over and over, so eventually it goes into CrashLoopBackOff.

The solution is to fix your deployment. The way I do that here, I'll just scroll down manually, is I changed the "R" to, actually, the real Weave Scope container, something that will run normally, and I can kubectl apply -f, and now it's configured my deployment. If I do what I did before, we can see the old pod is already terminating, and the new one is coming up and has already gone into a Running state. So we fixed it, and Kubernetes cleaned up after us. We didn't have to do anything but fix it and reapply. That's the great thing about declarative systems management.
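I haven't reproduced my exact demo manifests here, but a minimal sketch of the same failure mode, using busybox and the apps/v1beta1 API that was current at the time, looks like this:

```bash
# "R" is not a number of seconds, so the container exits nonzero,
# Kubernetes restarts it, and the pod lands in CrashLoopBackOff.
kubectl apply -f - <<EOF
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: crashloop-demo
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: crashloop-demo
    spec:
      containers:
      - name: sleeper
        image: busybox
        command: ["sleep", "R"]
EOF

# The fix is to change the command to something that runs normally
# (my demo swapped in the real Weave Scope image) and re-apply;
# Kubernetes replaces the failing pods for you.
```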
All right, one more. I'm going to do the RBAC one. Let me clean up after this: kubectl delete, let me just space there. Okay, this one's going to be fun. All right, RBAC issue. Same thing as before: I'm just going to deploy the whole directory, and it creates a couple of things. Now, this is the fun part: these things will start up, so initially it looks like nothing's wrong. That's the old pod still terminating from before, but everything looks like it's running, so everything should be okay, right? But if I actually went into my Weave Scope front end, which I don't have time to set up the port-forward for right now, I would find that there's no data being reported to it, which means the agents either can't talk to it or have no data to give it.

So what's the problem? Well, we know it's a problem with the agent, just because we know how Weave Scope works. So I do a kubectl, not "yet logs," kubectl logs. See, here we go; this is all the stuff I showed you before: failed to list, failed to list, cannot list services at the cluster scope. What's the fix? In this case I'm not updating any of the resources I created before; I'm adding resources, a ClusterRole and a ClusterRoleBinding, and then those errors will go away. And now, if I do a kubectl logs on the agent, well, it's gone into a back-off; it's backed off to a minute and twenty seconds now. So what I'd probably do in reality is just go ahead and delete that DaemonSet and reapply it; in fact, I have probably just about enough time to do that. Delete. This is the fun thing about working with Kubernetes: you get to delete things willy-nilly. Broken DaemonSet, okay; it takes a while, it has to kill all the DaemonSet pods. This cluster is running back home in West Virginia, by the way. Okay, so now I can apply the fixed DaemonSet; oh, actually it's still the broken-daemonset file, because that part didn't change. Do a get po, pick an agent, kubectl logs, and: publish loop, starting. It's collecting info and has already started publishing it to the front end, which, again, if I had the port-forward to the front end set up, we'd actually be able to see as a nice little diagram of the cluster. I don't think I have time for that, because I'm getting the signal that time is up.

So with that, I will very quickly give my grateful appreciation to my Otimo management, Sam and Raja, and the other people at Otimo for supporting me getting here. Justin Garrison and Michelle Noorali helped me with my abstract, and CoreOS engineers and Red Hat trainers helped me learn how to do everything I just showed you. Thank you for listening.