Hello, and welcome to our real-world troubleshooting tips for OpenStack operators. My name is Anton Thacker. I work for Walmart, and two of my co-workers will join us on stage: Jeremy McCrory, also from Walmart, and Scott Atkins, from Walmart. Jimmy's actually a core contributor to OpenStack-Ansible, so hopefully we have a lot of expertise on stage.

We're going to talk about some general debugging tips and tricks. We'll talk about monitoring and logging, how to gather information about your VMs, and troubleshooting DHCP specifically. We'll talk about how Cinder creates a volume. We'll talk about how you can look at your logs in other ways if you don't have centralized logging. We'll do a little bit of Keystone troubleshooting, and hopefully we'll have some room for questions and answers.

This is a beginner talk. We're going to cover some specific things, but we also want you to think about how we approach things. There are different ways to troubleshoot problems, so we'll show you some patterns, and hopefully you can apply those patterns not only to Cinder, for example, but to other projects and other problems within OpenStack. To kick us off, we'll have Jimmy.

So really, the first thing with any troubleshooting is that you want more information about what happened. What went wrong?
OpenStack has a couple of really easy options for that. First, the CLI clients: every client, like nova, neutron, glance and keystone, has a --debug option that you can provide when running any command. What that'll do is give you the actual curl API request that was made against the API endpoint, the JSON data that went with that request, and the JSON data coming back. It's really useful for seeing what's actually happening, especially when you're starting out and learning how OpenStack works, how the REST API endpoints are hit and what they're doing.

Also, every service has a debug option in its config. The thing with that is that some of the services will be quiet 90% of the time. Then you turn on debug and it's a constant flood of information that's really overwhelming. But it has that one critical piece of information that really tells you what happened and what went wrong.

All right, so to walk through this, here's an example of a nova boot command. It's extremely basic, just the nova CLI with the debug option. This is only a portion of the information it gives you, but it's showing the endpoint, the example curl command that was run, the JSON data going back and forth like I mentioned, and the return code from that request, which in this case was 202, successful. The really important thing in this debug output, though, is the request ID. That's unique for every request you make against an API service, and we'll be following it through all of Nova, following the boot on the controller side.
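As a rough sketch of what following a request ID can look like in practice, the snippet below pulls request IDs out of some text and keeps only the log lines that mention one of them. The sample lines are made up for illustration and are not real Nova output; the only assumption about the format is that request IDs look like req- followed by a UUID.

```python
import re

# Assumption: request IDs have the form "req-" followed by a UUID.
REQUEST_ID_RE = re.compile(r"req-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def find_request_ids(text):
    """Return the distinct request IDs found in text, in first-seen order."""
    seen = []
    for match in REQUEST_ID_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def lines_for_request(lines, request_id):
    """Keep only the log lines that mention the given request ID."""
    return [line for line in lines if request_id in line]


# Hypothetical sample lines, invented for this example.
SAMPLE_LOG = [
    "nova-api INFO [req-11111111-2222-3333-4444-555555555555] POST /servers",
    "nova-api INFO [req-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee] GET /flavors",
    "nova-scheduler DEBUG [req-11111111-2222-3333-4444-555555555555] host selected",
]

if __name__ == "__main__":
    first_id = find_request_ids("\n".join(SAMPLE_LOG))[0]
    for line in lines_for_request(SAMPLE_LOG, first_id):
        print(line)
```

The same filter works on lines collected from several services, which is the point of the request ID: one value ties the API, scheduler and compute entries together.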
This is the nova-api service's log. With that same request ID, you're seeing the same JSON data that the client showed you was sent. You're also seeing the request that nova-api made to Glance about the requested image for your VM, and then some messaging between the Nova services. Especially when starting out with troubleshooting, I think the most critical advice I could give you is to learn how OpenStack works underneath, which services talk to which other services, and then see where it's falling apart, if and when it happens.

Next is nova-scheduler. That's the point where it's actually asking, where should I put this VM? This is a really good example: with default logging, not debug, this service is basically quiet. There's almost never anything logged. When you turn on debugging, you can find out what filters were applied, why it decided to choose a given hypervisor, and which hypervisor it actually chose.

And then on nova-compute, the actual hypervisor side, it's the same original request ID, and it shows that it was successfully scheduled and that it successfully created the VM. So really, going through that flow, if you know how Nova works, each service to each other service, you can see where the request failed and which service conked out.

So logs are great, and you're going to run into problems. Logs are probably the first place you want to look. But standing up a cloud for the first time, that's hard.
It gets harder when you actually have people using it and relying on it. This full list of teams is just an example of everybody you would pretty much have to answer to whenever something does happen. You want to have those logs to point to and say, here's what happened, to explain why, and to help them remediate the issue.

The important thing, though, is to try to avoid it in the future. So you want monitoring in place: you have an issue, you put a monitor in to catch that issue. And then really the ultimate goal is to auto-remediate, to self-heal. You have an issue, your monitor catches it, and then it tries its best to resolve the issue at that point, if it can.

Another really important thing that you want to strive towards is consolidating your logging. You don't want to be logging into every machine, every OpenStack server, every hypervisor, any time there's a problem, to dig through the logs and see which service failed. It gets messy, especially when you're running at scale. So consolidate your logging: have it go to a centralized log server, to an Elasticsearch. There are great benefits to that. Somebody's going to email you and say OpenStack is broken, and that's not helpful. So you want to be able to say, OK, in this range of time, what's going on? Which services are erroring out? Where are my 500 errors from my APIs? And really dig into that. That's a good starting point.

Graphite and Kibana give you great dashboards to visualize this and provide your customers with information, so they're aware of it too. Just having an open area where everyone has access to see what's going on within your cloud gives you more time to actually fix the issue, instead of answering, yes, it's down, here's why. You can actually solve it, and they can see for themselves what went wrong. And then here's just an example of one of our many dashboards. Not a lot happening here.
There's a small blip where a hypervisor went offline. But we can say to our customers: you're on this hypervisor, it went offline during this period of time, and that's why your VM was offline. And then Scott's going to talk a bit more in depth about VMs.

So, taking a step back for a little bit: some of the basic troubleshooting often revolves around the VMs themselves. Users complain that they tried to launch a VM and it didn't launch, it errored out, or they've got some other kind of issue. Maybe they lost their network on a VM and they need a little help troubleshooting it. So how do you tie the information that they give you to the controller and to the hypervisor, and understand basically the whole picture of it?

I would say start way at the back. Understand how your hypervisors are named, and the easy way to do that is with nova hypervisor-list. Other commands take hypervisors as their inputs to limit the output that you get from the nova command. Otherwise you'll get too much information to wade through, and knowing whether your hypervisors are named with a fully qualified domain name or a simple name can make a difference. This is an example showing that all of our hypervisors here use fully qualified domain names.

Then you can take a look at what VMs are running on a specific hypervisor. In this particular case, we picked one hypervisor, HV02, and did a nova list. By default, nova list only returns the list of VMs for the tenant that you're currently running as, and in this particular case we're running as the admin tenant. You need to specify the all-tenants option in order to list all VMs across the board.
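If you capture that table output, a few lines of Python can pick out the VMs in an error state. This is a minimal sketch: it assumes the usual bordered table layout that the CLI prints, with a Status column, and the rows below are invented for the example rather than taken from a real cloud.

```python
def parse_cli_table(text):
    """Parse a '+---+' bordered, '|'-separated table into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first data line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def errored_instances(rows):
    """Return the IDs of rows whose Status column is ERROR."""
    return [row["ID"] for row in rows if row.get("Status") == "ERROR"]


# Hypothetical output, shaped like a VM listing for one hypervisor.
SAMPLE_TABLE = """
+------+------+--------+---------------+
| ID   | Name | Status | Networks      |
+------+------+--------+---------------+
| id-1 | web1 | ACTIVE | net1=10.0.0.5 |
| id-2 | web2 | ERROR  |               |
| id-3 | db1  | ACTIVE | net1=10.0.0.7 |
+------+------+--------+---------------+
"""

if __name__ == "__main__":
    print(errored_instances(parse_cli_table(SAMPLE_TABLE)))
```

The IDs it returns are what you would then pass to nova show to see the fault details.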
So in this case, the HV02 VMs are showing. We've got four of them, one of them in an error state. To take a closer look at the VM that errored out, you can use nova show and pass it the UUID of the VM, and you get a whole bunch of information associated with it. It might be a little difficult to see, but some of the information that you get here is the hypervisor that the VM is running on, the instance ID, which is what libvirt uses on the hypervisor side, and IP address information. You get the tenant ID and the user ID of the person that actually did the launch.

And then of course you can see here that you get a rather large Python error. This particular error is a little nondescript. I think it says at the very bottom that an unexpected error occurred, which is really helpful. Sometimes you can take this information and work through the Python stack trace to digest what really happened. A lot of times the kind of errors that you would see here will be something like a quota being exceeded. Maybe you ran out of Neutron ports, or there weren't enough security groups, or something like that. You would see that very clearly here and be able to take it immediately back to the user without having to do any more troubleshooting. A little finicky.

You can also take a look at the console output for that VM, basically as if you were watching the boot-up cycle when the VM was launching. You use nova console-log and again specify the UUID of the VM. In this particular case, I just pulled some specific information out of the console, and you can already see some useful information that can be used later for troubleshooting. That includes IP address information.
You can see the MAC address. You can see whether the VM has more than one interface showing, for whatever reason. And you even get a little information about the SSH key that was used. The IP address and MAC address are probably the most useful pieces of this puzzle.

Next, suppose you want to take a look at the firewall rules. A lot of users refer to them as firewall rules on a VM. Security groups are one part of it, stored as part of the VM configuration, and then you've got the other side, on the hypervisor, where they're stored as iptables rules or Open vSwitch (OVS) rules.

If you want to take a look at what the rules are for a VM, you can use nova show to get the list of security groups that are there. In this particular case, I chose a VM that shows two security groups, a default and then another security group. I also pulled out the tenant ID, because if anybody's ever run the next command, which is neutron security-group-show, or has done a security group list, you'll see lots and lots and lots of security groups, and most of those are probably named default, considering that every tenant has a default security group. You need a way to actually narrow that down.

So in the next command, I take the tenant ID and use it as a way to limit my output: neutron security-group-list --tenant-id, and then I'm grepping for the two security groups. Here I get only a single default, as opposed to the hundreds of defaults, and the important pieces of data you get from this are the IDs of the security groups themselves.

We can then take the security group ID and feed it into security-group-show, which is another neutron option, and you can now see that this particular group has two rules in it. I highlighted the security group rule IDs in red. OpenStack's really big on IDs.
You have to keep them straight, because in this particular case you'll see the word ID used multiple ways in different contexts, and you need to make sure that you're pulling the right ID.

Taking a look at the other security group, this one has a lot more rules. I didn't highlight the IDs on this particular one. I'm not actually going to go through the exercise of showing what these rules are, but you can see the port information that is showing for them: ports 8080, 8443, 8009 and 22, so SSH is being opened. All of these are very clearly showing for this particular security group.

Now, taking a look at the hypervisor side, we want to connect up the dots. We were just looking at it from the controller side. On the hypervisor side... see if I can go backwards and come to it again. Well, if you are able to clearly read this slide... this is a little PowerPoint glitch.

The easiest way to take a look at VMs on that side is to look at the process list. Look at the KVM processes running on the hypervisor: do a ps -ef and grep for the qemu processes. And if you just examine a single process, which really would be nice to show, what I highlighted in red is that the process line shows a number of details about the VM that you can readily see and immediately start to track other pieces of information from.

Some of the information that you get includes the UUID of the VM for that particular process. You can see the port information, the port name, and an IP address for the console port. Let's say for some reason the VNC console is not working in Horizon. You could SSH tunnel through, connect straight to the port, and take a look at the console visually if you wanted to do that. Other information includes the actual physical directory on the hypervisor that stores the configuration for the VM, the disk images and stuff like that. You can also see the memory and CPUs allocated to the VM, all from the process line.

If we go take a look at the physical directory on
the hypervisor, you can see here that you get the console log. This is what the nova console-log command actually dumps out. You see a couple of disk images. The one that says disk is the root disk, and the one that says disk.local is the ephemeral disk. And the libvirt XML is the configuration of that VM, which the KVM process is built from.

Another thing to keep in mind when you're doing troubleshooting is that the disks are not really full-fledged disk files. They're backing-store disks that map back to an RBD image that is downloaded by Glance. If you know how VMs are launched, one part of the process is that the VM is pointed at an image. Once the hypervisor is chosen, that image is looked for on the hypervisor: does the image already exist on disk? Say it's a CentOS image. If not, Glance will download it, and once it's downloaded, the KVM instance is launched as a copy-on-write of that specific image. So the disk and disk.local files that we saw on the previous slide are basically deltas of the images stored in the Glance cache.

The cache is basically /var/lib/nova/instances/_base, and you can see here that there's a whole bunch of images already cached. I couldn't tell you what they are, but for the ones that say ephemeral, you can figure out that those are probably 100 gig, 17 gig, 20 gig and 45 gig disk images.

If you wanted to actually track which of the KVM processes are using which of these images, you can use fuser or else lsof. I find fuser to be a little bit easier to use. If you know the process ID of the process that you're looking at, you can basically connect up the dots, as I show here as an example. In this particular case, you can see which Glance image is associated with the disk file and which ephemeral disk image is associated with the disk.local file.

So, coming back to the whole firewall rules concept of the VM: how do you associate the iptables rules that are running on
the hypervisor to the security group rules of the VM? The easiest way that I find to do this is to look at that libvirt XML file and look at the tap interface that's there. You simply grep the tap pattern out of that file. In this particular case we've got a single tap device, and then you can use iptables and grep for that specific tap. You see here a number of chains that show up, highlighted in yellow. There are actually two output chains and one input chain showing, but it's enough information to tell you the exact name of the chain. Then, in the next example, you can use iptables to dump out the entire input and output chains, and that should map pretty closely to the security groups that you saw previously.

Looking at the top one, that's the input chain. You can already see ports 8443, 8080, 8009 and port 22, which is exactly what we saw in the previous examples. In both cases, input and output, one of the other things that you'll see is a couple of additional ports. In this particular case, you'll see that the DHCP ports have automatically been added to this particular VM's iptables rules, because, well, it's got to get DHCP.

So, continuing on that note: I'm sure everyone's seen this at some point. You boot a VM, it's just not accessible, and DHCP is to blame. The DHCP server's not behaving as expected. When you're using Neutron, every VM interface creates a port in Neutron, and you can look that up with the neutron port-list command, just looking for the IP that the VM was assigned. Then, on your hypervisor, a basic tcpdump command can watch while that VM is booting and see what's going on with the DHCP traffic. That's ports 67 and 68. In this case, as the VM's booting, it's seeing the request from the VM to the server, broadcasting to any DHCP servers, but there's no response.
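A small sketch of how the port ID, the tap device and the DHCP capture fit together. It assumes the common Neutron naming scheme where the tap device is "tap" followed by the first 11 characters of the port ID; check your own hypervisor before relying on it. The port ID below is made up, and the function only builds the tcpdump argument list, it does not run anything.

```python
TAP_PREFIX = "tap"
# Assumption: the device name keeps the first 11 characters of the port ID,
# which keeps the whole name within the Linux interface name limit.
PORT_ID_CHARS = 11


def tap_device_name(port_id):
    """Derive the expected tap device name for a Neutron port ID."""
    return TAP_PREFIX + port_id[:PORT_ID_CHARS]


def dhcp_tcpdump_argv(interface):
    """Build a tcpdump command line that captures DHCP traffic on one interface.

    -n skips name resolution, -e prints MAC addresses, -i selects the interface.
    """
    return ["tcpdump", "-n", "-e", "-i", interface,
            "port", "67", "or", "port", "68"]


if __name__ == "__main__":
    # Hypothetical port ID, invented for this example.
    port_id = "3f1c2d4e-5a6b-7c8d-9e0f-112233445566"
    print(" ".join(dhcp_tcpdump_argv(tap_device_name(port_id))))
```

Printing MAC addresses with -e helps because the VM's MAC is one of the things you already have from the console log and the port list.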
So obviously something's going on. On your network nodes, there's a network namespace created for every network, for the DHCP agents, along with the routers. There are a few examples here. Each of those network namespaces correlates to a Neutron network that you've created in your environment. So this first highlight at the top, the net-list, matches up with this DHCP namespace. It's named qdhcp- and then the same ID as your Neutron network.

All right, so you can also run commands within that namespace. It's just ip netns exec, your namespace, and then whatever command. Here I'm just listing interfaces with ifconfig, and that's giving back the loopback and the interface that correlates to the DHCP server. Again, here it kind of cuts the name off. I don't know, I've never liked why they did that, but maybe it's some kind of character limit. It gives you ns, for namespace, a dash, and then about a third of the ID of the Neutron port. So if you do a neutron port-list here again, you can find this. It'll tell you the IP as well, 10.6.112.12, and if you do a neutron port-show on that, it'll tell you it's DHCP. So that's the interface within the namespace.

You can also do a tcpdump within the namespace against that interface. Here I booted the VM, and on the hypervisor I'm seeing those requests come through for DHCP. At the same time, I'm looking at the network node, in that namespace, on the DHCP server's interface, and there are no packets. So obviously something's going on in the namespace. And again, I can try to ping from the namespace. I can't ping the gateway. This is a provider network, so just a basic VLAN. I should be able to ping the gateway, and I can't.

There are a few things you can check, basic Neutron things to look for. Make sure your agents are healthy. Make sure the DHCP agents are actually hosting your network.
There are a few commands for that. I can never remember them, but if you do neutron help, it'll list them all out, and you can just grep for DHCP. That'll tell you everything DHCP related. You can list the networks on a specific DHCP agent, so give it a specific DHCP agent and ask what networks are on it, and then if yours is missing, just add it.

I think a common IT thing is to blame everything on the network, and sometimes it actually is the problem. I'm not a networking person by any means, but I know a few basic commands: tcpdump, ifconfig and a few of the ip commands. I can provide that output to the networking team and say, what's going on here, here's some output, and they can actually make sense of it.

So, a few common problems. You have a VLAN network for your VMs. The port's trunked, but it's not allowing the VLAN for your VMs. Then the network team needs to add that on the switch port. Another thing we've also seen before is that they're just allowing everything, and it's tons of noise on your network and tons of dropped packets. Just keep those pruned. It'll help you in the long run.

So after the network team was given the information, they added the VLAN to the port, and here are the same steps repeated. I can ping the gateway. tcpdump is seeing the traffic coming in from the VM, the request, and also a response being provided. And on the hypervisor side, I rebooted that VM, the request comes in, and I'm getting a response from the DHCP server.

So, to take a step back: we dug deep into some of the networking components and some Nova pieces, so I wanted to take a step back and talk over how a Cinder volume gets created, just to give you a picture of how complex pretty much everything in OpenStack is. When you request a new volume from Cinder, a request comes in through the REST API.
It could be through the CLI tools, through Horizon, or through some other API method. The cinder-api service receives your request and does initial validation. It obviously validates the Keystone token, and it'll validate your quotas. If you don't have enough quota for the Cinder volume that you're requesting, you'll get a 400-type error right away. If everything checks out, it will actually place this information into the queue and also into the database. This is very, very high level. I'm skipping some things here, just to do a 10,000-foot view of how things work.

So then cinder-volume picks up the request through the database and through the queue, and actually forwards it to the cinder-scheduler. The cinder-scheduler creates a list based on the size, the type, the availability zone, and maybe extra specs or anything else like that that you're requesting. After that, cinder-volume iterates through the list and actually talks to the back-end drivers until it's able to successfully create the volume. The back-end driver does the grunt work to create the resource. If you're using Ceph, it'll talk to Ceph. If you're using any other supported Cinder back end, it'll talk to that particular back end and create the storage resources. After that, cinder-volume gathers all the information about the volume.
So you'll have all kinds of metadata that gets gathered. Probably the most important thing you need is the connection information. If you're using something that's iSCSI based, you need to know how to connect to that volume. It'll have information about your Ceph cluster or any other kind of back end. And then, I think this is in the wrong order, but the Cinder API at some point will respond back to the client so that you know the request was successful.

So you can see that there are multiple entities that work together to fulfill this request, and this is a very, very simple request. If you're doing things like attaching a volume to a VM, you'll be dealing with Nova. If you're booting from volume, that gets a little bit more complicated. If any of these things break down, you're going to have problems. For example, if your back-end driver isn't set up properly, your request will go through and you'll get information about the volume, but pretty quickly it'll error out and you'll get a volume that's in an error state.

So let's take a look. This is really another way of looking at things. Like Jimmy was saying, you should ideally have centralized logging set up, but what if you go to your ELK stack, your Kibana console, and you have all zeros.
Nothing is working, and you've got to troubleshoot this. So this is an example of another way of dealing with it. If you're troubleshooting Cinder, you're going to be dealing with all of those services that I talked about. And if some of them are running inside containers, you've got to connect to all those containers, and it's kind of a pain to gather all the logs.

A shortcut, until you fix your centralized logging, is that you can use Ansible. Here I'm calling Ansible on the cinder_all host list, and I'm just doing a grep command across all of the Cinder logs, essentially. Going back to what Jimmy was talking about earlier, you do have a unique request ID, and if you know that request ID from your initial API request, you can follow it through. This is sorted in time order.

Initially you can see that the Cinder API receives your request. It sees that it's a request to create a two-gigabyte volume. It actually passes that to... so actually the last line that you see, that big blob, is all kinds of information about the volume. Most of it is default values. You can see things like the Cinder volume type, the size, who requested it, the name of the volume, all kinds of stuff that's associated with this request.

This is the same request. As you go through, you can see right away that the Cinder API returned a 200 status code, which means it was a successful request. Then, sort of in the middle, right there, you can see there's an error saying that the volume service is down, and below that you actually see that the volume was created successfully. If there are multiple cinder-volume services, and there are multiple ways for cinder-volume to create the volume, it'll keep trying until it succeeds. In this particular case, for some reason, I have a cinder-volume that's down.
I should probably go troubleshoot that, but we probably don't have time for that. Anyway, at the end you have a message that says the volume was successfully created. So this was a grep command on that particular request ID. Obviously there are going to be more log entries in the individual service logs for this request, and you can dig deeper into those as you're trying to troubleshoot some stack trace or something like that. So hopefully you get some information that's useful.

As an example, if you're troubleshooting Ceph, one of the very basic things is that your individual nodes need to be able to talk to the Ceph cluster. Ceph is a client-cluster model, so if you go onto your node, you should be able to run a ceph status command to get the status of the Ceph cluster. If you don't get that using the cinder-volume user, that probably means you don't have proper authentication set up, or something else is wrong. Maybe you don't have all the Ceph libraries installed, and stuff like that.

I also wanted to give a specific example that I've seen in production. If you have a misbehaving client that doesn't cache Keystone tokens, and it's maybe a very, very busy application, it'll start hammering Keystone with new token requests. A symptom of that can be slow responses to new token requests, or your Keystone service actually erroring out if it's completely falling over under the load. You'll see high load on the Keystone service. Obviously, you'll have a ton of stuff going into the token table in the Keystone database. So those are all symptoms of what's happening.
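One rough way to spot that kind of client is to count token requests per source address in whatever access log sits in front of Keystone. The sketch below assumes Apache-style access log lines, with the client address first and the quoted request line in the usual position; that format, and the sample lines, are assumptions for illustration. The two paths are the Keystone v3 and v2.0 token endpoints.

```python
from collections import Counter

# Token issue endpoints for the Keystone v3 and v2.0 APIs.
TOKEN_PATHS = ("/v3/auth/tokens", "/v2.0/tokens")


def token_requests_by_client(lines):
    """Count POSTs to the token endpoints per client address.

    Assumes common-log-format lines:
    client - - [date zone] "METHOD path protocol" status size
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 7:
            continue  # not a line in the assumed format
        client = parts[0]
        method = parts[5].lstrip('"')
        path = parts[6]
        if method == "POST" and path in TOKEN_PATHS:
            counts[client] += 1
    return counts


# Hypothetical access log lines, invented for this example.
SAMPLE_ACCESS_LOG = [
    '10.0.0.5 - - [10/Oct/2016:13:55:36 +0000] "POST /v3/auth/tokens HTTP/1.1" 201 1234',
    '10.0.0.5 - - [10/Oct/2016:13:55:37 +0000] "POST /v3/auth/tokens HTTP/1.1" 201 1234',
    '10.0.0.9 - - [10/Oct/2016:13:55:38 +0000] "GET /v3/projects HTTP/1.1" 200 512',
    '10.0.0.9 - - [10/Oct/2016:13:55:39 +0000] "POST /v3/auth/tokens HTTP/1.1" 201 1234',
]

if __name__ == "__main__":
    for client, count in token_requests_by_client(SAMPLE_ACCESS_LOG).most_common():
        print(client, count)
```

If a load balancer terminates the connections, the first field will be the load balancer's address unless the original client address is being logged, so the counts are only as good as the address in the log.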
You will be able to see a lot of requests in the Keystone logs. If you are doing TLS offloading, if you have an SSL load balancer in front of the Keystone service, which you should, you want to make sure that you have X-Forwarded-For enabled, so that you can look in the logs and see where the requests are coming from. Another cool thing you can do, if you have telemetry services enabled for your identity service, is go into the metrics information and find out who is actually hammering the Keystone service. One way to mitigate that is to put a limit on your load balancer on how many requests can come in from a particular IP address.

So here's an example specifically for Keystone. I think this was done last year, maybe a little bit longer ago. I think it was ratified initially in 2013. Cloud Auditing Data Federation (CADF) is a standard for all OpenStack services to do auditing and logging. The first box above shows you some of the things that you need to enable in your Keystone configuration file. If you enable this with Keystone, it gives you tons and tons of information, in this case about Keystone, but you can obviously use it for other services as well. So here's an example, excuse me, here's an example of a successful authenticate event, and you can actually see all kinds of information here.
This is all one line in the log file, but you can send this to your telemetry services. Hopefully you have all that centralized and automated, and everything is alerting properly. But this gives you a lot of information about the authentication request.

Here are some more examples. The first one is an example of a user being created in Keystone. You can see, for example, that the outcome is successful. You can see the tenant ID, and you can see who tried to request this, all kinds of stuff. Another example is a role being created. There's all kinds of very useful information. Again, you can look at this manually in the logs, or you can have it properly federated into telemetry.

There are other ways to consume these events as well. You can use Ceilometer to look at the telemetry data that's available. I provided an example here of looking at a particular authenticate event, and you can see all kinds of information associated with it there. You can also consume the data directly from the Rabbit queue, if that's something that you're interested in.

As for why you would want to do this, Anton covered one example: you're getting some kind of token flood issue from a misconfigured app. But you may also be trying to troubleshoot user authentication errors. Is the user actually not typing their password right, or is somebody possibly trying to break into their account? If you are capturing these events, you can quickly determine what's going on and troubleshoot it from there.

Some of the ones that we've looked at are authentication successes and failures, user creates, updates and deletes, and project creates, updates and deletes. There's actually a wiki page that describes all the events, and it's not just Keystone related, but also Nova and Neutron.
There's a huge list there that you can take advantage of and leverage in your troubleshooting. Whoop.

So anyway, that says Questions. We have time for just maybe a couple of questions. If you could come up to the mics and ask there, that would be appreciated, and if anybody else wants to ask questions, we'll wait outside as well. Thanks.