I got this. All right. All right, welcome. Thank you for coming to our session, OpenStack Troubleshooting: So Easy, Even Your Kids Can Do It. So hopefully, when we get to the end, you'll help us validate whether that claim is true or not. So first, let's talk about who we are. I'm John Joswiak. I'm a cloud solutions architect at Red Hat. I've been working with OpenStack since about the Grizzly time frame. I was doing consulting for a few years, and so I'm really used to deploying OpenStack and seeing all the various ways that it breaks. And my name is Vinny Valdez. I've been with Red Hat for almost 10 years now in various roles. I was also in consulting as a consulting architect for a number of years. John and I actually were part of a new team that was started about three years ago to work with customers specifically around OpenStack, to help understand what their needs were, help build out their environments, and then ultimately go and deploy them, as well as train our consultants. So we had a lot of experience in terms of seeing failed OpenStack, not just on the deployment side but on the operational side. I was also part of a 13-person community team, and we wrote the first version of the OpenStack Architecture Design Guide. And so now I'm in engineering. I moved out of consulting. But I did present at the Tokyo summit on a day in the life of an architect. And so hopefully we're going to bring our experiences to you guys and help you go back and implement these in your environments. Now, there have been a number of sessions on troubleshooting at previous summits, and we've done our best not to duplicate that information; hopefully we bring something new to the table. All right, so just in terms of agenda, we're going to talk a bit about troubleshooting approach and some best practices to troubleshoot, and then focus more on how we can automate troubleshooting. Then we're going to look at two very specific use cases. Now, these are very common. If you're at this session, you've probably encountered one of these. So, the kind of generic "no host found" error. What does that mean? How do I find out exactly what that means, rather than just a "no host found"? And then instance connectivity. So specific to Neutron, a Neutron environment. You get the dreaded scenario: you assign a floating IP and you can't connect. You can't ping it, you can't SSH. And I'm sure that there have been different approaches to that. So we're going to look at that very specifically and how we attempt to solve that problem. Okay, and then after that, after the manual walkthrough, we'll show how we demo automating that troubleshooting, that fix. So as we said, we do think that troubleshooting OpenStack is difficult. And we all know that. Now, this is a lofty goal. We don't know if toddlers are going to be troubleshooting OpenStack, but we do want to make it, or we think it should be, easier. So we're going to talk about some approaches and a project that we've started, and we hope that you guys will take that and use it in your environments. All right, so first off, before you even start troubleshooting, don't be reactive. There are some things that you can do in advance to make certain you're more successful. So you shouldn't have to wait for a user to call in. You should have some capability there to alert you of problems in the environment first.
In addition to that, you really want to know what a working system looks like. So if the first time you're hitting a problem is the first time you're deploying OpenStack, it's a little bit more difficult, right? But if you're operating a cloud, you want to know what a working system looks like. You want a reference environment where you can say, well, here's what it looks like in the logs in this environment versus here's what it looks like in a failed environment. And also, know what systems you're troubleshooting. Know what logs you need to dig into. There's a former coworker who's no longer with our company who famously told a very big customer on the phone that OpenStack was just too difficult. There were too many logs. He didn't know where to get started. And he lost a lot of credibility at that point. But, you know, there is some truth to the statement. There are a lot of logs. There are a lot of components. But know where you need to start searching. Don't start panicking and just start dumping logs. Understand what you need to look for. And then also, try not to test a fix in production. If you can reproduce a problem in another environment, you would rather troubleshoot in that other environment before testing your fix in production. If you do have to work directly in production, make certain that if the fix doesn't work, you back it out. Because if you go through and do one fix, two fixes, three fixes, four fixes, eventually you don't know how many things you've changed and you get lost in the environment. The other thing is, if you see problems occurring over and over, you really need to focus on addressing what the source of the problem is. And I've seen customers and other people write ad hoc bash scripts and cron jobs to restart services because they keep failing. That's not really a long-term fix and it's not really scalable. So you really want to understand what is causing these symptoms and how you need to address them. All right, so I talked about being proactive and having something alert you of the problem. If you've got availability monitoring for a service being down, then rather than having to go troubleshoot why something like a Nova boot is failing, if I know nova-compute is down already, I'm already ahead of the problem before I start troubleshooting, before I'm digging in on a system. Other things like RabbitMQ being connected, your API response times, and then just basic functionality. It makes a lot of sense to have an ongoing test that tests basic functionality like a Nova boot. So you have a special monitoring tenant that just does a Nova boot every five minutes, 10 minutes, half hour, an hour, to validate that. That way you're comfortable that that function works. You should never get called, or it should at least limit the amount of times you're called, by a customer saying Nova boot doesn't work. If that happens, you know it's something more specific, because you've already got a working functional test. And then going back to logging, which I mentioned: central logging is extremely important. So look at things like your ELK stacks or Splunk to help bring all the logs together, especially if you have lots of compute nodes or even multiple clusters. You want to have a facility to be able to search for instance IDs or request IDs or specific error messages. And we'll talk a little bit later about how you can use things like Ansible, even just ad hoc commands.
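For example, a one-off ad hoc command to pull every mention of a request ID off all of your nodes might look something like this (the inventory group name and the request ID here are just made-up placeholders):

    # Grep the Nova and Neutron logs on every host in an inventory group called
    # "openstack" for a specific request ID, with privilege escalation:
    ansible openstack -b -m shell \
      -a "grep -r 'req-3f7c2a10-0000-0000-0000-000000000000' /var/log/nova /var/log/neutron"

That's just a one-off command, though.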
Or, as we'll demonstrate, playbooks that will actually go out and bring those together, so you're not having to SSH into multiple systems to figure out what the problems are. And then also, you want to have some sort of performance monitoring. You want to understand what your baseline performance is so that you can detect something that's abnormal. For example, if the CPU on your control plane is at 50% regularly and all of a sudden it's at 80% or 90%, you know there's something going on in that particular host to investigate. And the same thing is true on the compute side. You can start to dig into problems that way. You also want to look for things like RabbitMQ limits. If you hit a limit there, that's somewhat difficult to troubleshoot in the logs and it takes some time, whereas if you've got monitoring that just says "I'm at my limit," it's going to alert you and you're going to know about it right away, versus even having to dig in and troubleshoot. And then something like MariaDB, especially if you're running that in a cluster and you're using Galera for synchronization; you'll typically have HAProxy to load balance across those. We had a very large customer who had some hardware that was very beefy. We're talking sizes that we hadn't seen before and the customer hadn't seen before. And so what ended up happening was we raised the database connection limit and that seemed to be fine, but there was an implicit connection limit at HAProxy that we hadn't thought about. And because the Python clients would spin up as many forks as possible based on the number of CPUs available on that hardware, it was actually dropping a lot of those requests, because the hardware was just too large at that point. And then also things like floating IPs. You want to keep track of how many floating IPs you have. You want to see some trending on when you're going to run out. So instead of a customer coming to you saying they don't have any more floating IPs, you've already addressed that before it becomes a problem. Disk space on your back end is another great example where you can do that proactively versus having to dig into troubleshooting that specific problem. And then you want to follow the idea, I think I mentioned it previously, that you want to have a production-like environment to test all your changes. I consider it like a promote-to-production flow. When I'm testing new functionality in OpenStack, I like to use an all-in-one or a small development deployment to really validate the capability. Then before I push that to production, I want an environment that mirrors production. It doesn't have to have hundreds of compute nodes, but you want it to be the same architecture. So if production is HA, you want to test in an HA environment. You want it to be similar so that you can validate as much as you can before moving to production, so that you're not going to break something by having a different environment. And then something like Rally is very useful to benchmark your systems. That way, if you introduce major changes, or even doing things like benchmarking your hardware and then benchmarking the instances on top of it, you can understand what your virtualization tax would be, so to speak. And then using automation. So that's something we're really going to focus on once we get past these general best practices that we're talking about. But things like Ansible, and then of course Puppet and Chef: automating things using infrastructure as code and using git to store all of your changes.
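And going back to that basic functional test idea for a second, the Nova boot check doesn't have to be anything fancy. A minimal sketch of something you could run from cron under a dedicated monitoring tenant might look like this (the image, flavor, and network names are just placeholders for whatever exists in your cloud):

    #!/bin/bash
    # Minimal periodic smoke test: boot a throwaway instance, wait for ACTIVE, clean up.
    # Assumes the monitoring tenant's credentials have already been sourced from an rc file.
    NAME="smoke-test-$(date +%s)"
    if openstack server create --image cirros --flavor m1.tiny \
           --network private --wait "$NAME" > /dev/null; then
        echo "OK: nova boot works"
    else
        echo "CRITICAL: nova boot failed"    # hook this into your alerting
    fi
    openstack server delete --wait "$NAME" 2>/dev/null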
And then, along with that infrastructure-as-code approach, make your changes as small as possible. Make sure that your commits are as small as possible and that they're related to the same sort of change, rather than big, sweeping commits that are difficult to undo or to really understand where something went wrong. Yeah, and when troubleshooting, don't make multiple changes, right? Make one change. Find out if that's fixed the problem or not. If it hasn't, back it out, then make your next change. Because, like I said, if you go through and you change one, two, three, four different things in an environment, eventually you're not troubleshooting the same problem anymore, because you've created three or four new problems. And then, of course, some of the CLI clients have verbose flags, but the --debug flag is going to be the most useful. And especially when you look at the new unified CLI client for OpenStack, you're able to manipulate the data a lot better, and using --debug you'll be able to see the actual curl request that is being sent to the API. And then lastly would be setting debug = True in the various components. Now, doing this, and we'll add the caution here, I'm sure some of you in here have done that, it will be very, very detailed. So it can fill up your logs, and it can really be hard to pinpoint the information you're looking for. So one of the things that we'll show you a little bit later is that we wrote a playbook to dynamically turn debug on and off across your entire cluster as desired, using an Ansible playbook. All right, so just a troubleshooting approach: obviously you want to narrow down the problem as much as possible up front. If it's a Nova boot issue, obviously you want to look at Nova, and you want to understand the components that it could possibly touch: Nova, your database, your RabbitMQ, obviously Glance, Neutron. Just understand what it could touch and what systems it could touch, because then instead of troubleshooting across a hundred servers, you're troubleshooting across maybe four or five different servers. And instead of every component, you've narrowed it down to a small subset of components. And then you can also start with a basic check, looking at openstack-status and making certain your components are up. This is assuming you don't have monitoring; at least then you can get a basic understanding of the environment quickly, or if it's HA, a pcs status. And then back to logs. We mentioned that a lot, but that is where everything is going to be, especially if you turn on debug. So a lot of times, if you don't have central logging, you may want to start doing a tail against various logs, and once you understand the workflow of a Nova request, for example, you'll understand that it touches a lot of components, Keystone, Glance, Neutron, and Nova, so you have to tail a bunch of logs across a bunch of different services. Yeah, and then if it was working previously, did a configuration change? Sometimes the environment breaks just because somebody's gone in, tested a config, and it's changed. And if you have some tracking of that change, or if you have a tool like Puppet or Chef that you've used for the deployment, then you can at least confirm that that configuration remains the same. And then you can use the install docs, obviously, to help troubleshoot, but I find it much better to have a reference environment.
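For instance, even a crude comparison of a suspect node's config against that reference environment can point you at drift pretty quickly, something along these lines (the hostnames are placeholders, and you'd repeat it for whichever config file you suspect):

    # Diff the effective (non-comment) settings in nova.conf between a known-good
    # reference controller and the production controller you're troubleshooting:
    diff \
      <(ssh ref-controller  "grep -Ev '^\s*(#|$)' /etc/nova/nova.conf" | sort) \
      <(ssh prod-controller "grep -Ev '^\s*(#|$)' /etc/nova/nova.conf" | sort)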
So if you've got a deployment that you know is working, you can look through those config files and compare; since that's working, you know what good is, and you have something to compare against if it breaks. Sometimes the install guides aren't quite right, but if you do find issues with the install guide, submit bugs, submit fixes upstream. Yep, that's the best way to do it. All right, so, tools to aid in troubleshooting, or trouble detection. So Browbeat is a solution that Red Hat's performance engineering team built, really around when they were doing performance troubleshooting and testing and finding bugs. It's Ansible based, and what it can do is find some of the common bugs for deployments that aren't necessarily automated in a Red Hat solution, and at least then you're aware of those, so you're not facing a problem that's already been discovered before. And then Ansible. So we've mentioned that a few times; we like Ansible quite a bit. We're going to talk about what we've done with the playbooks in a little bit, but also ad hoc commands. So just running an Ansible ad hoc command to grab logs across your entire cluster for a specific request ID is a good use case. But one of the things we find with Ansible is not just the usefulness of executing commands, but the playbooks themselves: if you write your tasks with good, useful, descriptive names, they can become a really good reference guide. So maybe you're not always running a playbook, but you can refer back to steps that worked at one point. So we like using Ansible as a built-in documentation source as well. Yeah, and we mentioned Puppet and Chef already. If you have those solutions, anything that does configuration management avoids that config drift problem; you can make certain your configuration is a known working config. And then TripleO. We have a product called Director based on that, but it's a way to manage the OpenStack environments that your end users use via an underlying OpenStack environment itself. So that's where the TripleO name comes from: it's OpenStack on OpenStack. But as part of that, we've written a lot of validation playbooks to help make sure that the environment is ready to have OpenStack installed, checking things like networking and switches and your hardware itself. Yeah, and then Tempest and Rally are useful. Tempest is more for functional validation. I think the problem I find with Tempest is it spits out a report and says, yeah, there are these hundreds of things checked and these 50 or so have failed, and I'm just like, you know, the system seems to be working fine to me. So I find building your own functional validations is more useful, because you know exactly what is running. And Rally is very useful: if you can benchmark or baseline your system and get an idea of how hard you can push it before you deploy into production, then when you start to see performance issues later on, you know whether it's within the range that you expected up front or whether it's gone way beyond that. The other tool here is ManageIQ, the upstream of CloudForms, which is useful because it can do a couple of things with OpenStack. It has capacity and utilization data, so you can see your host or hypervisor and the VMs that are running on it and get an idea of what that performance comparison is. You can also do analysis of config files, so you can capture the configuration of the hosts and do a configuration drift analysis. So if you don't have Puppet or Chef running regularly, at least you can see when things have changed and what could be a potential problem.
The other advantage of ManageIQ and CloudForms is actually connecting to multiple OpenStack environments, as well as other virtualization and cloud environments, to give you this kind of single pane of glass to look across all of them. But lastly, and not really part of automated troubleshooting, when you're running into problems, obviously Google is your friend. You know, Launchpad, you'll see bugs out there, or if you don't, be sure to file those. But a lot of times I've found things in patches that people submitted that I was able to implement in an environment for a customer ahead of the fix being available within packages and so forth. And then of course IRC. So all of these different projects we talked about have IRC channels that you can go out and seek help on. All right, and then, so, what we've built is some playbooks to automate things. It's based on Ansible. We call it OpenStack Detective. And really the advantage is, when you're walking through a complex workflow, it's much easier just to let automation do that, just like it's easier to have automation configure your system. It's a lot easier to have automation walk through this troubleshooting, and it's also faster. For me to walk through, for example, the Neutron troubleshooting that we're going to look at later, it's a lot of manual steps. It's a lot to remember. I have to look at my documentation on how I've checked it before, because I'm not going to remember all that. Whereas if I run a playbook, it's like one line, and it can detect that problem and show you the outputs, so you don't need to memorize all that stuff. And this is where we hope that your kids will be able to execute a playbook. It's a one-liner; they hit enter. But there are some disadvantages to that, because these are very complex. First of all, developing these different playbooks, we call some of them health checks and we have other specific playbooks that we'll talk about, is very time consuming, and you have to understand the underlying workflows first to be able to do that. So if you come in and you just use these playbooks, you use OpenStack Detective and you get some value out of it, that's great, but you may end up looking at it as a bit of a black box. You may depend on it a little bit more than understanding what the underlying troubleshooting methods are. So we're acknowledging that that might be a bit of a disadvantage. And we've also noticed, as we've been testing and creating these different playbooks, that there are some configurations in some of the older versions of OpenStack, like Liberty and Kilo and so forth, that have changed in Newton. So very specifically, things like the Keystone auth variables: for example, instead of referring to admin_user, it's now just username. And so unfortunately a lot of our scripting and playbooks have broken. So that may happen to you guys if you have things that depend on specific versions. So we've had to do a lot of work around detecting versions and doing different things based on what version we're running. Yeah, so I just have this on my personal GitHub right now. Obviously, if there's a lot of interest in it, we would look at moving this into a formal project, but we've just left it there for now. Okay, so John's going to walk through one of the use cases. This is the basic no host found, and he's going to talk about what it takes to do this manually. So this is kind of understanding the workflow and the underlying reasons behind it, and then we'll look at automating it later. Sorry.
Yeah, so just looking in terms of the request flow, this is a very old diagram. I wanted to use OSProfiler to walk through this and refresh it to make certain it's all still valid; you can see Quantum is still listed here. But this is basically what a request flow looks like. It's about 28 steps. There's a lot of potential pieces that could break in there. Realistically, there's probably a handful less that you'd actually have to look at. I've got the source on the slide here if you want to dig into that. But basically, just know, as I mentioned up front, what's involved in a process. You can see all the components that are involved in the process, so you have an idea that if my boot failed, it could be Nova, it could be Glance, it could be Neutron, it could be Cinder. There's a handful of things there that you'd really need to look at and potentially troubleshoot. So, no host found and the Nova scheduler itself. The way Nova does scheduling is it has a number of filters. And so if you start out with 10 hypervisors, it runs through each of the filters you have defined in your environment, and at the end, what's left is its list of available hosts that could provision that instance. And then it will choose from one of those hosts. So there is good documentation online for all the filters. It basically explains all the possibilities. And you don't have to use all the filters. You can disable filters if you don't want a particular filter to run. But just looking at the list I have here: the retry filter is basically saying if a hypervisor's failed, I'm not going to retry that hypervisor, I'm going to rule it out. The availability zone filter ensures your hypervisors are in your availability zone. A RAM filter to make certain that you have RAM. A compute filter to make certain the compute node's actually enabled and active. And so on down the line. I'm not sure what you said there, but that's awesome. So basically, for the test, we just picked an instance that was too big for our environment to boot, to easily show a failure, right? One of the things that would be really cool to do with Ansible is to build a lot of failure scenarios and walk through and use that for training: make a number of ways to break a system and execute them, maybe at random, and then have to go in and fix the problem or analyze the problem based on that. We didn't have that much time, but it would be a really neat thing to do. So first thing: when you do a Nova boot, you don't actually see an error. It just says BUILD; that's what's responded back to you. If you do a nova list, then you'll see whether it's been successful or not and what status it's in. So in a nova list, for example, if no host is found, it'll come back pretty quickly and change to a state of error. Interestingly, if there are no hypervisors available and you do a Nova boot, it'll just hang in build/scheduling for quite a long time. It won't actually come back with an error at all. So if you ever see build/scheduling hanging out there, it probably hasn't found a hypervisor. So do a nova show on the instance and you can get some detail back about the instance. It shows a small blurb about the error itself, in this case something like what's here from nova-conductor, right? It shows "No valid host was found. There are not enough available hosts." And what that means is it's gone through all of its filters and it's returned zero hosts. But it doesn't show you an easy answer just from nova show. In the logs, of course, the data's there.
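Just to recap the commands up to this point before we dig into the logs (the instance, image, and flavor names here are made up for the example):

    # Boot something deliberately too big, then watch it go to ERROR:
    openstack server create --image cirros --flavor m1.xlarge --network private too-big
    openstack server list            # the new instance flips to ERROR fairly quickly
    openstack server show too-big    # the fault field shows "No valid host was found..."
    # The unified client's --debug flag will also show you the underlying REST calls:
    openstack --debug server show too-big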
So if you look in the nova-scheduler log, you'll find a more detailed answer. In this case, the RAM filter returned zero hosts. And so you know at that point which filter has failed. So that's where you'd have to start your troubleshooting. If it was the availability zone filter, you probably have no hypervisors specific to that availability zone. In this case, we know we just don't have the right amount of RAM. And so to troubleshoot this, of course, for a core filter looking at the number of cores available, or a RAM filter, you just look at the hypervisor, or look at the hypervisors themselves. So hypervisor stats can show you the amount of overall capacity for your environment. But I might have 64 gig of RAM available overall, and if I only have one gig free per hypervisor, that's not going to be worthwhile. So this openstack hypervisor list is the cleanest way I could see to just output the total and what's used for each individual hypervisor. And then of course there are overcommit capabilities. So in general, if you're not changing any of the configuration, CPU allocation is 16 times overcommitted. I wouldn't suggest running there, because things aren't going to perform very well, but if you look at 16 to one, it means that for every CPU you have, you can run 16 virtual cores, assuming you've got your core filter running. Same thing with RAM at 1.5 to one: it means that you could run one and a half times your memory, keeping in mind that you have reserved memory that's taken as overhead. And so when you're looking at what's available, keep these overcommit ratios in mind as well. So I think I mentioned these filters already: just ensuring you have an operational host, making certain you have an availability zone, and the retry filter saying it's not going to try the same hypervisor that's already failed. And then a couple of more unique ones. The compute capabilities filter is working off of what's actually defined in your flavor. So if you have a special property in the flavor, it's looking for hosts that match that special property. If it doesn't find a match, it's going to fail with no hosts returned. So you know from seeing a failure on that filter what to start to look at. Same thing on the image properties side: if your image properties are looking for a specific property and that's not set on a hypervisor, if there's no hypervisor available that can fit that, that would fail too. So that's just a brief walkthrough of basic no host found analysis. And we're going to move on: Vinny's going to talk about troubleshooting instance connectivity. So again, not very kid friendly, right? But very important background information that we need to know. And so this is going to be a little bit of the same here. So I hope everybody has a coffee or something, because this is going to get a little bit into the weeds first, and then we'll talk about how to make it easy. But what we're going to do is go through a scenario that probably everybody has been through: we've provisioned an instance, we assign a floating IP, and then I can't ping it, I can't SSH. What do I do? Now, I'm assuming that this is not a provider network environment; this is a tenant network or a project network environment. So you want to go through the normal workflow. Have I assigned a floating IP? Because this is not automatic with Neutron; with nova-network there was an option to make that automatic, and this is not. Is there a router available in the project? Is it pingable? Is it up? Can I connect via the network namespace, and is that functioning properly? And can I ping the instance? Look in the Neutron logs, as we've talked about.
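Boiled down, that first pass of checks is something like this (the server name, router ID, and addresses here are all placeholders):

    # From a client machine with credentials sourced:
    openstack server show myvm -c addresses    # is a floating IP actually attached?
    openstack router list                      # is there a router in the project, and is it up?
    # On the network node:
    ip netns                                   # do the qrouter-/qdhcp- namespaces exist?
    ip netns exec qrouter-<router-id> ping -c 3 10.0.0.5    # can the router namespace reach the instance's fixed IP?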
So once we've validated all that, then you can go to the instance itself, go to the console, log in, and check the networking stack. Is there an IP? If there is, what paths out can you ping, and so forth? So once you have gotten to that point, and let's say, assuming you don't have a working network stack, DHCP was not able to obtain an IP address; that's the scenario we're going to be going through. Now you have to fall back to understanding the packet flow, and that's what we're going to do: we're going to walk through that and look at what it takes to troubleshoot each interface along the path between the compute node and the network host itself. So this is a diagram that's been out for a while. It's on RDO. If anybody's seen this before, there's a very detailed explanation for each of these steps. We're not going to go through that right now; I just want to give you a high-level diagram. But we're basically going to be starting at the very top, where we have these test instances. That's where we're going to assume our instance is. There's a tap interface, which is what A represents there, which connects to a bridge where iptables is implemented, for now; I believe that's changing in the future. That then connects to br-int, which then connects to br-tun, which then connects to the VXLAN overlay network, which then goes over to the networking host and then goes up the stack. And in our example, we're going to be going into the DHCP namespace. We're going to try and troubleshoot that error. And of course, if you're going outside, you're going to be hitting the router instead. But let's walk through what that would take. So, I quickly went through this: we're going to hit the tap device first. We'll hit what's called qbr. Now, the q is a holdover from Quantum; this is the Quantum bridge, where iptables is implemented. Then we have qvb, which is one side of the veth pair that is connected to the bridge. Then the other side of that veth pair is qvo, which is connected to the Open vSwitch side. That then connects into br-int. Then there's patch-tun and patch-int. And so you can see this is pretty complex here. Then it connects to br-tun, which is where the VXLAN overlay encapsulation occurs. We move over to the Neutron network side, and it's essentially roughly the reverse steps. So we start with br-tun. We then move on to patch-int, patch-tun, br-int. And then I have listed qrouter here, but we're actually going to be going into the qdhcp namespace. So, a bit of a nasty workflow. So what we want to do is understand what the general troubleshooting steps are. So we want to connect to the console. As I said, we're going to generate some DHCP traffic that is constant, so that we know we're generating DHCP discovers. If you want to, you could actually boot into gPXE, so you can restart your instance and hit Ctrl+B, which will give you a prompt, and you can type autoboot, which will generate some DHCP traffic outside of your OS, if you think there's an OS type of issue. But in our case, we're going to go in and just do that at the OS level. Next, you want to determine which hypervisor, which compute host, that instance is running on. As we said earlier, you don't want to just randomly start looking at compute nodes; you need to find the one that's having the problem. Then we'll start inspecting the interfaces and go through the different interfaces that I mentioned.
And then we'll move on to the compute, I'm sorry, to the network host, and try and figure out where the problem might be. Okay, so in this example, I've connected to the console. I've started the DHCP client in a continuous mode, and I'm just generating DHCP discovers, and typically I would have gotten a response, but in this case you see I have not. Next, I want to find out where the instance is hosted. So I've shown two different commands here; you can use the old nova list or the openstack server list, and throughout the rest of the slides I try to show both commands where applicable. There are some things that are not always possible in the new OpenStack unified CLI just yet. But in my case, I've found out where that hypervisor is. Next, I want to find out what port that instance is connected to. So this is one command where there isn't an exact equivalent under openstack port list yet. So here I'm doing a neutron port-list and I'm passing the ID of the instance itself, and that's going to return the port ID. So you see, I've got it bolded there, a7e. Well, what happens is those first characters of the port ID, so the first eight, the dash, and then two additional characters, are going to be used throughout the rest of the workflow in naming interfaces. And we'll see that here. If you do an ip a and you grep for tap, qbr, qvb, and qvo, you'll see that they're all whatever the interface type is plus those first characters of the port ID of the instance. And so you should see those on the compute node. And remembering that order of interfaces that get hit, now we want to start tcpdumping those. So I'm just doing a tcpdump, starting with the tap interface, and I'm looking for DHCP traffic, so port 67 and port 68. And this is what I hope to see. Well, if it was working, I would also see a reply, but in this case I'm seeing my discovers. And so I see those correctly. So let's move on to the next interface. Now I'm on qbr, and I see my DHCP traffic, so it seems to be working fine. So let's move on to the qvb interface. I see my traffic there, so we can keep going. Now qvo, continuing to see my traffic, so that's good. And now in a newer version, I see this in Newton, I haven't gone back to see exactly where it was introduced, but there's a vxlan_sys_4789 interface that is also used. So I can tcpdump that as well. And you'll notice that I haven't actually tcpdumped br-int or br-tun, because those can't be monitored directly; you have to set up a mirror and a snoop device if you want to do that. But I could actually, and in this case I'm just going to, tcpdump the actual physical interface. So I still see my traffic, so everything seems to be good so far. Lastly, I can use ovs-ofctl, which will dump my flow tables. And the reason I do that is, if you look at the packets over there towards the right, if you run this with watch, like I have suggested here, those packet counts should be increasing. So you should see that there are packets flowing through this Neutron workflow. So now I'm going to move over to the network host, and I'm going to start repeating the same commands, starting with dumping my flows. I should see my packets increasing. I can move up to the physical interface on the Neutron side; I still see my traffic. Now I'm going to move up to the VXLAN interface. Traffic's still there. So at this point, everything seems to be connected properly. So what I want to do is move into the DHCP namespace.
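Before we do, just to condense what we traced on the compute node and the network node down into actual commands, it's roughly this (a7e4c2f1-9b stands in for the first characters of your port ID, and eth1 for whichever NIC carries the tunnel traffic):

    # Compute node: follow the DHCP discovers interface by interface
    tcpdump -ne -i tapa7e4c2f1-9b port 67 or port 68
    tcpdump -ne -i qbra7e4c2f1-9b port 67 or port 68
    tcpdump -ne -i qvba7e4c2f1-9b port 67 or port 68
    tcpdump -ne -i qvoa7e4c2f1-9b port 67 or port 68
    tcpdump -ne -i eth1 port 67 or port 68     # physical interface (or vxlan_sys_4789 on newer versions)
    watch 'ovs-ofctl dump-flows br-tun'        # packet counters should keep increasing
    # Network node: the same idea in roughly reverse order
    watch 'ovs-ofctl dump-flows br-tun'
    tcpdump -ne -i eth1 port 67 or port 68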
So if you do an ip netns, you'll see a qdhcp namespace, or you should. And usually in a larger environment you'll have multiple; just for brevity, I've only listed the one here. But don't just take it for granted that it's the right one. So you want to make sure that it is the DHCP namespace for the network that your instance is on. So there are a few ways to do that. Here I'm showing a port list, and I'm listing the DHCP port by the --device-owner parameter. And that should match up. I can then do a show on that port, and I should see that the network ID, and that's the column I've called out there, should match what's listed in that qdhcp namespace name. So you see it's b89, the same as what we see at the bottom. You could also just do a neutron net-list, and it should be the one that's assigned to that project. Just a quick time check. We have seven minutes. Thanks. Okay, so now I want to take a look at that namespace. So I'm going to do an ip netns exec against that namespace. And you'll see that I omitted the lo, the loopback stuff, but I have a tap device here, starting with 7063, which should match the port ID of that DHCP port. And that's what I'm doing here at the bottom: neutron port-list, bless you. And we see that everything matches up. And you can see the IP address at the bottom, 192.168.200.3, matches what's inside that namespace. So everything seems to be working properly. So, we can now tcpdump inside that network namespace. You'll notice I'm using the -l parameter to tcpdump. If you don't do that within the namespace, it buffers the output and you won't actually see anything until you stop tcpdump, but this will let me see it live. So I'm in the namespace, I'm in the DHCP namespace, and I see my requests, my DHCP discovers. And so we know that everything seems to be plumbed properly. So now, this is a use case example, and we probably should have looked at this first, but if we look at a neutron agent-list, we see that the DHCP agent is actually down. So I can use ps, and I see that it's not running. I'm restarting it with openstack-service. Now it's running, and I should get my IP. So, not so kid-friendly, right? Yeah, so we're going to show how we've automated this. So from a demo perspective, we're going to show the instance connectivity steps we've walked through. We're going to show a more general health check. We're going to show the fix for that instance issue, and then walk through my no host example. We should have a... Okay, so this is going to be walking through what I mentioned. I have a couple of instances. I'm going to grab the floating IP and we can try and connect to it. So I'll start off with a simple ping. Now, I know this network works. I know that everything else should work in this environment, but I'm not getting connectivity. So now I'm going to drop back to the manual steps that you saw me implement. So first I want to grab the ID of the instance; we're going to use that in the Ansible playbook. Then I'm going to connect to the console, and I just want to make sure that things are as I think they are. So I want to log in and look at the networking stack. Now, this is a little bit more difficult to automate, so this is kind of just, we don't need to do this part, this is just for demonstration purposes. But once I become root, and I'll talk about why I had to do that in a little bit, I see I don't have an IP address. So that's a problem. So I can try and bring the interface up, and it's going to send out some discovers. There's not going to be a response.
Now, I know that this will time out after a while. So what I'll do in this case is cancel this, and I'm going to bring up the DHCP client manually and tell it to infinitely send out discovers. And that will guarantee that I'm generating the traffic I need, and then I can go back and trace the different interfaces. Now, instead of walking you through everything that I went through in those 20 slides or so, we're going to execute one of the playbooks that we provide with OpenStack Detective. So I've already created my hosts file, which is just my compute nodes and my controller. So I'm going to call this playbook with one parameter, which is the instance ID. Since I copied that, I already have it in my buffer, and I'm running it with -v just to get a little bit more information. And since this takes a while, I'm going to fast forward, but I did want to show you, if you start looking, you see some of the tcpdump output, and we start with the compute node. We fast forward a little bit. Now we're on the network node. We're looking in the namespace here, right here, and then we should see the tcpdump in the namespace. There we go, and it's done. And as part of that, it produces this report. So we can see all of the packets for the compute node. We can scroll down, and then there should be all the packets for the network host. So this is just one example: that entire workflow that I went through is now completely available to be fully automated for your environment. Yeah, now the health check is a little bit different than that. So instead of troubleshooting a specific issue, it's just taking a look at the entire environment and saying, is the environment healthy? So we're looking for things like: are the MariaDB databases up? Can they be connected to? Are things like the Keystone endpoints set up correctly? Are the user and password set up properly within those? And any problems there, services as well, those should be called out. So we have an issue with Glance here, but you see right away it can call out that the DHCP agent's not available, it's not running. So that gives an option to be able to troubleshoot or quickly review an environment for problems, maybe rather than having to go into the troubleshooting at all. And then here I'm just going to connect to the controller, which happens to also be the network host, and just correct that issue. So we should see that the agent is down, as I mentioned. So we're just going to bounce that, verify that it's up, and then now we switch over to the instance, and it instantly got an IP. Well, it got a response. We're using CirrOS at this point, so I'm not exactly sure why it didn't work as intended, but bringing the interface down and then back up solves that issue. And now I should be able to connect to it, ping it, SSH, and so forth. And so that's just another way, as we mentioned, to automate some of these standard workflows. Of course, you can modify these; we welcome any contributions and additional playbooks and so forth. Now we'll go through one more. This is going to cover the no host found, and John's going to go through that as soon as this is done. Cubs win? They might. And that's just verifying that the address matches what we saw, so we've SSH'd in. And then the no host found is very similar in approach. We've got a playbook that's going to check specifically for that, so we've gone through and done a boot of an instance that doesn't fit on our hypervisors. We do the nova list and we see that it's in an error state with a power state of NOSTATE.
So we can go in and do the nova show specifically for that, just to see the same errors as we've shown previously, to give an idea of what the high-level problem is. If we scroll up, we see "No valid host was found," but no data as to why we got that. So rather than digging through all the logs, we can run a playbook to do that analysis for us. And it's just a Nova trace-logs playbook, and what this is going to do is go to all the hosts, pull all the log data specific to that instance ID, and then report that back to you, so you can see, from top to bottom, or chronologically, all of the instance actions that have taken place. So it's just executing right now. And if we go look at the instance log. It's going to produce a couple of logs. That's right, yep. So the first log is just that sorted log showing, top to bottom, here's everything in the logs for that specific instance. So you could look through here and find the error itself, but we've also done some analysis against that. Specific to Nova, it's found the filter failure and can give you a much clearer answer right there: in this case, it's the disk filter that failed, because we didn't have enough disk space. So it's not solving all the problems, and there's a lot of capability that we need to build here, but certainly if we built that up, it would give us that capability. So instead of having to be an expert at OpenStack, the community finds these issues, and if we build a playbook when we find them, eventually troubleshooting is pretty easy, because everybody's seen a lot of this stuff already. So that's everything we had to present. We just want to remind you, if you have access to a time machine, we have several presentations Tuesday and Wednesday. But later today there are some additional sessions. So thank you very much. Any questions? Oh yes, there's a microphone right there, feel free. This one's your water. Are the slides available online? They will be; traditionally they're always available at least within a week or two. I don't know exactly when that would be, but yes. How do we get to that information? Do you have some page set up where you put the link? We should have provided our information, huh? Yeah. I will tweet out where the slides are. So I go by at Vinny Valdez, V-I-N-N-Y. I'll tweet out to everybody about this session. But I believe they should be available via the app or some method. Your name is on the announcement. Yes, Vinny, like my cousin. Hi, first of all, thank you for the presentation. And I just wanted to share: you skipped the OVS tcpdumping, right? And recently I found a script, a Python script, which is called ovs-tcpdump, and it makes it very easy to just use tcpdump on OVS ports. Very good, thank you very much. That would have saved lots of time. Okay, well, if there's nothing else, thank you very much, we appreciate it. Thank you. Enjoy the rest of the summit.