Hello, everyone, and welcome to this talk. The title is "You Know Nothing, Jon Snow: OpenStack troubleshooting from a beginner's perspective". Some of you have probably seen Game of Thrones, so you are familiar with the title of the presentation. Ygritte, in this picture, was a remarkable woman who lived in the far north of the world. Throughout the series she tells Jon, "You know nothing, Jon Snow". Well, several times. Jon is one of the main characters, and Jon knows a lot of things: he is a skilled swordsman, he has a political and strategic mind, and he has charisma and leadership skills. But when Jon is faced with a new situation, meeting new cultures like the wildlings that he does not understand, he looks like a noob. Soon enough, though, his skills and previous experience help him understand and deal with these new cultures.

I find this analogy fitting for people starting to work with OpenStack. You have probably worked as a sysadmin and done your share of troubleshooting Linux services; you can certainly read a log file. But when you start working with OpenStack, it feels overwhelming. It is a multitude of projects, and before you gain some experience it is hard to find what is wrong and where. This is a session about troubleshooting, and I will share some of my personal experiences and advice. I hope it helps.

My name is Elena and I work for City Network. City Network is a company based in Sweden with approximately 50 employees. We are a cloud service provider, and our competitors are AWS, Google Cloud Platform and Microsoft Azure, amongst others. It's crazy to think about it this way, right? I'm a troubleshooter, and I worked in the front lines for many years, interacting with demanding customers and fixing the problems they reported. Demanding means, for example, operators from Japan that expect details, and details of details, to accompany any technical solution, or American telecom operators keen on overly detailed written procedures with a lengthy approval chain before they apply any change to their live systems.

Now, there are many problem-solving strategies and methods, and I wish I could stand in front of you and say that I found the three-step approach that solves any OpenStack problem you will ever run into, as much as I wish I could say I found the one-page solution to Fermat's last theorem. You know it, right? It says that the equation a^n + b^n = c^n has no positive integer solutions for any integer n greater than two. There is no silver bullet when solving a problem in an OpenStack cloud. And it took over 300 years, more than 100 pages, and a remarkable person like Andrew Wiles to come up with the proof for that theorem.

In this talk I will present three problems and the steps I followed to solve them: define the problem, establish a few hypotheses about the possible cause, test these hypotheses, apply the solution, and document the solution.

Now, let's look at some problems. The first problem: a customer sent a ticket saying they cannot connect to port 1521 on VM B from VM A, but ping between the two hosts is okay. They also say the problem must lie with OpenStack Neutron, and they need help to figure out what it is.

Step one, define the problem. My first advice is: do not fall for the confirmation bias. This means do not jump at asking for Neutron logs just because someone said the problem must be in OpenStack Neutron. Ask for logs and command outputs to confirm and define the problem. Ask for a telnet to port 1521 on VM B, both from VM A and locally on VM B itself; if the service is not running, we would not want to miss that.
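In a ticket, that request could look roughly like this. The host name vmb is a placeholder, port 1521 comes from the ticket, and the /dev/tcp trick assumes the shell is bash:

```
# From VM A: does anything answer on VM B's port 1521?
telnet vmb 1521

# On VM B itself: is the service listening locally at all?
telnet localhost 1521

# No telnet or curl installed? bash can open TCP connections on its
# own (hangs if the port is open, errors out if it is closed):
cat < /dev/tcp/vmb/1521

# Plain reachability, then a large non-fragmented ping at a short
# interval: 1472 bytes of payload plus 28 bytes of headers is 1500,
# so this probes whether the path MTU is the standard 1500 bytes.
ping -c 5 vmb
ping -c 5 -M do -s 1472 -i 0.2 vmb
```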
You can use host names instead of IPs if needed. Many times telnet or curl are not installed on a node or a VM, and you are not allowed to install them either, so you can test whether a port is open using cat and /dev/tcp, as in the third command above. Does ping work all the time? Ask for the output of a ping command. Try also a ping with a bigger packet size, fragmentation prohibited, and a shorter interval between packets; that last ping is the one that helps you find the path MTU, in case this is an MTU-size problem.

And this is what I got back. Do you see what the problem is? The service is not running. As Rossella Sblendido says in her talk on troubleshooting Neutron: when someone says "I cannot ping a VM", check if the VM is up. You should really check out her talk on Neutron troubleshooting. Surprisingly, the service that should have been listening on 1521 was simply not up. So: start the service, and the problem is solved.

This was an example of solving a problem at step one, where we try to define the problem. However, don't skip the "document the solution" step. You might think there is not much to document here, but don't close the ticket with a plain "problem solved" before making sure the communication with the customer is recorded in the ticket. The questions you asked and the replies you received should be in there; someone bumping into this ticket later on might, as a minimum, learn something about asking questions in a clear manner.

The lesson learned here: try to understand what the problem is by asking short and clear questions. Note that if there is a time-zone difference between you and the customer, it might take a day until you get a reply, so if a question is not formulated clearly, you will lose an additional day clarifying what you asked for in the first place. Don't ask "send me the security groups" (which might have been the problem in this case); ask instead "send me the output of this command". Customers use Heat templates to deploy many resources in one go, so with this command you need the stack ID only. Then you can filter on security groups, and you can gracefully pipe the output through some xargs magic to get the security group rules. And you will get back this. So now let's check whether we have a rule for port 1521: we have it, and it has been there all the time. Yep, moving on.

A second problem now. A customer sent a ticket saying that they rebooted the first of their three OpenStack controllers and lost access to it. In this case the controller was a VM, not a bare-metal controller. So it's time to ask some questions. Don't ask "what did you do? what did you change?"; from my personal experience, the interlocutor switches into defence mode, and your question reads as "what did you change to break the controller, you incompetent person?". Ask instead "how do you access the controller?". I also try to stay away from yes/no questions, because those are the easy way out, so I don't ask "you mean you cannot SSH?". If the customer cannot SSH to the controller, ask for the console logs, or ask them to connect to the VM using the virsh console and send you the output.
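If the controller VM itself runs in an OpenStack cloud, the console log is one command away; otherwise you go through libvirt on the host. The server name and libvirt domain name here are placeholders:

```
# Fetch the VM's console output through the OpenStack API:
openstack console log show controller-1

# Or, on the host running the VM, attach to its serial console
# directly via libvirt:
virsh console instance-00000a1b
```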
Step two, establish a few hypotheses. The console log showed these messages, and it seems the VM hangs on boot, either because some disk drive is no longer accessible (it is no longer attached to the controller VM) or because there is some misconfiguration in /etc/fstab.

Step three, test these hypotheses. Boot in single-user mode and check the /etc/fstab file.

Step four, apply the solution. Remove the offending line, if the customer agrees, and reboot the VM. If not, you will have to fix the misconfiguration itself. Some months ago, someone had added a line to fstab to mount a shared device and back up some files to a remote destination which no longer exists, and that causes the VM to hang on boot.

Then document the solution: write it in short sentences in the ticket. You can add command outputs and log snippets, and you should create a wiki page too, because someone else might run into this problem later on, and it is helpful to have a well-written solution on how to recover a controller, or any VM, that won't boot because something is wrong in fstab.

Now let's look at a third problem, and this is the recurring type. A customer sent a ticket saying that they observe slow access to block devices hosted on the backend storage for some of their VMs. On other VMs they cannot access the block devices at all. The rest of the VMs are working fine, no problem with them.

Step one, define the problem. Start with the VMs that cannot access the block devices at all. How are they spread across the computes? Here is an example of how to filter for the computes where the VMs are running (a sketch of the commands appears at the end of this example), assuming all the VMs belong to the same Heat stack: list the stack resources, filter on the Nova server type, pipe the output to xargs, and extract the hypervisor hostname to get the computes' names. Alternatively, you can go through a list of all the VMs and extract the hypervisor hostname from each, given that you have the right privileges. Then ask for kernel logs, dmesg and syslog, from the computes hosting the VMs that cannot access their block devices. These logs might show why the devices cannot be accessed, or perhaps some other errors.

Step two, establish a few hypotheses about the possible cause. Here's an idea: if a VM cannot access its block devices on the storage backend, it might be a networking problem. So look for log entries concerning the storage network interfaces. The dmesg log shows several entries like this one.

Step three, test these hypotheses. The storage interface seems to be down, so try to bring it up. It worked, and the VM can access its block device again.

Step four, apply the solution. Apply the same fix on the rest of the computes with problematic interfaces. And on the computes where we have slow access to storage, you can try to reset the interfaces that show errors in the logs.

Step five, document the solution. Write a wiki article where you detail the symptoms of the problem, and by all means add the logs with the relevant entries, and the fix. Write short sentences and add command outputs and logs. I personally like the Red Hat Knowledge Base at access.redhat.com; well, when they don't hide the solutions.

So we came all the way to step five. However, the problem is not solved. A few days later the customer sent a new ticket saying they see the same issues happening again: some VMs face slow access to block devices and some cannot access the devices at all. We never found the cause of this problem, so we merely have a workaround: if all I do is deactivate and reactivate the problematic network interfaces, it is only a matter of time until the problem shows up again.
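For reference, a rough sketch of the two command sequences from this example. The stack name, interface name, and nested depth are placeholders, and seeing the hypervisor hostname requires admin privileges:

```
# Map each Nova server in the customer's Heat stack to its compute:
openstack stack resource list my-stack --nested-depth 5 \
    --filter type=OS::Nova::Server -f value -c physical_resource_id \
  | xargs -n1 openstack server show -f value \
      -c name -c OS-EXT-SRV-ATTR:hypervisor_hostname

# The workaround (not a fix): bounce the storage interface on an
# affected compute, then watch the kernel log for the error returning:
ip link set eth2 down && ip link set eth2 up
dmesg -T | tail -n 20
```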
I do not want to get another ticket from this customer, do I? So, when you start working with OpenStack, I can highly recommend that you find yourself an expert. Everyone knows who they are: the person with some years of experience who can help with almost any technical problem. It is also the person that everybody wants a piece of, and in most cases it is the person who writes excellent documentation. Ask for help if you get stuck, but prepare your questions thoroughly. Make sure you can coherently define the problem, and have the logs ready in case someone needs to check them. By doing this you don't waste someone else's time; in my opinion, it's a matter of showing respect.

To further investigate the problem described earlier, you might need to log into the customer's production platform. You can of course keep asking questions instead, but it will take longer to arrive at a solution.

So we go back to step one, define the problem. The problem seems limited to some computes only, because it happens on 12 computes out of 168. Let's look at the network interfaces on these computes; they look like this. We have two interfaces for the control network, eth0 and eth1; two interfaces for the storage network, eth2 and eth4; and two interfaces for the traffic network. The storage network interfaces use the kernel driver, and you can see their device names in this output. The traffic network interfaces, the ones without a device name, use the DPDK driver: a technology that moves packet processing out of the operating system kernel into processes running in user space, which you want for higher performance. The last four interfaces share the same bus; look at the PCI addresses, 0000:83:00.something. The network card looks like this, or approximately like this. So we have network interfaces controlled by the kernel driver and network interfaces controlled by the DPDK driver on the same physical card; I'll show a sketch of how to inspect this in a moment.

We move on to step two now: establish a few hypotheses about the possible cause. The problem showed up on the computes that have this particular network interface card installed, so we assume it is related to these NICs.

Step three, test your hypotheses. We ran iperf on the NICs whose ports are controlled by multiple drivers, and on the ports controlled by the kernel driver we observed first packet drops and lower TCP throughput, and then complete traffic loss due to a TX hang.

Step four, apply the solution. We raised a bug report with the manufacturer of the network card, and later they provided a patch that solved our problem. Well, it was not just one patch, it was a few patches. It's time to go back to the wiki page and update it: specify that restarting the interfaces is only a workaround, and that to actually solve the problem you need to load the newer driver provided by the network card manufacturer. And the nicest thing you can do is share this problem with the world: write it on your blog, tweet about it, or make a video and upload it to YouTube.
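As promised, a minimal sketch of how one might inspect such a mixed setup and reproduce the test. The interface names, host name, and tool paths are placeholders from this example rather than exact commands from our platform, and the iperf3 variant is assumed:

```
# Which driver and PCI address backs each kernel-visible interface?
for nic in eth0 eth1 eth2 eth4; do
    echo "== $nic =="
    ethtool -i "$nic" | grep -E '^(driver|bus-info):'
done

# Ports bound to DPDK disappear from the kernel's view; DPDK's own
# binding tool lists which driver owns which PCI device:
dpdk-devbind.py --status

# Reproduce the throughput problem: run 'iperf3 -s' on one compute,
# then push traffic at it from another for a minute:
iperf3 -c compute-07 -t 60

# While the test runs, watch the kernel-driver port for drops/errors:
ip -s link show eth2
```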
Okay, so I'm running out of time here, and I hope you found something useful in this talk. I'd like to thank a few people before I leave. Special thanks go to Florian, for teaching me excellence every day. Thank you, Costa, for the first interview and for giving me the chance to start working with OpenStack. Thank you, Klas, Angelo, Murti, Hans, and many others; it's great to learn from the best. My slides are available on GitHub.

Okay, thank you for listening, and I'm ready for your questions.