Hello, everyone. Thanks for coming. My name is Mars. Dr. Naleev. I work for Nokia as a system engineer. And I'm Mark. I work for a cronies and I'm here as a freelancer.

Our talk is about OpenStack troubleshooting. This is a beginner's guide. I think this is important to distinguish: we are targeting people who have just started with OpenStack and have, let's say, six months or less than a year of experience working with it.

We are going to share our slides. You can see a link here, so you can download the slides during or after the presentation at any time and try every scenario yourself. Again, this is a beginners' presentation. We will try to share some common steps which you can use for troubleshooting a lot of common OpenStack components, because in many cases they all share the same principles. We will try to share some useful commands which we found during our experience operating OpenStack clouds, and we will also try to cover how to get help. That's the most important part, I guess: where and how.

For these slides we used a DevStack virtual machine, which you can download yourself at this link. It comes from the Upstream Institute; it's just a DevStack. I'm not sure if you've heard about the Upstream Institute, but it's a great initiative from the OpenStack Foundation where you can learn how to contribute upstream, how to contribute your code or documentation to OpenStack itself. At every OpenStack Summit, or Open Infrastructure Summit now, we have this one or one-and-a-half-day exercise where you can learn how to use Git, what kind of commit messages you should use, all that stuff which developers like. And if you're not a developer, you can be an operator or whatever, you can still contribute code upstream.
So please feel free to join the Upstream Institute training. We are reusing the same virtual machine for our slides and for our exercises. So if you're going to download it, make sure that you have enough RAM, enough storage and all that stuff.

So the question arises: OpenStack is quite an old project now, so why do we need a troubleshooting guide at all? Why are you people sitting here? Because, as we know, software is broken. Unfortunately, complexity, especially in the OpenStack world, increases the room for errors. It's a globally distributed project with more than 20 million lines of code. Actually, it's probably more than 30 or 40 million lines of code right now. Just in the last year there were 65,000 commits, and it consists of more than 60 projects. The platform itself is deployed on hundreds or maybe even thousands of servers in your DC, which means there is horizontal complexity, and there are multiple layers of software here, which increases the vertical complexity. I usually like to talk a little bit about mesh complexity as well, meaning that all these services communicate with each other not necessarily in a layered fashion, but also one-to-one, peer-to-peer, which introduces this so-called mesh complexity. And of course we are not just deploying OpenStack, we also want to run it and want it to be highly available, which introduces the so-called temporal complexity. So that's why we have to troubleshoot, because this complexity will for sure introduce errors at some point.

I have a basic troubleshooting recipe here: just go read the operations guide, apply the knowledge and you're done. Bye. Of course, this is not how it works. When the problem arises, it's already too late, so I want you to prepare for it.
Know your system so that you can locate the failure. Understand all the layers an OpenStack installation consists of. Learn the tools that can help you in troubleshooting, even if it's just, you know, searching for debug messages in a log file. And do not be afraid to reach out for help. As you could hear during the keynotes, it's a very open community, and even if you ask a question which is not necessarily the smartest, no one will frown upon you. You can gather all this information from the community easily, but when you go there, prepare yourself by doing all the steps beforehand.

The best approach to troubleshooting is to avoid trouble. As Mars already told you, this presentation is for beginners. Someone will look at it and say: it's easy, I do not want to grep logs because I have my monitoring and logging system; the other thing, I have blue-green deployments and it just works, and so on and so forth. I understand that this is the perfect-world scenario, and I also understand that it just doesn't exist for everyone. That's why we do not address troubleshooting in that kind of scenario. That's why we have this DevStack VM, which you can download from the provided link and actually try out all these troubleshooting steps yourself.

So the quick question is: what can go wrong during a VM instantiation? The short answer is: everything. There are so many ways things can go wrong that we will not be able to cover all of them. We will not be able to cover even all of the most common ones. So what we decided to do is to show, step by step, a perhaps simplified Nova instance creation flow. It has many steps, and this is just a generic flow.
We will show you some of what can go wrong at each and every step, but you can refer to this diagram whenever you are troubleshooting anything in general. It's just the interaction between OpenStack components, the hypervisor and the client utilities. Again, this is just for reference. The picture might be outdated one day, so you can always get a newer version at docs.openstack.org. The basic thing you have to understand is how the components interact with each other and what they do.

Do you want to go? Yeah. It's a scary picture, right? As you can see, we have 16 steps here. In the presentation slides, what can go wrong at each of the 16 steps is documented, and when you download them you can actually read about it. But in this talk, because of the shortage of time, we only cover a couple of those steps.

So, first thing: when you go to Horizon or to the OpenStack CLI tool and want to create a VM with openstack server create, you face the first problem, which is a so-called user error: the missing value auth-url. Okay, so as a warm-up we already saw this; we just forgot to configure our credentials. After sourcing the openrc we can actually start creating a server with the nano flavor, with the CirrOS image, on the private network. Let's call it step one, because this is the first step.

The error message says "failed to discover available identity versions when contacting", then something about parsing, then "could not find". Wow, what's written there is just very scary, but there is absolutely no room for panic here, because we just need to figure out what happened. As we can see on the left side, the OS_AUTH_URL was not set up.
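The missing-credentials check from this first step can be sketched in a few lines of Python. This is only a minimal illustration, not something shown in the talk: the list of required variables is an assumption based on a typical password-based openrc file, and your environment may need more or fewer of them.

```python
import os

# Variables a password-based openrc usually exports. This list is an
# assumption for illustration; adjust it to match your own openrc file.
REQUIRED_VARS = ("OS_AUTH_URL", "OS_USERNAME", "OS_PASSWORD", "OS_PROJECT_NAME")


def missing_credentials(environ=None):
    """Return the required OS_* variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Not set (did you source your openrc?):", ", ".join(missing))
    else:
        print("Basic credentials are present.")
```

Running it in the same shell where the CLI fails shows quickly whether the problem is on the client side before anything on the cloud side is suspected.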
You can see what the problem was and, on the right side, how it should be fixed. If you specify the OS_AUTH_URL, then the OpenStack CLI tool or Horizon will know where to go to authenticate and authorize your user.

Then it turns out that myopenstack.com, where my OpenStack is running, just cannot be resolved. As you can see, I used dig and nslookup, two very easy DNS checkers on your Linux distribution, and the answer is "server can't find myopenstack.com". We need to fix that. That means we still have a user error, some error on the client side. After fixing it we can actually resolve myopenstack.com and try it out; the address is going to be 10.15. If we telnet to 10.15 on port 80, where Horizon is running, we can see that it times out. Then we need to fix that; probably there is some firewall error, and so on and so forth. These are obviously very specific to your environment. I just wanted to tell you that just because the OpenStack CLI tool says something, that doesn't necessarily mean that OpenStack itself is broken. Here the only problem was that our laptop could not connect to OpenStack.

Then you go to the operator side, where you actually administer your OpenStack installation, and it's still not working, maybe because Apache is not even running. Nowadays all the Linux systems we are using ship with systemd, and this is how you check the status of the Apache service with systemd. Again, as you can see on the left side, it turns out that Apache was dead; you just restart it and then it's going to run. And then when we query whether the Keystone uWSGI endpoint is up, it says "no site matches". That means that it's just not running, so we need to make sure that it's enabled.
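The two client-side checks described here, name resolution and a TCP connection to the web port, can be reproduced without dig or telnet. The following is a minimal sketch using only the Python standard library; the function names are made up for illustration, and the host and port you pass in are whatever your own cloud uses.

```python
import socket


def resolve(hostname):
    """Return the sorted IP addresses for hostname, or [] if the lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); the
    # address string is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

A typical use is to call resolve() on the hostname from OS_AUTH_URL first and, if it returns addresses, call port_open() on the first one. An empty list points at DNS, a False result points at routing, a firewall or a stopped web server.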
You can see on the slide how to do that. We are still on the very first slide.

What if we do get scared, what if we are panicking? It seems that we can now reach the server and everything is running. Sometimes it's good to use --debug with the OpenStack CLI tool, because then we get the request ID, and with that request ID we can trace the call chain back through all the microservices involved in instance creation. On the client side we can just look up the request ID, and on the server side we can check the logs: we can grep the log for that actual request ID. If you just go to journalctl and list the log of the Keystone service, there are going to be a lot of messages: info, warning and error messages, plus debug messages if you enabled debug, and it's impossible to pick out by eye what you are looking for. With the request ID that you gathered on the client side, you can easily search for all the messages that were logged during this request. And as we can see, there is this warning which tells me that the authorization failed; most probably I mistyped my password or something.

So that was only one step out of the 16, right? Each of the steps can involve many different problems, but the basic principle is the same. Source your openrc. Check, for example, your endpoint list, where you can see what the endpoint for the compute service is. Then we start to create the OpenStack server and there is a problem: unknown error, HTTP 503. As you can see, this is not really informative. Then you look up 503 on an HTTP status code reference, and it will just tell you that it's some server-side error. So we still have no idea. What are we going to do first?
Print the debug log, and there is going to be much more information. Right now we are not even interested in the request ID, because in the debug log we can see that the connection is dropped when we go to that endpoint. So on the user side you can check whether we can reach the compute node at 10.15, and it looks like it's not even reachable; we cannot ping it. So first you fix that. If you can ping it, then you're fine, and you try the command again. Does it work? Maybe it does not. So go to the operator side, the administrator side, log into your boxes and check whether 10.15/compute/v2.1, which is the endpoint for Nova, works.

What we can see here is some HTML page. It just tells us that the requested URL /compute was not found. Why is that? All the OpenStack microservices serve their API endpoint via Apache, through the uWSGI wrapper, and Apache tells us that the requested URL just cannot be found. Why not? Most probably, for some reason the Nova API, the compute API, was disabled. So what you can do is re-enable that site with a2ensite, the Apache site enable tool, and then the same curl command will tell you that it works. Well, obviously you need to authenticate.

The next step on this flow chart: Nova API queries Keystone for authentication and authorization of the incoming request. So what happens here is that Nova API tries to communicate with Keystone in the background. What happens if it can't? Then we have an unexpected API error.
Okay, so at least we have some information here about where to report this. But actually we can feel that there is more to this; it's not just this API error, something has to have happened in the background. Again, you go to the operator side and start debugging: get the request ID from the client side first, then grep for that request, and then grep for what error happened. As we can see, it's quite a long message, but I put all the more important information here: there was a discovery failure, which says that we could not find the identity endpoint, and it also tells us that the auth URI may not be correct. If you check the configuration file of Nova, there are a lot of configuration options for how Nova communicates with Keystone.

The next step in the Nova instance creation flow: the API checks for conflicts within the database. So after we have resolved the problem that Nova API could not connect to Keystone, for whatever reason, it might be the case that it cannot connect to the database. You kind of get the gist already, right? Most probably there is some connectivity error in between, or a database user or password error, and so on and so forth. It's again the same principle: you search for error or warning messages. You can see that there was a DB connection error: we could not connect to MySQL on the given server. And then it even turns out that the problem was the password: access was denied, most probably because the user name or the password was wrong. So again, what you can do is check the configuration file of Nova, and obviously, since this is a Nova database error, you are most probably interested in the database section and the API database configuration option.

As a last step in this presentation, I will show you that it's not just the database: the message queue can also have a problem. This time it's going to take a very, very long time until it times out with unknown error 503. So again, we are not really happy with this debugging information.
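The recurring recipe in these steps (take the request ID from the client's debug output, then search the server logs for that ID together with errors and warnings) can be sketched as a small filter. This is an illustrative sketch, not a tool from the talk: the function names are made up, and it assumes request IDs have the usual "req-" plus UUID form and that log lines carry the level name as plain text.

```python
import re

# Assumed shape of a request ID: "req-" followed by a lowercase UUID.
REQUEST_ID = re.compile(r"req-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def find_request_ids(client_output):
    """Return the distinct request IDs found in client debug output, in order."""
    seen = []
    for match in REQUEST_ID.finditer(client_output):
        if match.group(0) not in seen:
            seen.append(match.group(0))
    return seen


def lines_for_request(log_lines, request_id, levels=("ERROR", "WARNING")):
    """Keep only log lines that mention request_id and one of the given levels."""
    return [
        line
        for line in log_lines
        if request_id in line and any(level in line for level in levels)
    ]
```

Feeding it the saved output of a failed client call and the lines of a service log narrows thousands of messages down to the few that belong to the failing request, which is the same effect as chaining two grep commands.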
So again, you go to the server and grep for errors and warnings, and it will tell you that the AMQP server on that address is unreachable. So most probably you just check with systemd whether RabbitMQ is up. If it's up, then you review your configuration options, or even go to the RabbitMQ troubleshooting guide, and so on and so forth. I think we will skip a couple of slides now; these are slides you can have a look at at home. We will go on to step nine of the instance creation flow with Mars.

Yes. So one of the most common errors, which I guess everyone has seen who has tried to run virtual machines in OpenStack, is "No valid host was found", which is not a very helpful message, because you don't really know why it couldn't find a valid host. There can of course be multiple reasons why one couldn't be found, but for these slides we will just show you a simple example of what could be wrong. I'm not sure if we have the flavor here. Well, anyway, you can see the answer here. Whenever we try to create a new virtual machine, we have many components inside Nova interacting with each other, as in this flow. For instance scheduling we usually have the Nova scheduler; I'm not sure you can really see it, in the middle. That's the component which decides which host, which compute node or which hypervisor, will run our
virtual machine. In our case, what happened at step nine is that we created a virtual machine with a flavor. If you enable debug logging for the Nova scheduler filters, you can see every filter right away. We have multiple filters, of course, enabled by default: we have the retry filter, and we have filtering and weighing. So we have a list of compute nodes passing in and going out after every filter. Let's say we have 10 compute nodes. The scheduler will pass in 10, then the RAM filter will filter out hosts which don't have enough RAM to handle this virtual machine and will return only five hosts back to us. The same goes for the CPU filters, the disk filter and others, NUMA topology and stuff like that.

So you can see in the Nova scheduler filter output that we have "start: 1" and "end: 1". This is the number of compute hosts which the filter receives and the number which it lets through, or allows us to schedule on. In our case you can see that it was the disk filter which gets one host and gives us zero. It means there are no compute hosts, and again, this is just for the demo, which could host our virtual machine with the required disk. The reason was that we created a flavor with a very big disk. That was just an example, but if you follow the Nova scheduler logs at this time, that's the main output.

Another important thing to keep in mind is that this is the Nova scheduler, and in Nova we have an ongoing effort to move to the placement API, so things will change a lot in the next releases, Stein and I guess Train, when we have the placement API deciding as well which resource provider, which host, can host our virtual machine. So both the scheduler and placement take part in deciding and helping to choose which compute node will host the workload.
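Reading the per-filter host counts by eye gets tedious once there are many filters. The sketch below pulls the counts out of scheduler log text and reports the first filter that received hosts and returned none. The pattern "Name: (start: N, end: M)" is an assumption based on the output described in the talk, and the function names are hypothetical, so check the pattern against the log format of your own release.

```python
import re

# Assumed fragment format, e.g. "DiskFilter: (start: 1, end: 0)".
FILTER_RESULT = re.compile(r"(\w+): \(start: (\d+), end: (\d+)\)")


def filter_counts(log_text):
    """Return (filter_name, hosts_in, hosts_out) tuples in log order."""
    return [
        (name, int(start), int(end))
        for name, start, end in FILTER_RESULT.findall(log_text)
    ]


def first_empty_filter(log_text):
    """Return the first filter that received hosts but let none through."""
    for name, start, end in filter_counts(log_text):
        if start > 0 and end == 0:
            return name
    return None
```

In the demo from the talk this would name the disk filter, which is the hint to compare the flavor's disk size with the free disk on the compute hosts.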
Just for the sake of completeness, I described what is happening on that flow chart. One important thing is that all the troubleshooting steps for connecting to Glance, Cinder and Neutron are similar to what we had for Nova. Nova itself is a little bit different, because it usually has dedicated Nova compute nodes, so, for example, the log searching has to happen on the Nova compute nodes.

There is one more thing that we haven't mentioned yet. At the very end, why are you creating a VM with Nova? Because eventually we want to connect to it. If you cannot connect to the VM, the next slide will show you the troubleshooting. I am not going into that right now, but it's basically three main steps. At the very beginning, again, no panic. Check whether the VM was successfully built: you can check it with the server show command, or Horizon will show you whether it's in an error or build state. Did it get an IP address? Again, Horizon will show you, so no problem. And the very important thing, which even after many years of OpenStack administration I still forget: is the security group configured so that ICMP and the SSH port are open?

There's one interesting thing that we promised to include in this talk, and this is how you recover your Keystone admin access. Mars?

So this happened to me only once, thankfully not on my own OpenStack. This is how you recover your admin credentials for Keystone if you completely deleted your openrc file and you don't remember the passwords and stuff like that. This works because we have full admin access to the Keystone configuration files.
Luckily, we can grab the admin token from the configuration file, keystone.conf, and we can grab our URL from the same file. If you have these two parameters, you have complete, unrestricted admin access to OpenStack. So it's very important that once you have changed your password, you do not keep using the same token-based authentication; please do reconfigure your Keystone to work a different way. Generally the recommendation from developers, and I guess from everyone, is to avoid using token authentication, because it's not really very secure, and also to protect your keystone.conf file and, of course, access to your controller, because you can do everything if you have these two credentials. This is just a very basic scenario, very basic troubleshooting: how do you recover admin access if you accidentally delete the openrc file? Of course, it's better not to delete it.

We have general troubleshooting tips and tricks next, I guess.

So again, just do not panic. Identify and reproduce the problem. When you're sure that you actually ran into the problem, so it's not something like a network connectivity issue from your client to the OpenStack cloud, start collecting the information. First on the client side: which client tools and what version are you using? What is the environment? What is the debug output? The same on the server side: what are the tools? What is their configuration? What are the logs? Enable the debug output, and what is that debug output? Then check the environment, because the cause might be completely external: is the networking working? Is the operating system updated? What are the dependent services? Do you have memory, disk space, and so on and so forth?
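Part of the "check the environment" step can be collected automatically, so the same facts end up in every bug report or request for help. The snippet below is a minimal sketch that gathers a few of the items named above on a Linux or other Unix host; the function name and the selection of fields are assumptions for illustration, and the resource module it uses is not available on Windows.

```python
import platform
import resource
import shutil
import sys


def environment_snapshot(path="/"):
    """Collect a few basic facts about the host this script runs on."""
    usage = shutil.disk_usage(path)
    soft_fds, hard_fds = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "disk_free_gib": round(usage.free / 2**30, 1),
        "disk_used_percent": round(100 * usage.used / usage.total, 1),
        "open_files_soft_limit": soft_fds,
        "open_files_hard_limit": hard_fds,
    }


if __name__ == "__main__":
    for key, value in environment_snapshot().items():
        print(f"{key}: {value}")
```

It deliberately leaves out anything OpenStack-specific; client versions and service configuration still have to be added by hand or with the CLI.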
Fix the issue on the spot if it's trivial, like for example you mistyped your admin credentials. If it's a little bit more involved than trivial, then check it in your home lab first. And if after all of this you still did not find the answer to your question, the solution to your problem, then ask for help. Use web search first, because someone else might already have hit this problem. Reach for the documentation, start to talk to people on IRC, talk to your support guys, talk to developers. When all this is done, be careful with the mitigation, because you just do not want to go to one of your production OpenStack setups and start fiddling with config files. So test it first and then mitigate. And what is really important, and what I always forget: actually document it, because two weeks from now you will hit the same bug and you will have no idea what the solution was.

Here are a couple of tools to use to collect networking and operating system metric information. There's a very good graph of all the tools you can use on a Linux system to gather all this information. Check it out at least once, so that you are at least familiar with what kind of tooling is at your disposal.

The most important thing, though: watch out for the non-OpenStack-related issues. For example, when you cannot connect from your laptop to the cloud. There might be resource exhaustion: your OOM killer kicks in because memory usage is too high, you run into file descriptor limits, your physical node fails, and so on and so forth. Connectivity is the other very common problem. Maybe it's going to be just an IP address collision.
Maybe your switch is down. And problems between DCs can arise from time synchronization problems or just an external network misconfiguration, for example your DNS or firewalls are not set up properly.

Yes. One of the small things to mention is that we used to have, let's say, project-specific CLI tools, and now we have a common OpenStack CLI tool. And sometimes there's a very basic question: what version of OpenStack do you run? If you want to figure this out, the easiest way is to run openstack --version and you get the version. Then you go to this web page, which has the releases for every OpenStack component with their version numbers, and on this web page you can see that for the OpenStack CLI client, 3.18 is Stein, so this is basically the current release. Again, the most recommended CLI is the common openstack one, or you can use the older clients: you can use nova, cinder, glance or keystone. If you want to install individual clients, you can install them with pip; these are the typical commands with Python. And again, you can use openstack --version or nova --version, and then go here to find out the version of your components and the OpenStack version in general: is it Rocky, is it Stein, is it Kilo or anything like that?

Another example. We have pretty much the same scenario that we already covered step by step; this is just the same command on a single page, in a single snippet, with the different stages. Let's say you run this command, openstack server create. What will happen next? We set up our environment, we parse the arguments for the command, we request authentication and authorization from Keystone, and we start by requesting an image. You can see all the output: where it goes, what kind of request there is, what we are looking for, et cetera.
Then we request the flavor. Don't be scared if you see, for example, the 404 response in this case. What happens is that we ask for the flavor m1.nano, and by default the client first looks for a flavor ID m1.nano, and of course there is no such flavor ID. After that it starts searching by flavor name, and it will find it. That part is dropped, it's not shown here, but later on the server will be successfully created. So this is just an example of what you can achieve and what you can see with a single flag, --debug. It's really, really important to use it whenever you feel that something is broken.

There are a bunch of other commands we can use on the server side and on the client side. Again, it's a good idea to enable debugging beforehand. If you don't have it enabled, just set it to true in your configuration files. These are just some examples, for Nova for example, of how you enable debugging. After that you basically have two options: either you have logging going to journalctl, or you go to the /var/log files. In journalctl it's a bit easier. You can filter out by nanoseconds.
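The journal queries mentioned here can be assembled programmatically, which helps when the same filter has to be run against several services. This is a hedged sketch, not a command taken from the slides: the helper name is made up, the unit name in the usage note follows the DevStack naming pattern as an assumption, and the grep option needs a journalctl build with pattern matching support.

```python
import shlex


def journalctl_command(unit, since="1 hour ago", priority=None, grep=None):
    """Build a journalctl argument list for one systemd unit."""
    argv = ["journalctl", "--no-pager", "--unit", unit, "--since", since]
    if priority:
        # Show messages at this syslog level and more severe, e.g. "warning".
        argv += ["--priority", priority]
    if grep:
        # Only show entries whose message matches this pattern.
        argv += ["--grep", grep]
    return argv


def as_shell(argv):
    """Render the argument list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

For example, as_shell(journalctl_command("devstack@n-api.service", since="5 minutes ago", priority="warning")) produces a command line that can be pasted into a terminal on the controller, or the list can be passed to subprocess.run directly.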
You can filter by time, let's say since the last hour, or the last, I don't know, five minutes, or since today. It has a lot of different options for that. You can check those options in systemd.time; I never remember them, so I always go there. So you can query your debug messages for OpenStack components either in the journal or in the /var/log files.

The most important slide from today, I guess, or maybe not the most important, is where to search for help. There are a lot of places where you can ask for help in OpenStack, and we are very lucky, because it's used by many, many people, so most likely your problem is not unique to you and someone else has already faced it and fixed it. So try to follow the documentation; basically, read the documentation. Try to check the wiki, but be careful: not everything in the wiki is updated, so check the date when a page was last updated, because it might not be relevant to your OpenStack release. You can also read how a certain functionality is supposed to work at specs.openstack.org: what it's supposed to do, what it's supposed to receive, what it's supposed to output. If you want more details, or let's say you want to interact with developers, you can check a lot more tools on the right, with OpenDev, Launchpad and StoryBoard. Basically our old bugs are in Launchpad and our new bugs are in StoryBoard, and the same goes for blueprints.

The most important links are these three, if you are an operator or administrator. ask.openstack.org: this is more like a Q&A website. You can ask any question or you can search for help here. It's really, really good. You should try to check it out.
We have an amazing community on Freenode, in #openstack. You can ask any questions there related to OpenStack; if it's really project-specific or component-specific, you'll most likely be redirected to a project channel. We have lots of developers hanging out here quite often. Don't be afraid to ask any questions, don't be afraid to ask people for help. The best part of OpenStack is the OpenStack community. It's all open; we're all trying to do the same thing, all trying to achieve the same results. So IRC is great; use it, please.

And of course there are mailing lists. You can subscribe at lists.openstack.org; openstack-discuss is the main mailing list for everything related to OpenStack. Again, if it's really difficult, you can ask for help there. Try not to abuse it; there is an etiquette, and you can read it on that list, I think it's linked here. But you can also just subscribe and read the messages: what's happening right now, what's going to happen next and stuff like that. We also have a bunch of...

Yes, sorry, because we're running out of time. The most important thing is that you do your homework: gather your logs, your environment. Maybe you will fix your problem during that time; if not, then go to IRC, the mailing lists, and so on and so forth. Download our slides, please. You will have a lot of valuable information there, and you can just check out all the links later. Right now we are open for questions for a couple of minutes, if anyone has any.

Yes, sir. I'm sorry, can you use the microphone, please?
I was wondering if you have any general debugging suggestions for some of the Apache processes when they start running away with memory, how to track down and debug that.

I'm not sure if you've heard of him before, but there's Brendan Gregg, and he has tons of suggestions on how to debug and profile all kinds of processes in Linux. So yes, I would go with the tools from this diagram, especially the debugging tools, and not only something like strace but something even deeper. He has tons of different suggestions, his own methodology, and books on how to debug different kinds of processes. Apache is a very widely used program, so I'm not sure I can say anything new here about how to debug it. I guess configuration, tracing maybe.

Yeah, but especially this kind of resource exhaustion that happened to you can be easily tracked down just by knowing that these tools exist. At some point it's going to be one of the basics in your tool set: when you SSH into a server, you immediately check the disk, memory and CPU.

Anyone else? Well, then thank you very much for coming. Please download the slides and do not forget to rate us at the OpenStack Summit. Thank you. Thank you.