All right, it's just about time, so let's get started. My name is Kashyap Chamarthy. I'm a virtualization and OpenStack contributor at Red Hat, working on the OpenStack engineering team. Today I'm going to talk about the debugging mechanisms available for troubleshooting virtualization drivers in OpenStack. We'll focus primarily on libvirt and QEMU, the default open-source drivers in OpenStack.

First, why does this matter? In a typical OpenStack environment, one that isn't just a proof of concept, you have more than two compute nodes, multiple libvirt daemons, and multiple QEMU instances involved. Troubleshooting that kind of setup becomes cumbersome when you don't have effective tools in place, and if you're not working in a systematic manner, it can be overwhelming. For instance, with a complex operation like migration, it can be difficult to pinpoint exactly where the root cause lies: multiple compute nodes and multiple QEMU instances are involved, so if you want to track their interactions, you need certain tools in place. We'll see what tools we have at our disposal to troubleshoot these things, and also some of the logging facilities that libvirt has to offer.

What kind of bugs? The ones you see on the slide are not really specific to OpenStack; you'll see them in all kinds of environments. Heisenbugs: the notorious bugs that are not easily reproducible, because when you try to reproduce them, or debug at a certain level, you just can't replicate the bug. The bug number you see there is an example of one that simply isn't replicable in any environment outside of OpenStack CI. And bugs introduced by load.
The OpenStack CI infrastructure runs about 800 test jobs per hour, so you can imagine the kind of load that generates, and the subtle issues that are hard to track down. There's plenty more. Nova: I mean, we're at the OpenStack Summit, so I'm sure you've heard enough about Nova already. Just a quick overview: it runs your compute workloads, schedules Nova instances, and interacts with the underlying virtualization drivers, be it KVM, QEMU, Xen, Parallels, VMware, many of them. The table you see there shows the kinds of virtualization that are supported; you specify the virt_type option in Nova's configuration file.

So what else have we got? All these slides are posted online, so you don't really have to take pictures; I'm just noting it, but you can still go ahead if you insist. Some of the KVM virtualization building blocks. How many of you are aware of KVM and QEMU? Okay, that's just about all of the room, so I think I can skip this. Briefly: KVM is the kernel module, part of the Linux kernel, that provides the popular virtualization mechanism, and QEMU does all the device emulation: your disks, sound, PCI, and so on. It supports about 17 CPU architectures; I was reading about that recently. The commands you see there enumerate what kinds of devices QEMU supports and what kinds of CPUs it can emulate. The QEMU command line is really crazy; I don't have one handy here, but if you run ps on your compute host and grep for the QEMU process, you'll see the monstrous command line. And then libvirt, the hypervisor-agnostic virtualization library, interacts with QEMU via two mechanisms: one is command-line arguments, and the other is the QMP interface, the QEMU Machine Protocol. These are the default drivers in OpenStack.
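The enumeration commands mentioned above can be sketched like this; it assumes the x86-64 QEMU binary is installed (package names vary by distribution), and is guarded so it degrades gracefully where it isn't:

```shell
# Enumerate what a QEMU binary supports: emulated devices and CPU models.
if command -v qemu-system-x86_64 >/dev/null 2>&1; then
    qemu-system-x86_64 -device help | head -n 5   # emulated device types
    qemu-system-x86_64 -cpu help    | head -n 5   # emulatable CPU models
    result="qemu-system-x86_64 found"
else
    result="qemu-system-x86_64 not installed on this host"
fi
echo "$result"
```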
Yeah, this is just a silly, small ASCII diagram that shows what we've just seen. For KVM, you can see the /dev/kvm character device that is part of the kernel. If you want to check whether KVM is present, the quickest way is to run file /dev/kvm; you should get output saying it's a character special device. And your guest runs as part of a QEMU process, so QEMU is just another userspace process on the host, like your Firefox or Nautilus or any other userspace tool you use. On the top right you see libguestfs, another versatile tool that lets you rescue and inspect disk images; if, say, your guest dies due to some kind of SELinux problem, you can fire up libguestfs to examine the guest and fix the problem. It also ships a wide range of virtualization tools, so I highly recommend it if you haven't checked it out yet.

Now, let's see what utilities are available to debug the compute process in Nova. I'm not showing all the Nova services there; Nova has plenty of services. There's the API, through which you get the call; AMQP is the protocol used to communicate across services in OpenStack; and the nova-compute process interacts with the underlying virtualization driver through the virt driver interface, where libvirt is the tool in question. Libvirt in turn interacts with QEMU via the QEMU Machine Protocol. That's it in a nutshell. There are lots of different tools at your disposal to troubleshoot virtual machines. For instance, as you'd expect in a typical OpenStack environment, Nova gives you debug and error messages when you check the nova-compute or API logs. You can get those by enabling the debug and verbose flags in the configuration file, but they're too verbose to find any meaningful detail in when you're investigating complex problems involving the virtualization drivers.
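The quick KVM check described above looks like this; it's guarded because /dev/kvm only exists on a Linux host with the kvm module loaded:

```shell
# Check whether KVM is available on this host.
if [ -e /dev/kvm ]; then
    # expect output describing a "character special" device
    kvm_status=$(file /dev/kvm 2>/dev/null || echo "file(1) not installed")
else
    kvm_status="/dev/kvm not present (no KVM, or not a Linux host)"
fi
echo "$kvm_status"
```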
So for that, there are more tools that libvirt and QEMU offer. I'm not going to list them all; we'll see a couple of utilities that Nova compute offers, then dive into a few that libvirt offers as examples, and then walk through a real bug to see how you track down the root cause.

Nova has this Guru Meditation Error Reports framework, introduced by Daniel Berrangé and Solly Ross, both Nova contributors. If your Nova process is misbehaving, you can send it a Unix signal: kill -USR1 with the process ID of nova-compute (USR1 and USR2 are the user-defined Unix signals), and it will print out a whole large error report on its standard error stream, which may be redirected to a log file, or wherever standard error goes on your distribution. From the Mitaka release onwards the default signal is USR2, because there's a collision: another component, Apache's mod_wsgi, uses USR1 for its own purposes, so the signal had to be changed to the other user-defined one. That will be the default from Mitaka onwards. The cool thing about this Guru Meditation Error Reports framework is that no prior action is necessary from the administrator. You can just trivially signal the process, be it compute, API, or whatever, and it prints out all kinds of things: configuration details, threads, package versions, et cetera. You can check out an example error report; just search for "Guru Meditation Error Reports" and Nova, and you'll find plenty of them.

Okay, since the talk's focus is on libvirt and QEMU, let's see what tools are at our disposal to debug the virtualization drivers. Most administrators who deal day to day with virtual machines will be aware of the guest-specific logs under /var/log/libvirt/qemu/, the VM logs.
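To make the signal mechanism concrete, here is a toy, self-contained stand-in for a service that dumps a report when it receives SIGUSR2; against a real deployment you would send the signal to the nova-compute process instead, as noted in the comment:

```shell
# Toy demonstration of the Guru-Meditation-style signal mechanism:
# a background "service" that writes a state report on SIGUSR2.
report=/tmp/gmr-demo.$$.log
(
    trap "echo 'state report: threads, config, package versions...' > $report" USR2
    for _ in 1 2 3 4 5 6 7 8 9 10; do sleep 0.2; done   # pretend to do work
) &
svc=$!
sleep 0.4                      # give the trap time to be installed
kill -USR2 "$svc"              # cf. kill -USR2 $(pgrep -f nova-compute)
sleep 0.4
cat "$report"
kill "$svc" 2>/dev/null || true
```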
Nova instances are really not named VM1 or VM2; their names are long unique identifiers, and the logs are named accordingly, as <instance name>.log. They typically contain the libvirt-generated QEMU command line, libvirt's standard error stream, and any error messages specific to QEMU; you'll find those in the VM's log, along with the other guest-specific logs in that directory. So when you're debugging the virtualization drivers, this is a pretty handy first place to look.

Libvirt also offers a granular logging infrastructure to capture error messages from both the libvirt daemon and the library client side. Specifically, the libvirt log filters are very useful, because you can say: capture debug output for such-and-such component, and only error messages for everything else. That lets you capture the areas you're specifically interested in and ignore the rest. You set the log filters in /etc/libvirt/libvirtd.conf. The thing you see there is log_filters: I want all debug information for the QEMU, libvirt, and security components, where 1 stands for debug and 3 stands for warnings and errors; those are just the codes used internally. Then you say: please redirect all the output to a specific log file. You can put it under /var/log/libvirt to keep it consistent, or anywhere of your choice depending on your storage. And don't forget to restart the libvirt daemon so the change takes effect.

What else? Libvirt also has client-side API logging that you set via environment variables: you can just export the relevant environment variable, and libvirt will dump all the API-related logs, every call libvirt is making, on standard error. The nice thing is you can redirect these outputs to two sinks: to the systemd journal, to a file, or both together, depending on how you want to set up your logging infrastructure. Specifying multiple log outputs like that works for both the client API logging and the libvirt daemon logging.
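The daemon-side setup above can be sketched as follows; the component names and levels follow the slide, and the sketch writes to a scratch file rather than the real /etc/libvirt/libvirtd.conf (check the comments in that file for the full syntax):

```shell
# Sketch of libvirt daemon logging configuration (scratch copy).
conf=/tmp/libvirtd.conf.demo
cat > "$conf" <<'EOF'
log_level = 3
log_filters = "1:qemu 1:libvirt 1:security"
log_outputs = "1:file:/var/log/libvirt/libvirtd.log 1:journald"
EOF
grep 'log_' "$conf"
# Client-side API logging, by contrast, is enabled via environment
# variables before running a libvirt client:
#   export LIBVIRT_DEBUG=1
#   export LIBVIRT_LOG_OUTPUTS="1:file:/tmp/libvirt-client.log"
# Remember to restart the libvirt daemon after editing the real file.
```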
The systemd journal. The journal is very useful for debugging your system services, and in this context it also has libvirt-specific journal fields. It presents all the details in a structured manner: if you're seeing a libvirt-specific error, it clearly points you to a specific source file and function. So it's pretty neatly structured when you examine the journal. These are just some example commands. You can say: journalctl, show me all priority-error messages for the libvirt daemon since today. It's very flexible; you can query it to hell and back with various switches. And you can monitor it live, just like the classic tail -f: run journalctl -u libvirtd -f and you can follow the libvirt daemon's messages as they arrive. So yeah, feel free to stop me in between, instead of me just droning on.

What else? QEMU also offers the ability to live-query your virtual machine, which is exposed via libvirt: the qemu-monitor-command and qemu-monitor-event commands are among the most interesting ones. qemu-monitor-command allows you to supply any QEMU monitor command via virsh, the libvirt shell interface, so you can either query the state of a virtual machine or modify it. However, if you modify the state of a virtual machine through qemu-monitor-command, your libvirt warranty is gone. Querying is fine, but modifying the state of a virtual machine that way goes behind libvirt's back, so that's something you'd only want to do in dire situations. And qemu-monitor-event, as the name makes self-explanatory, lets you monitor all kinds of events coming from QEMU.
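Putting qemu-monitor-command to use looks roughly like this; "instance-00000001" is a hypothetical Nova-style guest name, and the block is guarded so it degrades gracefully on a machine without libvirt:

```shell
# Live-querying a guest over QMP via virsh.
vm=instance-00000001                      # hypothetical guest name
qmp='{"execute": "query-block"}'          # a read-only QMP query
if command -v virsh >/dev/null 2>&1; then
    virsh qemu-monitor-command "$vm" "$qmp" 2>/dev/null \
        || echo "no running guest named $vm on this host"
    # The event side (runs until interrupted, so commented out here):
    # virsh qemu-monitor-event "$vm" --loop
else
    echo "virsh not installed; the QMP payload would be: $qmp"
fi
```

Without the --hmp flag, virsh treats the argument as QMP JSON, which matches the protocol libvirt itself speaks to QEMU.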
For instance, if you're doing a live block migration or a live migration, you can fire up qemu-monitor-event via virsh, the libvirt shell interface, for a specific Nova instance, and watch all the events live: whether there are any errors, and all the fine-grained details of the interactions between libvirt and QEMU. And there are a lot more utilities; if you check the man page of virsh, you'll see a lot more detail there.

Yeah, the syntax for qemu-monitor-command: it's the JSON syntax you see there. There are two ways to supply commands: there's the human monitor protocol, HMP, and then there's QMP. But HMP is being deprecated, and QMP is what libvirt itself uses under the hood, so QMP is what I've trained myself to get used to. It's also easier to remember, because if you do it enough, you can map which commands libvirt is sending to QEMU. If you check the man page, you can see the details of both syntaxes. The command on the slide just says: please execute the QMP command query-commands, which enumerates all the commands QEMU supports, via the shell interface. So you can look through them, see which ones look interesting, check whether they're relevant to the problem you're debugging, and then invoke them. The query-commands output you see there is truncated; I'm not enumerating everything, I just listed four commands, two of which are queries and the other two of which modify the state of a virtual machine. The drive-mirror command you see there we'll get back to when we see an example at the end, tying together everything we've seen so far: it's useful for things like continuous disk replication and live storage migration, and when you do a live block migration in Nova, that's
the underlying primitive used in QEMU. At the top you see query-events: you can ask QEMU which events it supports, and when you run that command on your compute node, you can see all the possible options. Another one is query-block, which enumerates all the details of the block devices attached to your virtual machine. For a Nova instance, you can see details like the backing file of the instance's disk, the I/O operations, and whether any I/O operations are in progress. I truncated a lot of the output to keep it clear, but you can see the virtual size of the actual Nova instance's disk, and if there are backing files involved, as there usually are, you can see their details as well.

Yeah, the qemu-monitor-event example: block migration is one of the frequently used operations in an OpenStack environment, so you can say, please print out, in a loop, all the monitor events that are happening for a specific Nova instance, and you watch them via that command. So we've quickly run through lots of different tools, and there's a lot more we could talk about, but it's not really interesting to enumerate all of them. Let's see a small example instead: tracing the flow of a live block migration from Nova, and seeing where we end up. First, why this example? In a non-trivial OpenStack environment, like I said, you have at least two compute nodes involved, and two QEMU instances, so you can examine the operation on both the source and the destination, see all the interactions, spot any errors, and monitor a whole lot of detailed information. So I thought this was a cool example; while I was making these slides, I stumbled upon a real
bug, so I thought: why not use the same one as the example? That's the small syntax you use when you want to do a live block migration: nova live-migration --block-migrate, for this VM, to this destination. Block-migrate essentially means that, along with your virtual machine's memory, the disks are copied to the destination as well. When you invoke that, Nova's libvirt driver sets a bunch of flags. There are defaults, but you can still configure them: if you know how live migration works at a more advanced level, or you want to do something specific, you can supply the flags yourself.

So what happens when you invoke the Nova block migration? The command you see on the screen is what libvirt is invoking under the hood; Nova's virtualization driver is essentially making calls into this infrastructure. This is just what the equivalent virsh command looks like, the same one we saw before, if you were to invoke it directly with the libvirt shell. I didn't show the output of the command; when you run it, it just prints details of the block migration, which isn't very interesting to see. So yes, that's the command Nova's libvirt driver is calling under the hood.

And what happens when you run it? Apparently there's an "internal error" and the "guest unexpectedly quit". That's pretty nasty: if you have a database or something running inside it, this is a disaster. But it's not a helpful message; "internal error" doesn't mean anything, it's not helpful at all. So let's see what else we can find. That's what standard error says, for now. When you invoke that live block migration command, like we've seen, the guest-specific logs are located under /var/log/libvirt, and there are a couple more errors there: again the "internal error", which still doesn't mean anything. What is an internal error? And we
go further down (I truncated some output there), and libvirt assumes the guest crashed. But we don't know whether that's true or not, so let's see if we can find out whether the assumption is right. We've seen the error from libvirt; next, libvirt calls into QEMU, so let's check what QEMU's error is. The QEMU-specific log is /var/log/libvirt/qemu/VM1.log; like I said, VM1 is not a typical Nova instance name, but I named it VM1 just to keep it brief. And it shows a cryptic error message, "co-routine re-entered recursively", and then that the guest is shutting down. So we now know that QEMU, at least, also thinks the guest shut down.

What else? Since libvirt assumed the guest crashed, we can use tools like systemd's coredumpctl to see if there are any core dumps or stack traces specific to the QEMU process. You run coredumpctl, and there you see a process associated with the QEMU binary, so libvirt's assumption is confirmed: there was indeed a crash. You can then query the coredumpctl tool for more detail about the specific crash or event it logged, and it can enumerate the stack traces; if you have the respective debug packages installed, you can see a lot more detail. So this is just an example: if you have the relevant knowledge, you can either fix it yourself, or report the bug against the QEMU component of your distribution. And what is the root cause? It turns out to be a regression in the guts of QEMU's disk mirroring code, fixed by the commit you see there. So that's a small example of how you can track down a problem occurring in Nova all the way down to QEMU.

I think earlier I mentioned that the first message we saw, the crash assumption, was from the VM's log, but it's actually from the libvirt daemon log, the one we configured the log filters for earlier; that's what we're seeing. I erroneously described it as the virtual machine's log, but as you must have seen from the name, this is the daemon log. So yeah, that's a small example you can follow, and there's more you can
examine if you look at the man pages for libvirt's virsh shell, and the QEMU source also has some documentation, though it's more low-level, if you're interested in that area. For KVM, we don't have that much time, but fortunately there's a good talk by another QEMU developer at FOSDEM earlier this year; if you're interested in KVM virtualization, I'd recommend you check that out, it's a very good talk. And there's another one from a couple of months ago, at KVM Forum, by David Hildenbrand on guest debugging in KVM; that also goes pretty low-level into KVM and the kernel, so if you're interested in that area, I'd recommend that talk as well.

How much time have we got? Yeah, I've just got one more slide. The previous example we saw was an error case, where the guest crashed, but in a successful case you can see the interactions between the source and destination libvirt daemons. What you see on the slide is me just grepping for which commands the source libvirt daemon is sending to the destination node. You can see the drive-mirror command we saw earlier, I don't know if you were paying attention, which is used for storage replication and live storage migration. And you don't need a shared-storage setup for this, which is the nice thing: you can just do a live storage migration, or keep the continuous replication running as long as you want and then terminate it gracefully. So that's the command you can see the source libvirt daemon asking QEMU to execute, and vice versa: you can also observe the daemon logs of the destination libvirt daemon and see what interactions are going on there. So that's about it; if you have any questions?

[Question: if you want to track the I/O details, from the VM all the way to the host?] Yes, there is a domblkstat command in virsh that shows you the block statistics of a guest, so there is a command, and it maps to a respective
QEMU command, also called block statistics or something similar; I don't recall the exact name at the moment, but yes, there is a mechanism for that. If you just run man virsh and search for "block", you'll see domblkstat there. And there are more tools as well; there's a nice page I saw on LWN.net, or one of those sites, that enumerates all kinds of tools at the different layers of the stack that you can observe. So that's one thing that comes to mind immediately. Any other questions? I can't see, it's too bright up here.

[Question: are there lifecycle events, like guest shutdown?] Yes, there are lifecycle events: when you run qemu-monitor-event, you'll see a bunch of events related to the guest lifecycle. Guest lifecycle events are supported, and you can see the events related to that: if you run qemu-monitor-event for a Nova instance in a loop and then shut down the guest, it will enumerate the details of the guest's events. Again? Well, I think that comes down to Nova; you can see what commands are passed to QEMU from Nova. [Question: so if a higher-layer tool initiated the event, you have to use an external tool to answer it?] Well, it depends on what's triggering it, whether you have any external scripts or tools handling your virtual machines. At the libvirt and QEMU level, I think what you can see is that the lifecycle events are happening; what is triggering them, an external tool or otherwise, depends on your deployment environment and what's involved in it. Anything else? All right then, thank you.
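The success-case log inspection from the last slide, grepping the source daemon's log for the QMP commands it sends, can be sketched like this; the log excerpt is fabricated for illustration, and the exact message format of real libvirtd debug logs is an assumption (they are far noisier):

```shell
# Fabricated excerpt standing in for a libvirtd debug log.
log=/tmp/libvirtd-demo.log
cat > "$log" <<'EOF'
debug : qemuMonitorSend: msg={"execute":"drive-mirror","arguments":{"device":"drive-virtio-disk0","target":"nbd:dest-compute:49153:exportname=drive-virtio-disk0","sync":"full"},"id":"libvirt-42"}
debug : qemuMonitorIOProcess: {"return": {}, "id": "libvirt-42"}
EOF
# Which QMP commands is the source daemon sending to QEMU?
grep -o '"execute":"[a-z-]*"' "$log"
```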