Okay, I can just put it in the pocket. Hello everyone, I think it's time to get started, and thanks for staying for almost the last talk; I think there are lightning talks after this. My name is Kashyap Chamarthy. I work as part of OpenStack Nova engineering at the moment, but I began as a test engineer a few years ago at Red Hat. Today I'll talk about what kind of debugging utilities the lower layers of virtualization have to offer when you're trying to deploy OpenStack, or when you're debugging an OpenStack setup. In any non-trivial deployment you have lots of compute nodes, and thereby multiple nova-compute and libvirt daemons, QEMU instances, and so on and so forth. So, let's get started. At the end we'll see a small real-world example that ties together some of the things we've discussed.

Why this talk? When you're debugging a large system like OpenStack, the virtualization drivers are often where you may end up; not often, maybe sometimes. You end up debugging problems in, say, live migration: there are any number of flows possible with live migration, so it's one of the complex areas where things can go wrong in a subtle manner, and finding the relevant log patterns can become complex and cumbersome.

Let's see what kinds of bugs there are. Well, you're all well aware that plenty of different kinds of bugs are possible: crashes, things that occur only under load. I linked a bug there; it's one of the notorious bugs that shows up only in the OpenStack Nova CI infrastructure. However much we tried to debug it independently, we just couldn't reproduce it. It's an interesting bug; if you're curious, you can take a look at it after the talk, or later sometime. The OpenStack CI infrastructure runs about 800 test jobs per hour. It's not a joke, it's real.
You can see the URL at the end, and that shows you what kinds of jobs it is running.

Briefly about Nova: who knows about Nova here? Almost everybody, okay. For the one or two who are not familiar with it, it's essentially the compute part of OpenStack, which is responsible for creating a disk image, calling the relevant virtualization driver, responding to calls to monitor the state, and so on. It supports multiple virtualization drivers, as you could guess: KVM and QEMU are the default open source drivers, and multiple other drivers are also maintained, like Xen and VMware, even LXC and so forth. The config attribute you see there is part of Nova's configuration file, where you specify the virt type and a bunch of other libvirt-specific settings.

A quick tour of the KVM virtualization building blocks. Most people here are very well aware of this, but for those who are not familiar with it, a quick idea. KVM, as Mikhail mentioned in his talk about how you enable KVM in the BIOS (maybe not the BIOS, but that's the general idea): okay, it's a kernel module.
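As a quick aside, you can verify both the module and the hardware capability on any Linux host; these are standard commands, nothing libvirt-specific:

```
lsmod | grep kvm                      # is kvm_intel or kvm_amd loaded?
grep -Ec '(vmx|svm)' /proc/cpuinfo    # non-zero: hardware virt is available
```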
You have Intel's VMX and AMD's SVM instructions, which allow you to perform hardware virtualization. QEMU is a device emulator that emulates about 17 CPU architectures, and it does all the disk, network, and display devices; if you want a complete PC-like environment, QEMU is the thing that is doing all the heavy lifting. The two commands you see there, if you run them, will enumerate what kinds of devices QEMU emulates and supports, and the second one, with -cpu, lists what kinds of CPUs QEMU emulates. QEMU interacts with libvirt via QMP, a JSON-RPC protocol spoken over a socket: when libvirt is talking to QEMU, it constructs QMP commands and talks to it over this JSON-RPC interface. And libvirt, as you've seen in the previous sessions, is the hypervisor-agnostic library that allows you to manage multiple different hypervisors. This is just a silly ASCII diagram that shows the same things we've discussed a few minutes ago.

Okay, let's look at a couple of the utilities available in Nova for debugging. As I've mentioned, when you request Nova to create an instance, the API gets a call, then the scheduler tries to place the instance on one of any number of compute nodes, and then the nova-compute process interacts with the libvirt virtualization driver, which in turn talks to QEMU via the QMP protocol. Feel free to interrupt me; when I look at the people around here, most of them are very familiar faces who are deeply involved in virtualization.
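To make that QMP interface concrete, here's a minimal sketch of what the protocol looks like on the wire. This is not anything libvirt ships; the socket path is made up, and you'd only ever speak raw QMP like this on a QEMU you started yourself for debugging:

```python
import json
import socket

def qmp_command(name, **arguments):
    """Build one QMP command: the same JSON-RPC-style message that
    libvirt writes on QEMU's monitor socket."""
    cmd = {"execute": name}
    if arguments:
        cmd["arguments"] = arguments
    return json.dumps(cmd)

def qmp_session(socket_path):
    """Sketch of a raw QMP session against a QEMU started with e.g.
    -qmp unix:/tmp/qmp.sock,server,nowait (the path is a placeholder).
    Doing this behind libvirt's back is exactly what's frowned upon."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(socket_path)
    sock.recv(4096)                                          # QMP greeting banner
    sock.sendall(qmp_command("qmp_capabilities").encode())   # leave negotiation mode
    sock.recv(4096)
    sock.sendall(qmp_command("query-commands").encode())     # enumerate commands
    return json.loads(sock.recv(65536))

print(qmp_command("query-block"))   # → {"execute": "query-block"}
```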
So maybe I'm preaching to the choir here. Nova compute, like any large project, has debug flags that you can set in its configuration file that allow you to examine the flow of requests in more detail and track down where the issues are. Nova also has this thing called guru meditation error reports, which I'll discuss in a minute. Following that, libvirt has a few very detailed utilities that let you examine a request in a very granular fashion; the libvirt log filters are one that lets you do that (we will discuss those in a few minutes as well), and there's the libvirt shell wrapper, the virsh tool. Then for QEMU, the QMP commands and the HMP commands are two of the things you can use to query, or modify, a live guest's state. But modifying guest state behind the back of libvirt or the higher management layers is really frowned upon, because it voids your support warranty. When debugging, I personally do use the QMP commands, because libvirt itself uses the QMP interface, so I keep my eyes tuned to reading the JSON blobs of QMP commands. This talk is mostly about libvirt and QEMU; for KVM itself there's a talk linked at the end in the references, a very nice one if you want to dig a bit deeper, but that's out of scope for this talk.

So, yeah, guru meditation reports. Until these were available, there was no way to get a proper error report of the live state of a Nova process, like the API process or the compute process, short of killing it.
With the introduction of these guru meditation reports, you can now supply a user-defined signal, either SIGUSR1 or SIGUSR2, to a process with the kill command, and it will immediately generate a large report that can be examined.

Good question. The question is: is this error report mechanism specific to Nova, or do other components have it too? It is currently wired up in Nova; there is work in progress to enable it in other projects as well, but I'm not tracking those efforts, so some of them might already have been merged. We can check that in the gate. Earlier the signal used to be SIGUSR1, but it turned out that some Apache processes reserve SIGUSR1, so we had to change the default signal to SIGUSR2; from the next upstream OpenStack release onwards, the signal to use is SIGUSR2.

So, yeah, that's just a brief rundown of the kinds of things you see in a guru meditation error report.
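The mechanism itself is generic: register a handler for the signal, and dump whatever state you want when it fires. A toy Python sketch of the idea (this is not the Nova/oslo implementation, just the shape of it):

```python
import signal
import sys
import traceback

def dump_report(signum=None, frame=None):
    """Toy guru-meditation-style report: a stack trace for every live
    thread (a real report also includes configuration, versions, etc.)."""
    lines = []
    for thread_id, stack in sys._current_frames().items():
        lines.append(f"--- thread {thread_id} ---\n")
        lines.extend(traceback.format_stack(stack))
    report = "".join(lines)
    print(report, file=sys.stderr)
    return report

# Nova wires this up to SIGUSR2 (SIGUSR1 in older releases), so
# `kill -USR2 <pid>` dumps a report without stopping the service.
signal.signal(signal.SIGUSR2, dump_report)
```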
There's a link at the bottom that you can refer to that shows what it enumerates. The most useful part is the configuration detail: sometimes when people report bugs, or, if you're working on a downstream product, when customers report bugs, they say they've enabled such-and-such a thing, but it often turns out they didn't. Just so you can eliminate that possibility, you now get the configuration details when you generate an error report. By the way, if you're wondering about the "guru meditation" phrase, it apparently comes from old Commodore machines: when they crashed, they used to generate such an error message, so that's where it originates. This guru meditation report framework in Nova was written by my colleague Daniel Berrangé, and the nice thing about it is that you don't need to set up or configure anything to trigger it; no action is necessary from the operator.

Now let's see what kinds of debugging controls libvirt and QEMU offer. Most often the first thing to look at is the virtual-machine-specific log, located in /var/log/libvirt/qemu/ followed by the instance name. That contains mostly the libvirt-generated QEMU command line, followed by any error messages from QEMU, so it's a useful place to look if you suspect a QEMU issue while debugging from a higher management layer. From there you might get further clues by enabling log filters; QEMU's standard error stream is also redirected to that log file.

Yes, a few minutes ago I mentioned the libvirt log filters.
They allow you to examine libvirt's logging in a more granular fashion. There are three main concepts in libvirt's logging: log messages, log filters, and log outputs. Filters are essentially a set of patterns and priorities that you tell the libvirt daemon to log. For instance, if you want to capture only things related to the QEMU driver, you can say: please capture debug priority for the QEMU driver, and only error and warning messages for everything else. That's how you enable it: you supply these config variables in the libvirt daemon's config file, restart the daemon, and then trigger your test from the beginning so you can see what's going on. That's for daemon logging. You can also examine which public libvirt API calls are being issued by setting some environment variables on the command line; with the environment variables listed there, you can redirect the output either to a specific log file or to the systemd journal (either is fine), or to both. And this applies to the libvirt daemon as well.

Yes. When you use the systemd journal, it has very nice structured fields that let you examine which function an error is coming from, which code line, which error code, and which source file. You get very granular detail when you use journalctl to query.
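Going back to the daemon-side logging for a second, a filter setup like this in /etc/libvirt/libvirtd.conf captures debug output for the QEMU driver but only warnings and errors elsewhere (the exact filter list is illustrative; tune it to what you're chasing):

```
# /etc/libvirt/libvirtd.conf
# priorities: 1 = debug, 2 = info, 3 = warning, 4 = error
log_filters="1:qemu 3:libvirt 3:util 3:rpc"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
```

Restart libvirtd after editing, then re-run the failing operation from the start. For client-side tracing, setting LIBVIRT_DEBUG=1 in the environment before running virsh prints the public API calls being made.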
These are just a couple of examples of what you can pass to the journalctl tool. The last command, for instance, says: journalctl, please enumerate all error-priority messages for the libvirt daemon since today. That's very convenient, and a useful tool to have in the toolbox.

Two other very interesting commands that QEMU allows, and that you can run via libvirt, are virsh qemu-monitor-command and virsh qemu-monitor-event. qemu-monitor-command allows you to pass through any QMP command via this invocation, so you can inspect the live state of a VM, or modify it if you're in a very deep debugging situation; oftentimes you just want to inspect the state. We'll see next what kinds of commands are available, and there are a lot more utilities that virsh offers that you can check out in its manual page.

So if you run qemu-monitor-command this way, you supply the VM name followed by the --pretty flag, which prints the JSON output in a readable fashion, and you say: please execute the query-commands command, which enumerates all the QMP commands QEMU offers and that you can run via libvirt. When you run it you see output like this, though I trimmed off a lot; there are about 160 or 170 commands. One of the commands I highlighted there is drive-mirror, which is used for live storage migration; we'll get back to this.
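Since that query-commands reply is just JSON, filtering it is easy. A small helper like this, run here against a trimmed sample reply rather than a live VM, is handy for finding a command by name:

```python
import json

# Trimmed sample of the reply to:
#   virsh qemu-monitor-command <vm> --pretty '{"execute": "query-commands"}'
# (a real reply lists roughly 170 commands)
sample_reply = """
{
  "return": [
    {"name": "query-block"},
    {"name": "query-events"},
    {"name": "drive-mirror"}
  ]
}
"""

def find_commands(reply_json, substring):
    """Return the QMP command names containing the given substring."""
    reply = json.loads(reply_json)
    return [c["name"] for c in reply["return"] if substring in c["name"]]

print(find_commands(sample_reply, "mirror"))   # → ['drive-mirror']
```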
That was just one example query; it's also useful to query the live block information of a VM. If you run query-block, it shows all the block device information for a VM's disk image. If there is a backing file (and most OpenStack Nova instances will have a backing file), you'll see information about the backing file, the virtual sizes, some of the I/O-related information, and so on. I trimmed off a lot of output, but if you're debugging anything related to the block layer, or just want to see what's going on under the hood, you can invoke this one.

With the qemu-monitor-event command you can watch various kinds of events. For instance, if you're performing a live block operation, like a disk copy during live storage migration, you can invoke this command, and then in parallel, while the disk copy or migration is going on, you can watch the events on the shell where you invoked qemu-monitor-event. You have to run it in a loop, so that as long as the operation is running on the other shell, you can see the events in progress. You can observe all kinds of QMP events; to see which events are possible, run qemu-monitor-command with query-events and it will enumerate all of them.

So, enough rambling about what tools are available. Let's see a small example of how you trace the flow of a guest crash during live block migration. When I was preparing this presentation, I encountered this bug.
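As an aside, since we'll be watching events during the example: the event stream is just newline-delimited JSON, so following it programmatically is trivial. A small loop like this (fed canned sample events here, rather than a live stream) shows the shape of tracking block-job events during a disk copy:

```python
import json

# Canned events, shaped like what `virsh qemu-monitor-event <vm> --loop`
# emits (one JSON object per line) while a drive-mirror job runs.
sample_stream = [
    '{"event": "BLOCK_JOB_READY", "data": {"device": "drive0", "type": "mirror"}}',
    '{"event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive0", "type": "mirror"}}',
]

def watch(stream, wanted):
    """Yield the data payload of every event matching the wanted name."""
    for line in stream:
        event = json.loads(line)
        if event["event"] == wanted:
            yield event["data"]

for data in watch(sample_stream, "BLOCK_JOB_COMPLETED"):
    print(data["device"], "finished its", data["type"], "job")   # → drive0 finished its mirror job
```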
So I thought, why not use it as the example. First, why this example? As I mentioned at the beginning, in any non-trivial deployment you have to interact with multiple compute nodes and multiple libvirt daemons. This is a nice example where a live block migration is happening, which means that along with the guest state (the device state and memory state), you are also copying the disks to the destination; that's what's happening when you say block migration. And you can examine what kinds of commands libvirt is constructing to send to QEMU, and what libvirt is asking QEMU to send to the destination daemon. So it's a nice example if you want to see the flows between the source and destination libvirt and QEMU instances.

When you want to perform a live block migration from Nova, you just say nova live-migration, followed by the --block-migrate flag, and you specify the instance name and the destination. Once you do that, there is a bunch of flags that Nova sets to configure the block migration, and it performs it. By the way, as a small digression, these flags will now be deprecated in Nova, though the functionality will be retained: it's confusing for users and operators to set up all these flags, so it's easier to eliminate them and do the right thing by default, with a single tri-state configuration option that handles the various cases. Up until now you had to set them explicitly; there are some default flags, but you can set something specific based on what you're doing, like if you want encrypted tunnelling.
There is a specific flag for that. So, yeah, that's in brief what happens when you trigger the live-migrate command. If you want to see what's happening underneath, a good way is to use the libvirt shell interface and supply the same flags that Nova is asking libvirt to construct. That's what you see there: virsh is the shell interface of libvirt, and you say migrate, followed by the --verbose flag and the incremental-storage flag, which means the same base image is shared between source and destination, followed by the peer-to-peer flag, which implies that the source libvirt daemon controls the complete migration flow. If I'm wrong, please correct me, Michael.

Yes? What is that, can you repeat it? Yeah, this is just to see what Nova runs when you invoke nova live-migration; I didn't show the Nova-specific logs for that. This is just a way to enumerate what you can run with libvirt instead of Nova. Yes, yes: in the Nova logs you can see what calls are being made, and it's the same interface, using the same APIs that Nova is calling, through the equivalent Python bindings. The libvirt-python project is what Nova uses, and the specific API is the migrateToURI API you see at the bottom. So that's it, in short.

Yes. And when you do that, virsh's standard error stream says "guest unexpectedly quit". That doesn't give us much information about what's going on. So let's see what else we can do: we can examine the libvirt daemon log and see what else is available.
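For reference, the full virsh invocation discussed above looks roughly like this; the instance name and destination URI are placeholders, and `man virsh` has the complete flag list:

```
# On the source host: live migration with incremental storage copy,
# peer-to-peer (the source libvirtd drives the whole migration)
virsh migrate --verbose --live --p2p --copy-storage-inc \
      instance-00000001 qemu+ssh://dest.example.com/system
```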
There is an error message that says "internal error: End of file from monitor", which again doesn't tell us much if we're coming from a higher-level management project like oVirt or OpenStack or some other project like that. If you scroll further down, you'll see a slightly more verbose error message saying that the QEMU monitor was closed without a SHUTDOWN event, and that libvirt is assuming the domain has crashed. But we are not certain whether that assumption is true or not; you don't know whether it really crashed. So let's see if we can confirm the assumption. As I mentioned a few minutes ago, the virtual-machine-specific (that is, Nova-instance-specific) log is located in the /var/log/libvirt/qemu directory. There you see a cryptic message, "Co-routine re-entered recursively", which really doesn't say much either, and then that the guest is shutting down.

So let's see how much further we can drill down. Libvirt said the guest crashed, and we can examine whether there are any core dumps available. We can use a systemd journal tool, coredumpctl, to see if any core dumps are present, and we see that there is indeed a core dump available for the QEMU process. So that confirms the libvirt assumption. If you have all the necessary debug packages and all the relevant dependencies installed,
you'll get a nice core that you can dump via the coredumpctl command, so you can extract contents like a stack trace from it. If you're providing a bug report to a lower-layer project like QEMU or libvirt, this is one way to go, where you provide as much detail as you can; you can also see the SIGSEGV signal, the segmentation fault, that coredumpctl outlines clearly. At this point we know that the bug lies in QEMU, so we can report all this detail in a bug, which is quite informative; you don't really have to do anything more, because it's a more or less self-contained bug report with this level of detail, which a QEMU developer should be able to act on. And the said QEMU developer who fixed it is sitting right there: Kevin Wolf. This turned out to be a bug in QEMU's disk mirroring code. When I filed the bug, I was looking through the git source code to see if there were any hints for the relevant error messages.

Sorry? Well, yeah, good question: is there any ABRT integration? ABRT is the automatic bug reporting tool, for those who haven't heard of it. There is, but I didn't enable it on my system, so it probably didn't report automatically, and I filed the bug myself. But yes, I don't trust the ABRT reports as much.
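(For reference, the coredumpctl steps used above look like this; the match string depends on how your distribution names the QEMU binary:)

```
coredumpctl list              # any recent core dumps?
coredumpctl info qemu-kvm     # metadata: signal (SIGSEGV here), time, command line
coredumpctl gdb qemu-kvm      # open the core under gdb to get a backtrace
```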
Maybe I should trust them; this will be recorded, so... Sorry? Often, yeah, okay. So, that's one small example that outlines the tools and utilities we can use to track down a bug. It's not really necessary, if you're working at a certain management layer, to dig all the way down; you could file the bug against the respective component and the respective engineers might track it down, but it's always nice to dig deeper and try to find the root cause as far as we can.

That was the failure case. In a successful case, to understand what's going on behind the scenes, you can grep through the libvirt daemon log to see what kinds of commands libvirt is asking QEMU to construct. You can see the drive-mirror command; if you were paying attention a few minutes ago, we saw drive-mirror when we enumerated the list of commands via qemu-monitor-command. This is where you can see the invocation that the source libvirt daemon uses to talk to the destination, where there's an NBD (network block device) server running, which receives all the data sent from the source libvirt daemon. This is what I grepped on the source host; likewise you can do the same on the destination and see what's going on there.

A few references, as I said, at the end. From last year's KVM Forum there's a nice talk by David Hildenbrand on guest operating system debugging that goes into much more detail about using gdb and lower-layer debugging, and there's another one by Stefan Hajnoczi, also from last year, at FOSDEM, on observability in KVM; that's also a very nice talk. So if you're interested in digging further, you can check those out.

Yes, almost the last thing: a few days ago I was at FOSDEM.
I attended a talk there; some colleagues of mine said I should check out the talk called "Hunting the bug from hell", where Andrew Haley, a free Java developer, said something very interesting: he outlined the most complex bug he has ever fixed. He set the stage by describing the problem statement and then gave some hints, to see if anyone in the audience could guess it. A very well-engaged talk, maybe unlike this one. His essential point, in summary, was that a lot of us spend time debugging, yet not much is written or shared about it. I thought that was a very well-made point.

There we are. That's all I've got. Are there any questions? But I don't know, most of the people sitting here are experts. All right, there are no questions. Thank you.