All right, so yes, we can get going. I assume most of the people here are just waiting for the whiskey. I'll be talking about support for performance tools in Xen.

A little bit of background: when I joined Oracle earlier this year, Konrad Wilk and I started talking about things that needed to be worked on, and most of these things were performance related. As we were going over them, it appeared that yes, these things are clearly performance related, and they seemed to be obvious performance problems, except that we hadn't actually measured that there was a performance problem. As we went from project to project (I don't remember what the projects were) it was one thing after another: yes, it's obvious, but we don't know. That reminded me of a math professor at school who would go into some insane theorem proof, and in the middle he would make an assertion and say, "Well, isn't that obvious?" And you really shouldn't nod your head, because then he would say, "Well, Mr. Ostrovsky," in my case, "then you have 30 seconds to prove that it is obvious." Hated the guy. So after going through these projects with Konrad, we decided that we probably needed to do something about the tools, or the lack of tools, I guess.

So yes, I have an agenda. Before we go into what this project is about, it may be worth going over the methods that people usually use to measure performance. The first one is timing. It's obvious: you look at your watch and you see how long things take. Say you apply a patch, and your system took one minute before the patch and until next Thursday after the patch.
You kind of know that you have a problem with the patch. So there is nothing surprising there.

The second thing you do, usually after you measure your timing, is you start counting things. If you only use software, you usually count things by adding a counter to your code and incrementing it, and at the end of the run or at the end of the routine you print the counter and say, well, the routine was executed five times. Hardware people figured maybe they should help, and they added support for performance counters for various events: cache misses, TLB misses, data loads, and all these sorts of things.

So this is all good, but you really need to know where to count, where to insert your counter in the case of software. That's where sampling comes in. When you sample your program or your system, every so often you interrupt your program, you look at where you were interrupted, and you write it out. Then at the end of the day you look at where you were interrupted most of the time, and that's probably the place you should concentrate on. Again, if you're using software, a typical way to sample is by using a timer: your timer interrupts every so often and you write out information about where you were. For hardware, people usually use performance counters: they load the counter register with a value and then either increment or decrement it, and when it hits zero you get an interrupt and you do the same thing as in software. Counters are notoriously imprecise, so if you really want to go deep and find out where exactly your program is at the instruction level, counters are not necessarily the best tool.
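
To make the counter-overflow sampling just described a bit more concrete, here is a minimal Python sketch. It is a software simulation, not real PMU code: the function name `sample_by_overflow`, the trace format of `(location, events)` pairs, and the workload labels are all assumptions invented for illustration.

```python
from collections import Counter

def sample_by_overflow(trace, period):
    """Simulate counter-overflow sampling over a trace of (ip, events) pairs.

    The counter is preloaded with -period and counts up. When it reaches
    zero, an "interrupt" fires: the current location is recorded and the
    counter is reloaded for the next sampling period.
    """
    samples = Counter()
    counter = -period
    for ip, events in trace:
        counter += events
        while counter >= 0:        # overflow: the "interrupt" fires here
            samples[ip] += 1       # write out where we were interrupted
            counter -= period      # reload the counter
    return samples

# Hypothetical workload: three code locations with very different costs.
trace = [("parse", 1), ("hot_loop", 8), ("cleanup", 1)] * 100
profile = sample_by_overflow(trace, period=7)
for ip, hits in profile.most_common():
    print(ip, hits)
```

The location that consumes most of the events collects most of the samples, which is the whole point of sampling. The period in this sketch is deliberately not a multiple of the loop's total cost of 10 events; with a period of 10, every sample in this simulation would land on the same location.
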
So the hardware vendors came up with special widgets in their PMUs. AMD has IBS, which is Instruction-Based Sampling, and Intel has PEBS, which I don't remember what it stands for, but I assume the S is for sampling.

And finally, people use tracing, where you write out which points of execution you passed through, and then you look at the code and say, well, I usually go down this path and maybe I should do something about it. With tracing, again, you insert probes in your code, and as you pass through those probes you write them out. For hardware, Intel has the Branch Trace something; again, the S stands for something. And I don't really know what the AMD story is here; I think they have something.

So let's go over the existing tool situation with Xen, what we have now. We should probably look separately at guests and the hypervisor. For software methods, presumably any guest, HVM or PV, should be able to use any tool. I say presumably because I haven't tested all of them, but it seems like any tool should be usable. For hardware, the story is different between HVM and PV. For HVM, for the last few years we've had the VPMU module, which supports counters for both AMD and Intel, and for Intel it also supports BTS. Presumably any tool that uses those features and that you run in an HVM domain should work. We know that perf works and OProfile works, and I would think that any other tool should work.

For PV the story is a little less happy. The only tool, well, not the only tool, I guess the most widely known tool is Xenoprof,
which is written on top of the existing OProfile that everybody knows and loves. And again, that's the only tool I can think of.

For the hypervisor, there is a bunch of software tools that are actually part of the tree that you use to measure performance: xentrace, xenalyze, tons of tools that start with the word "xen". To do hardware profiling, again, Xenoprof is pretty much the only tool. There are probably some homegrown tools; we at Oracle, I think, had some internal tool, but it's not upstream, so Xenoprof is the only one that we can really talk about here.

So the question was what do we go with: which tool should we pick as the target for this project? First of all, why not go with Xenoprof, because it's a good tool? Well, first of all, it's not in the tree. The Xen part is in the tree and has been for a while, but the other two parts, the Linux kernel part and the tools part, are not. I believe the latest patches are for 2.6.32, if that. So you would have to port them every time you have a new release, well, not every time, but often, and that's not very sustainable. Second, OProfile, on which Xenoprof is based, is in maintenance mode now. Bug fixes go there, but new features do not. So it has limited runway in front of it, which is why we decided that's probably not the way to go.

The alternative seemed to be perf, for exactly the reasons that Xenoprof is not: it's in the tree. It's the default performance measurement tool for the Linux kernel for hardware measurement.
Well, not just hardware. It's in very active development. From the hardware perspective, any time a vendor comes up with a new feature in the PMU, patches from those vendors usually follow. It has lots of features: some are counters and sampling and all that related stuff, and it also has other tracing features, software tracing features. I also hear that people are actually thinking about using perf for power management and even for RAS. I don't know where this is going, but there are some serious conversations happening about that.

So these are random thoughts that I had when we started working on this. These are desired properties, and interestingly enough, I don't have "working" as a desired property, so that's just a side effect. First of all, as I said, we have VPMU, which is PMU support for HVM guests. Given that it appeared that the functionality for supporting paravirtual guests and the hypervisor would be similar,
it would probably make sense to try to reuse as much as possible of VPMU to support any type of guest and the hypervisor itself. Second, I really wanted, at least initially, to confine the Linux-side changes, at least on the kernel side, to arch/x86/xen. I didn't want to go outside that directory, for a variety of reasons. On the other hand, as the thinking progressed, I could see that there would be places where we would need to make changes outside that directory. So that implies staged development: the first stage would be that we live in arch/x86/xen, and then we try to go out. Also, even though we are targeting perf, I really didn't want to preclude other tools from using it, like OProfile, and we were just talking about VTune or something. Presumably you would still have to make some changes to hook it up, but they should be fairly minimal.

So I don't know whether I should call these high-level design features; again, these are random things that I remember about what happened. First of all, we manage the PMU hardware in VPMU, as for HVM. What this implies is that there are two things that VPMU really manages. It manages the state of the VPMU: whether it's running, stopped, saved, loaded; there are five or six states. The second is the VPMU context, which is really the register state. So when a new guest gets on the CPU, all the registers are loaded into the hardware and things continue on their way.

The way it works is pretty simple; there's no magic about it. When the tool runs, it loads the registers with whatever values it wants, and then at some point, usually when the register hits zero or overflows, an interrupt occurs. Xen runs a handler, and it does two things.
Well, it does more things, but the ones I mention here are two. One, it passes the instruction pointer, showing where this interrupt happened, to the guest, to the PV guest I guess, dom0 or any other domain. The second thing, and you don't have to do it, but this is an optimization that seemed to make sense: it passes the register part of the VPMU context to Linux. The reason is that when the perf handler runs, it makes a fair number of accesses to the registers, and if we were accessing those registers from dom0 or from a PV guest, every MSR access is a trap. It didn't seem like there was a reason to do this when we can just pass all those registers to dom0 and have it go over the registers there. So I call it an emulated PMU, but "emulated" is probably too big a word; it's really just an array of registers. That's what the perf handler will use.

Oh, and I should mention that when the interrupt happens, the Xen handler stops the PMU completely, so it stops the other counters as well. That's done mostly for simplicity, because you don't want another counter to fire while you're processing the current interrupt, and to me it seems like it won't make much difference in terms of getting accurate results. So when the perf handler is done processing the interrupt, so it reads the values of the counters, writes out the data that it wants, it does the hypercall. There's a new hypercall.
It does a hypercall down to Xen, Xen reloads the counters and continues running, and this is repeated many times.

Another thing worth mentioning is that there are two modes of running perf; yes, this is probably perf specific. One is your regular vanilla perf, when you're running it in a domain: if it's a PV domain, it measures its own performance; if it's dom0, it measures its own performance plus the hypervisor. There's also a global profiling mode, which I took from Xenoprof, which has a similar feature, and perf kvm actually has a similar feature too, so I figured why not have it for Xen as well. In that mode, perf running in dom0 profiles everybody: itself, the hypervisor, and all the guests. And in that mode the PMU is actually disabled in the guests.

Another thing perhaps worth mentioning is that the interrupts taken by Xen on counter overflow can be either an APIC interrupt, which is what we have now for VPMU, or an NMI, which is more useful when you want to profile the hypervisor. In the hypervisor there are obviously pieces of code that run with interrupts disabled, and if you want to profile those you want to run in NMI mode.

Of course, because it's stage one, it comes with limitations. First of all, if you want to get accurate results you should really pin the VCPUs. The reason is mostly that, as I said, I didn't want to modify non-Xen parts of the code, and at this point there is no place for us to stick the PCPU value in the perf data structure. So for now I figured, well, we'll pin it and we'll assume that we know where the VCPU is. The second one is where the booing should probably start.
You can only profile the hypervisor on the CPUs where dom0 is running. So if you have a 32-core system and your dom0 runs on the first four cores, cores four through 31 will not get profiled in the hypervisor. The others are not too big limitations. One is that the sampling features like PEBS are not supported, and that's partly because they're not supported in VPMU anyway, so that's a new feature. And currently there is no backtrace support for the hypervisor. You can still get backtraces for the guests, dom0 or not dom0, it doesn't matter, but for the hypervisor, well, you actually will get the backtrace, but it will probably be incorrect.

On status: we went through a couple of versions on the mailing list for the Xen patches, and I was working on v3. I was hoping to get it in for 4.4, and then I discovered some unpleasant things. So the slide says 4.4 is unlikely, and the translation is "hell no": it's not going to happen for 4.4.

There are really three sets of patches for this whole thing. One is the Xen patches, and this is probably the main part of the patch set. Then there is a bunch of patches for Linux, and they are driven by the Xen patches, because that's what defines the interfaces and all those things. And finally there are changes to the perf userland part, mostly to make it understand symbol tables, to make it know where to look for Xen symbols and all that stuff. That's obviously not Xen specific; that's in generic code.

So, future enhancements. You can call them enhancements, or you can call them really making things work properly: address the limitations from stage one. The main one is the pinning requirement.
I think it's fairly easy. There is actually a reserved field in the sampling data, which presumably is where we can stuff this information, both the domain ID and the PCPU ID. In fact, even the current patches already pass this information to Linux; it's just that Linux doesn't know what to do with it or where to put it. The second one, supporting the full number of PCPUs, will be more involved. I'll just leave it at that; it will require changes to the Linux perf part. I have a couple of ideas, but I haven't tried them out yet.

We need to start supporting PEBS and the sampling features. Again, this is for when you want to know exactly which instruction is causing you a problem; then you really want to have the sampling mechanisms. If you are trying to find the hot lock, for example, and your lock routines are inlined, you will most likely not get the right pointer. You will get a pointer to something that says add AX plus one, and you'll say, well, why is this a hot instruction?

Another thing I was thinking about: perf supports tracing, there are tons of software tracepoints in the kernel, and we have xentrace. So it may be possible to convert those trace points into something that perf will understand. The good thing about having perf is that it has some statistical analysis tools in it that are pretty nice if you have traces.

Oh, and I didn't put backtrace support here; that should be fairly easy. There's not much magic to that, I think.

So I was thinking, should I run a demo or should I not run a demo? Every time I run a demo I crash and it's awkward.
So I just took pictures. If you use perf, this is the picture that you know very well. You will notice that there is a bunch of samples, and when it says "xen", that's a sample taken in the hypervisor. It's a pretty idle system, so there's nothing really interesting here.

And this is the global mode. We have a guest running there, and I think there are five samples for the guest. I don't have any userland samples for the guest; they're all in the kernel. Oh, and you'll see that I actually use perf kvm record and report. It's not by accident, it's not a typo; it is actually the command that I used. Presumably when I clean up the patch it will be called perf xen.

And you will probably see that I still have bugs, but I actually know where this bug is coming from. You'll see that guest zero has an address in the hypervisor, so that's clearly wrong. That happens because this sample was probably taken during scheduling: the guest was marked as running, but it hadn't actually started executing yet. So that's why there is a hypervisor address.

So that's all, folks. Questions?

First, I'm happy to see that perf can profile Xen itself. Actually, when I did performance optimization for my project, I had a lot of pain profiling Xen. One question here: in this diagram I didn't see the capability of a backtrace. So for example...

You mean the backtrace?

Yes, the backtrace from dom0 to the...

So as I said, I didn't actually run it with... Oh, it doesn't click. Now I will run the demo, because I can crash now. So for example, I can run perf top and you can see a few things there. If you want to see a backtrace, I can actually run that. So if I say perf... yeah, it was -g. What's that? Top... No, it doesn't work now.
Hi. So we do something... So do you have any... Okay, yes. So this is most likely the wrong stack, because I read the stack from the kernel, and the stack in Xen is probably gone by now. So really all that needs to happen is that at the time of the interrupt you pass the RIP to the guest and you pass the stack trace as well.

Yeah, if the result is accurate and reliable, then that is very helpful, because we can find what... the kernel, maybe from dom0, you could optimize domain 0, for example to reduce the hypercalls, so in this way we can improve performance.

Yes. And you actually know that you have a problem. It's not that you think you have a problem; you know that you have a problem. Yes, please.

Thank you. It's just a simple question. I think you said the future work is about enabling PEBS. I just want to say something about PEBS, because PEBS cannot work if the event happens during VM exit or VM entry. So if you really want to enable PEBS, maybe keep in mind that it cannot work due to the hardware limitation.

Well, talk to your hardware...

Yeah, if you wish I can give you more details offline about why.

But we should be able to at least use it in kernel and user space. Even if it cannot do sampling for the hypervisor, you can still do it for the guest.

I think so, as long as you can make sure, for example, that when you are turning on PEBS there's no VM entry or VM exit. For example, if you focus only on the PV part, maybe it's okay.

Okay, any other questions?

You mentioned it's unlikely you'll get this fully working for 4.4, but you are actively working on it.
So what do you expect to finish by, and how...

You know, I've learned enough not to answer these types of questions. I have one outstanding bug that I need to figure out how to fix, and then I need to clean up the code and do these sorts of things and go through another round on the list. I don't expect to have any issues on the Linux kernel side, and again, that's what I don't expect. There may be some questions about the perf userland part, although I think, and I don't think I ever actually tried it, that vanilla perf will still work. But it won't work for the hypervisor, because it doesn't know where to look for symbols. Other questions?

Are your patches available in open source already?

Well, I posted v2 for Xen and v1 for Linux, but v1 for Linux will not work with v2 for Xen. So if you're really adventurous you can pull the v1s for both Linux and Xen and see, or just email me and I'll...

Okay. Thank you.

You'll regret that. Other questions?