Hello everybody, I am Atish. I work for Western Digital System Software Research, and today I am going to talk about perf on RISC-V: how perf is implemented on RISC-V, what exists today, what we are working on, and what the future holds for running perf on RISC-V.
Here is a brief overview of today's talk. First, we will start with a brief introduction to perf, then move to the RISC-V specifications that perf requires and the improvements that have happened recently in those specifications. Second, we move to the implementation part: what the status of the implementation of those specifications is. At the end, I will show you a live demo running in QEMU.
Now, what exactly is perf? If this had been an actual physical presentation with some of you present at the venue, I would have asked this question to get a feel for the room: how many of you have actually used perf? Since perf is such a generic tool, I am assuming most of you already know what perf is. For those who have not used it, it is the official Linux profiler tool, used to instrument any kind of CPU performance counter, tracepoints, kprobes, or uprobes. We use it to analyze a workload: where it is sluggish, where the pitfalls are, where it is performing badly. To figure out those pain points, we use profiler tools such as perf.
This talk is not about how perf works in Linux. There are many good tutorials, some of them by Brendan Gregg and some by others. You can find them on the internet by googling, and they go into lengthier detail about what exactly perf is and how you can use it. If you want, you can go ahead and do that. This talk is about how perf is implemented on RISC-V.
So, we are talking about the performance counters in RISC-V and how perf leverages them, so that we can have a feature-complete perf on RISC-V that is on par with any other architecture.
Before diving into the details of the specifications and the present status of development, let us look at what exists today. In upstream Linux, you might have seen that perf support for RISC-V is already available, so you may ask what I am going to talk about if the support is already there. To clarify, the perf support that is there is very minimal and not very useful. It has been upstream for the last three years, but in all those three years it was not improved, for reasons that I will explain later. The support that exists today can only monitor the cycle and instruction counts, which are always running, so you cannot start or stop them. Neither can you use any of the programmable counters, and you can't sample any events because there is no support for an overflow interrupt.
Now, let us look at the specifications that we need. As I said on the previous slide, the reason all the features of perf were not supported is that there were missing pieces in the specification. Before discussing the missing pieces, let us try to understand what exactly is in the specification as of now. The current privileged specification specifies three fixed counters, which include the cycle and instruction counts, and then 29 programmable counters. There are 29 event selector registers corresponding to those 29 programmable counters, and the event selector value tells which performance event is going to be monitored using that counter. The problematic part is that these counters are only readable or writable from machine mode. RISC-V machine mode is the highest privilege mode.
On top of that, there is supervisor mode, where your rich operating system such as Linux runs. Supervisor mode can read the values of these registers, but via different CSRs known as hpmcounterX, which hold a shadow copy of these counters and are read-only.
Once we have the counters, we need a way to start and stop them, right? That's why we need the mcountinhibit CSR. The mcountinhibit CSR has a bit corresponding to each of those counters; setting a bit stops the counter and clearing it lets the counter run. The accessibility of the counters is controlled by mcounteren: if the bit is set, supervisor mode can access the counter, but if the bit is clear, supervisor mode cannot access it. There is a corresponding scounteren register as well, which controls accessibility from user mode and can be accessed by supervisor mode. Again, as the m prefix suggests, mcounteren and mcountinhibit are only readable and writable from machine mode.
So what exactly is missing to have full perf support? First, supervisor mode needs to access the event and counter registers. It should be able to start and stop the counters, and it needs to enable and disable the access bits as well. The simple solution would be to go ahead and make those CSRs readable and writable from supervisor mode. But RISC-V is an evolving specification that has to stay backward compatible with the released specification, so it's not that simple; we cannot just go ahead and make an incompatible change. That's why the solution that was adopted is an SBI PMU extension, which allows supervisor mode to perform these operations when accessing the counters.
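The inhibit and enable bits described above can be modelled in a few lines. This is only an illustrative sketch in Python, not kernel or firmware code: the function names are made up for this example, and the CSR values are plain integers rather than real register reads.

```python
# Bit positions shared by mcountinhibit, mcounteren and scounteren.
CY_BIT = 0          # cycle counter
IR_BIT = 2          # instruction-retired counter
FIRST_HPM_BIT = 3   # programmable counters use bits 3..31


def counter_is_running(mcountinhibit: int, idx: int) -> bool:
    """A counter runs only while its inhibit bit is clear."""
    return (mcountinhibit >> idx) & 1 == 0


def smode_can_read(mcounteren: int, idx: int) -> bool:
    """Supervisor mode may read the shadow counter only if the bit is set."""
    return (mcounteren >> idx) & 1 == 1


def umode_can_read(mcounteren: int, scounteren: int, idx: int) -> bool:
    """User mode needs the bit set in both mcounteren and scounteren."""
    return smode_can_read(mcounteren, idx) and (scounteren >> idx) & 1 == 1


# Hypothetical register values: stop the first programmable counter,
# and expose only the cycle counter to supervisor mode.
mcountinhibit = 1 << FIRST_HPM_BIT
mcounteren = 1 << CY_BIT
```

With these values, `counter_is_running(mcountinhibit, 3)` is false while the cycle counter keeps running, and supervisor mode can read the cycle counter but not the instruction counter. None of these registers can be changed from supervisor mode, which is the gap the SBI PMU extension fills.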
Now, the SBI is basically an interface between supervisor mode and machine mode. For any privileged operation that supervisor mode cannot do, SBI allows supervisor mode to have that function executed in M-mode on behalf of S-mode. That's why an SBI PMU extension was proposed, and it's part of the official SBI specification now, which enables supervisor mode to have these features in perf.
But the SBI extension alone is not sufficient, because there is other support we need from the privileged spec itself. To support counter overflow, we do not have an interrupt number, so we cannot do anything about it; we need an interrupt number from the privileged specification for counter overflow. Once a counter overflow happens, the supervisor mode interrupt handler gets called. In that interrupt handler, it needs to know which of the counters actually overflowed; otherwise it doesn't know how to compute the overflow value for those counters. There is also no mechanism for event filtering. For all these features we need an ISA extension, which was proposed and is on its way to being frozen, known as Sscofpmf. We'll go over each of these extensions one by one and understand what exactly they do to address these issues.
First, let's take a look at the ISA improvements, starting with the Sscofpmf extension. I know the abbreviation Sscofpmf is really tough, at least for my tongue, but there is a reason for it; it is not abbreviated randomly. The Ss stands for a privileged architecture, supervisor-level extension, cof stands for counter overflow, and pmf stands for privilege mode filtering. Since RISC-V is extensible and there are so many extensions, we need properly qualified names for them so that in future, just by reading an extension name, we are able to understand what exactly the extension means.
Now, this extension is very close to being frozen. It's currently under public review; it went to public review, I think, a week or two back, and there's a 45-day period, after which it can be frozen. It's available at this link and also in the isa-dev Google group, so you can take a look at it. If you have any feedback, please give it to us. This is the phase where we collect feedback from the wider audience and make changes according to it.
The beauty of this extension is that it is simple. It does not add a lot of new CSRs, which would require a lot of modifications to the privileged spec. It adds only one generic CSR, and apart from that it uses the upper 8 bits of the mhpmevent CSRs for overflow and filtering. On RV32, mhpmevent doesn't have an upper 32 bits; only mhpmevent is defined and there is no mhpmeventh. That's why the Sscofpmf extension also defines 29 additional CSRs just for RV32, mhpmevent3h to mhpmevent31h.
How exactly does it handle counter overflow? As I said, the upper 8 bits of mhpmevent are used. Out of those 8 bits, the most significant bit, bit 63, is defined as the overflow bit. Whenever the counter overflows, this bit is set and an interrupt is generated. Since interrupt generation and counter overflow are related, the bit also acts as an interrupt enable and disable bit: hardware should only generate the interrupt when this bit is clear. If you want an event not to generate an interrupt, you need to keep this bit set so that the interrupt will not be generated.
Now we know how to generate the interrupt, but we need an interrupt number. That's where a new bit in all the interrupt pending and enable registers comes into the picture: bit 13 is assigned for counter overflow.
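The double role of the overflow bit is easier to see in a small model. The following Python sketch is an illustration of the behaviour just described, not hardware or driver code; the function and constant names are invented for this example, and the event selector value is arbitrary.

```python
OF_BIT = 63            # overflow bit in the event selector register
OVERFLOW_IRQ_BIT = 13  # counter-overflow bit in the pending/enable registers


def on_counter_overflow(mhpmevent: int, pending: int):
    """Model what happens when the counter for this event wraps around.

    Returns the updated (mhpmevent, pending) pair.
    """
    if (mhpmevent >> OF_BIT) & 1:
        # OF is already set: the overflow is not reported again, so the
        # bit effectively disables the interrupt for this counter.
        return mhpmevent, pending
    # OF was clear: record the overflow and raise the interrupt.
    return mhpmevent | (1 << OF_BIT), pending | (1 << OVERFLOW_IRQ_BIT)


# First overflow: OF gets set and bit 13 becomes pending.
event, pending = on_counter_overflow(0x5, 0)

# A second overflow before software clears OF raises nothing new.
event_again, pending_again = on_counter_overflow(event, 0)
```

In this model, software that wants further overflow interrupts has to clear the overflow bit in its handler, and software that never wants an interrupt for an event leaves the bit set from the start.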
If the hardware platform doesn't support Sscofpmf, the counter overflow interrupt bit must always be hardwired to zero. For the virtualization case, the Advanced Interrupt Architecture specification handles the overflow interrupts.
The one new CSR that I mentioned previously is scountovf. In the interrupt handler, supervisor mode software needs to know which of the counters overflowed. So scountovf is a 32-bit register that contains shadow copies of the OF bits, that is, the most significant bit of each mhpmevent register, so that software knows that this event overflowed or that event overflowed. Remember that even though it's 32 bits wide, the valid OF bits are only defined for mhpmevent3 to mhpmevent31, so only 29 bits are valid.
The obvious question then is what happens to cycle and instruction, because the fixed counters correspond to bits 0, 1, and 2 and you don't have an overflow bit for them. Because that is not supported in Sscofpmf, what we can do, if Sscofpmf is supported, is use the programmable counters to monitor cycle and instruction as well. They will then be treated like any other event rather than as fixed events. Similarly, the corresponding mcounteren bit needs to be set to have access to these bits in scountovf.
The last part is event filtering, which is not too complicated. The extension defines a bit for each privilege mode, and if that bit is set, counting of events is inhibited while the CPU is executing in that privilege mode.
That's all for the RISC-V ISA PMU extension. Now, let's take a look at the SBI PMU extension. It provides an interface for supervisor mode software to discover the details of the hardware counters, to configure them, and to start and stop them.
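Before going further into the SBI side, here is a small sketch of the two ISA-level pieces just described: decoding the overflow status register in an interrupt handler, and checking the privilege mode filter bits. It is written in Python purely for illustration. The helper names are invented, and the exact filter bit positions are taken from my reading of the Sscofpmf specification rather than from the talk, so they should be checked against the specification.

```python
def overflowed_counters(scountovf: int) -> list:
    """Return the indices of programmable counters whose OF bit is set.

    Only bits 3..31 are valid; bits 0..2 belong to the fixed counters,
    which have no overflow bit.
    """
    return [idx for idx in range(3, 32) if (scountovf >> idx) & 1]


# Assumed inhibit bit positions in the upper bits of the event selector.
INHIBIT_BIT = {"M": 62, "S": 61, "U": 60, "VS": 59, "VU": 58}


def counts_in_mode(mhpmevent: int, mode: str) -> bool:
    """An event is counted in a mode only if that mode's inhibit bit is clear."""
    return (mhpmevent >> INHIBIT_BIT[mode]) & 1 == 0


# Hypothetical values: counters 3 and 31 overflowed (bit 0 is ignored),
# and an event that must not count while executing in machine mode.
pending = overflowed_counters((1 << 3) | (1 << 31) | (1 << 0))
user_and_kernel_only = (1 << INHIBIT_BIT["M"]) | 0x5
```

Here `pending` contains counters 3 and 31, and `counts_in_mode(user_and_kernel_only, "M")` is false while the same check for supervisor or user mode is true.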
In addition to counter discovery and control, the SBI PMU extension provides a set of firmware counters, so that while running your workload you can analyze the performance of the firmware as well. If, for some reason, the firmware is performing poorly, these events can be used. It also gives you access to all the raw events that are defined by the microarchitecture, and exposes them through event IDs. One good thing is that supervisor mode is abstracted away from the hardware counter IDs. It works on logical counter IDs, which are returned to supervisor mode software upon request by the M-mode implementation, the M-mode runtime firmware. The specification is already part of SBI v0.3 and is already frozen. You can find the specification at this link.
Taking a deeper dive into the PMU extension, it has different kinds of events. Three of them are hardware events. The first two are general events and cache events, which map one-to-one to the events defined by Linux perf. Then there are hardware raw events, where any CPU can define the microarchitectural events that it wants; by specifying a raw event, the PMU framework knows that it's a raw event that is going to be monitored by the CPU.
There are also firmware events. SBI is a big component in RISC-V: depending on what platform you are using, you may have to rely on SBI for IPIs and timers, and also for suspend. Also, in RISC-V the hardware is not required to handle misaligned loads or stores, so a misaligned load or store can trap and then be transparently handled in OpenSBI, which is the default M-mode firmware in RISC-V. All those trap events can also be monitored using perf now.
Looking at the functions, the first function gives you the number of counters that the hardware supports.
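The event kinds listed above are told apart by a type field inside the event index that supervisor software hands to the firmware. The Python sketch below shows that packing as I understand it from the SBI specification: a 20-bit index with the type in the upper 4 bits and an event code in the lower 16. The numeric type values and the helper names are assumptions for illustration and should be checked against the specification.

```python
# Assumed event type values (upper 4 bits of the 20-bit event index).
EVENT_TYPE_HW_GENERAL = 0x0
EVENT_TYPE_HW_CACHE = 0x1
EVENT_TYPE_HW_RAW = 0x2
EVENT_TYPE_FIRMWARE = 0xF


def make_event_idx(event_type: int, code: int) -> int:
    """Pack an event type and a 16-bit event code into one event index."""
    return ((event_type & 0xF) << 16) | (code & 0xFFFF)


def split_event_idx(event_idx: int):
    """Unpack an event index into (type, code)."""
    return (event_idx >> 16) & 0xF, event_idx & 0xFFFF


def is_firmware_event(event_idx: int) -> bool:
    """Firmware events are counted by the M-mode firmware, not by a CSR."""
    return split_event_idx(event_idx)[0] == EVENT_TYPE_FIRMWARE


# Hypothetical codes, used only to exercise the helpers.
hw_idx = make_event_idx(EVENT_TYPE_HW_GENERAL, 1)
fw_idx = make_event_idx(EVENT_TYPE_FIRMWARE, 0)
```

Keeping the type in the index is what lets one set of SBI calls serve hardware counters and firmware counters alike, with supervisor software only ever dealing in logical counter IDs.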
The second function gives you detailed information about those counters, such as the width of each counter and, remembering that these are logical counters, the CSR number for that counter. The third function gives you a way of finding a proper counter for a specific event. It gets called when perf event init is called, so config matching is only called when you are actually running perf; the other two are called during boot. During config matching, the M-mode firmware figures out which counters are valid for this event and then tries to match one of them against the counter index base and mask passed from supervisor mode.
Once it has found a counter ID for the event, supervisor mode can start and stop the counter using these functions. The flags represent different contexts for the start and stop. Let's say you want to reset the counter during the stop rather than just stopping it; that flag can be used. Likewise, whether or not to use the initial value passed in the fourth parameter is controlled by the start flags.
Remember, to read a hardware counter we have the counter info that was saved, so we can read the hpmcounter CSR directly using that CSR number. But for firmware counters, we need to make an additional call, passing the logical counter ID, to get the firmware counter value.
That's all from the specification side. We have covered all the specification changes that have gone in recently to reach feature parity with all other architectures.
Now, let's look at the implementation: what exactly is there and what is under development. First, take a look at how perf works. Perf is a user space tool, so you run it from user space.
When you run perf, it internally makes a system call to get into the kernel. The kernel creates a perf event object, and the type of that perf event object depends on what kind of event you are monitoring. If you are monitoring tracepoints or software events, it will create those types. Software events are things like context switches and page faults; trace events are ftrace, kprobes, or uprobes. In this case we are mostly dealing with hardware events, because we are dealing with the performance counters of the CPUs, so it will create a hardware perf event backed by a hardware PMU, which can monitor all the CPU-related hardware events. For the firmware events that I described, we still continue to use this type, because we do not have another option such as generic firmware counters. So we treat firmware events as raw events. Once data is available, it is written into a ring buffer that is mmapped at the beginning, and the user space tool keeps reading that ring buffer to give you the data you need.
Now, for the Linux perf implementation: as I explained in the beginning, the existing support was very basic and is implemented under arch/riscv. You can't really extend it, and you can't use any programmable counters. We have now moved the implementation to drivers/perf, and it is a completely new driver. It is a platform driver that is probed based on DT parameters, and it implements the complete SBI PMU extension and the Sscofpmf extension. The overflow interrupt is supported through DT discovery. The good thing about this new driver is that all the core perf functionality is abstracted into a core library and only the SBI-related functions are in the SBI-specific driver.
Now, tomorrow, if there is an alternative to SBI for RISC-V, since RISC-V's main mantra is to be extensible, we can easily extend this so that a new driver can use a different method but continue to use the basic core perf functionality for RISC-V. Similarly, if there is an interconnect that has a PMU and you want to use its counters, you continue to use the core perf library and only implement the interconnect-related functionality in the interconnect PMU driver, or any other PMU driver.
This also gives us a way to support the legacy driver. The legacy driver has been rewritten. When I say legacy, I mean a platform that is running older firmware where the SBI PMU extension is not available, or older hardware where even mcountinhibit is not supported. In that case the SBI PMU extension will not be available, so it falls back to the legacy driver, which only provides what currently exists: instruction and cycle counts. The patches are already available on the mailing list. Please go and review them, and if you have any comments, please let me know. You can ask your questions on the mailing list as well.
The second part of the implementation is OpenSBI. OpenSBI is the default, most commonly used open source SBI implementation in RISC-V. It implements all the extensions that are required for full PMU and perf support. The SBI PMU extension support was already merged a couple of months back, and for Sscofpmf the patches are already available. It supports all the firmware counters that are defined by the PMU extension, and the good thing is that it dynamically detects all the necessary CSRs, so that it always stays backward compatible with older platforms.
If there is an older platform that doesn't support mcountinhibit, OpenSBI simply disables the SBI PMU extension, because there is no point in having the SBI PMU extension without a provision to start and stop the counters. Similarly, the Sscofpmf extension is dynamically detected, and so are the number of HPM counters and the width of those counters.
All the platform needs to define is a mapping between events and counters. It needs to define which counters are used to monitor which events, and what the event selector values are, both for the generic events and for the raw events. The ISA doesn't define any of the raw events, and the SBI PMU extension only defines the generic and hardware cache events, so the platform needs to provide that mapping. There are different ways to provide it. One is the DT bindings. Since this is specific to OpenSBI, these DT bindings are also specific to OpenSBI. The platform can just encode the mapping in these DT bindings, or if it wants, it can implement its own code and use the hooks that are already available to feed the mapping to the generic SBI code. The generic firmware, which is meant to boot all the platforms in RISC-V, relies on the DT bindings by default. If a platform doesn't want the DT bindings, it is always welcome to use the hooks, but I would highly recommend using the DT bindings to make your life simpler.
The last part of the puzzle is the QEMU implementation. There is no existing hardware that implements Sscofpmf, because the extension was proposed recently and is only now being frozen, so we rely on QEMU to verify our implementation. As the specification was limited in the past, the implementation in QEMU was also limited. There was very little support for counter accessibility: the counters were only readable, and there was no write support.
The counters were always running, and only the cycle and instruction counts were there. Obviously there was no mcountinhibit or Sscofpmf, and even the number of programmable counters was fixed. The current implementation changes all of that. First, it makes the number of programmable counters configurable, so you can configure how many programmable counters you want. By default it is set to 16, but you can always pass it as a CPU feature. Sscofpmf is likewise a CPU feature, so depending on what kind of machine you are using, you can enable or disable it. We need the mhpmcounter registers to be readable and writable, so it enables all of that. It adds additional perf events such as ITLB and DTLB load and store misses, which are a great way to verify the perf implementation. Since it also implements Sscofpmf, if the CPU supports Sscofpmf it enables the overflow interrupt and filtering as well. The patches are again available on the QEMU mailing list. You can take a look, and if you have any comments you can either email me or reply on the mailing list.
That's all I had today for the talk. Oh, I have the future work. How could I forget what's in the future? For the future work, the first goal is obviously to upstream all these implementations in their respective projects. Once that's done, we need to support virtualization, so we need to add PMU support in KVM. The next one is a proposal, and we don't know whether it will be accepted or not: just as perf has generic event types for software, hardware, and tracepoints, we could add a firmware event type as well, so that we don't have to use raw events for the firmware counters. We obviously need to verify all of this on real hardware. In future, we could support perf sampling for firmware counters as well. Another thing, as I mentioned, is that the cycle and instruction count events are not supported in the current counter overflow extension.
The cycle and instruction counters could be assigned dedicated overflow bits so that we don't have to use programmable counters for them. And that's it. Thank you for listening. We'll now switch to demo mode, where I'll show a couple of commands demonstrating how perf is used with this kernel.
I've already booted a Fedora instance here on the latest kernel, 5.14. If I scroll up, you can see the command line says that sscofpmf is set to true, which means we have enabled the Sscofpmf feature. If you look at the OpenSBI boot log, it also says that mcountinhibit is supported and Sscofpmf is supported. If you disable this, it will not show Sscofpmf, and the counter overflow interrupt will not work.
Now, let's run some commands. First, we'll start with the basic command that already used to work. This is a perf stat; let me zoom in a little bit. It's a perf stat, and we are trying to get a count of the number of cycles and the number of instructions during hackbench, which is a standard benchmark for the scheduler. It takes a couple of seconds to run, and then you see how many cycles and instructions were spent while running the benchmark. You'll see that instructions per cycle is one, because in QEMU we use the same host ticks for both the cycle and instruction counts.
Now, let's run a bigger command where multiple events are monitored at the same time. The first three are firmware events, which are defined as raw events but have the most significant bit set, which indicates a firmware event. Anything that starts with r is a raw event, and the next two don't have the most significant bit set, so they are hardware raw events. Then there is branch misses, then some other QEMU events that were added recently, and then the standard cycle and instruction.
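The distinction between the two kinds of raw events in that command comes down to a single bit. As a rough illustration in Python, assuming a 64-bit raw event value written in perf's `r<hex>` form, the check looks like this. The function names and the example event numbers are made up; they are not the values used in the demo.

```python
FIRMWARE_FLAG_BIT = 63  # most significant bit of the 64-bit raw event value


def parse_perf_raw(spec: str) -> int:
    """Turn a perf raw event string such as 'r1a' into its integer value."""
    if not spec.startswith("r"):
        raise ValueError("not a raw event: " + spec)
    return int(spec[1:], 16)


def classify_raw_event(config: int) -> str:
    """Raw events with the top bit set are firmware events."""
    if (config >> FIRMWARE_FLAG_BIT) & 1:
        return "firmware"
    return "hardware raw"


# Hypothetical event strings, chosen only to show both cases.
kinds = [classify_raw_event(parse_perf_raw(s))
         for s in ("r8000000000000005", "r0000000000000002")]
```

Here `kinds` is `["firmware", "hardware raw"]`. A hardware raw event still needs a platform-provided mapping to a counter before it can report a value.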
Let me run it and explain. Since we support firmware counters, it shows the firmware counter values. The raw events are just for illustration purposes. I have added this raw event to the DT binding in QEMU, so it is supported, but internally it doesn't actually count anything, so it's zero. This other raw event is not supported, which is why you see "not counted". In the same way, since the branch misses event is not given in the DT binding mapping, it shows "not counted", but all other events that are supported show a valid value.
That was perf stat. Now, let's switch to perf record, which is a sampling command. We say that we want to collect a sample every 10,000 cycles and instructions while running hackbench. It again takes a few seconds to collect the events and sample them. Now that we have collected the data, you can do perf report. You can see the number of samples that were collected for cycles and for instructions. If you look at where exactly hackbench is spending its time, you now know that, across all these samples, hackbench is spending this much percentage in this function or this kernel symbol. If you go back and look at the instruction event, you will see similar data, because again both are measuring the same value.
Now, let's take a look at another event, which is the TLB store miss event. The number of TLB store misses is lower, so I am sampling every 1,000 events. After it has run, we know that our data is collected. Now, let's do a perf report. You see that 216 samples were collected, and the function that was the top function in the other report also tops this one, so you know how many events were generated and that the CPU spends most of its time in this function.
So, that's how you can use perf stat and perf record to do event counting and event sampling. As I said, we don't support event sampling for firmware counters, so you can't do perf record for the firmware events.
That's it from my side today. Thanks a lot for listening in, wherever you are attending from, whether you are in person in Seattle or watching from home. Apologies for the coughing in between. I had a really bad flu over the last week and I'm still recovering, so I've been coughing and my voice is completely gone because of that. See, here it is again. Thanks again for attending the session. Thank you.