Hello everyone, thanks for being here and attending this talk. My name is Patrick Titiano, and this is Alexandre Bailon. We are two engineers from a French embedded Linux consulting company called BayLibre. The purpose of our talk today is a non-intrusive power and performance debugging library, or rather framework. And we believe there is no better introduction than a demo, so let's start directly with one.

So, what do you see here? It's actually a two-in-one demo. It may be difficult to see, but if you were at the technical showcase yesterday, our teammate Neil demonstrated fully upstream, hardware-accelerated video decoding: all software, all upstream, running on a Le Potato board, so an Amlogic platform. Here you simply have that video playback running. And what we add is that what you see on the screen is a real-time visualization of the CPU load as well as the CPU bus accesses: this is the CPU load of the four cores, and this is the memory traffic generated by the four cores. The refresh rate is 10 Hz, but it could be higher. It also proves that this is real hardware acceleration, because the load is less than 10% on average; with software decoding it would be much higher.

You may argue: OK, it's just another CPU load visualization tool. Yes, but it has a little magic. First, it is non-intrusive: there is no application running on the target, and we are not altering the CPU execution flow. It runs in real time, it is OS agnostic, and it is multi-architecture, which means this application can run on boards other than this one and with different workloads. We have made not a single change or modification to the code running on the target, and that is pretty awesome. The sampling rate could go much higher and we would still be able to display it without any artifact on the board.
So the purpose of our talk now is to explain... yes, thank you... to explain why we believe in this, why we did it, and how.

So what is the problem? When your job is power and performance analysis, debugging, and optimization, you need tools that meet the following requirements. You need non-intrusive access to your platform: you should not be invasive, and you should not alter the way the platform runs, otherwise you are observing something that is not exactly the truth, and that is not what you want. So you don't want to interrupt the CPU execution flow, and you also don't want to alter the power states of the various peripherals in your SoC, because that is exactly what you want to observe: when a peripheral is active and when it is not. Another thing, for us human beings, is that we really enjoy having a real-time visualization of whatever is happening in the SoC, so we need tools that do that; but combining this requirement with being non-intrusive is already a challenge. The next requirement is that we don't want this to work only on a single platform, a single SoC, a single use case: we want to be as generic as possible, supporting many different OSes on the target and on the host PC, and also many different architectures. And obviously we want all of this to be open source.

So, can anyone in the audience name a tool that fulfills all those requirements? OK, that's what I guessed. There are many reasons for that; there are technical challenges, obviously. Today we have really great tools in the Linux kernel that help us profile application execution, but they run on the target. I would say their impact on the target is limited, but it is still there, and they also go blind when the CPU does a power transition. What we ultimately want is that, even when the CPU cores are asleep, we can still see what is happening on the platform in those sleep modes.
Also, many tools share data with the host over USB, UART, the console, or Ethernet, and this alone alters the use case: it is no longer the same use case. So we don't have such a tool today.

So what do we want to improve? First, we want everything to run on the host, so that it is non-intrusive and there is no code to rebuild or add on the target: no compilation flag to switch, the target code just runs on its own. We want something modular, flexible, and scalable, so that we can write as many applications as we can think of without reinventing the wheel every time. Another thing we want is a description of what is inside the SoC in a software-readable way: some sort of reference manual, but in software. This is the only way to build generic software that supports different platforms. This last bullet is really an analogy with what the device tree brought to the Linux kernel: if you remember the time before device tree, we had to write C code to support every single board, while now, thanks to device tree, we write a single driver and it adapts to the different base addresses and other settings.

And here comes Libsoca to save the world. What are the main features of this famous Libsoca? Libsoca enables generic, non-intrusive access to any register or memory of a chip. This is done over JTAG; we'll explain that further later. Libsoca also helps abstract architectures thanks to SVD files, which we will also describe later; they are some sort of DTS file, if you want. Libsoca is OS agnostic, and by this we mean both on the target and on the host, because everything is written in Python. You can find the source code at these links and the documentation at this link.

Why JTAG? Well, it is the only solution we have today to make non-intrusive register accesses. As simple as that.
JTAG is also great because it supports hardware and software breakpoints and watchpoints, which makes it possible to develop applications that are event-based and not only polling-based. JTAG is supported by almost any board and SoC, at least in the development phase, and we can drive it thanks to a software tool called OpenOCD.

SVD stands for System View Description. These files describe the SoC. They are XML-based and typically follow the hierarchy of the SoC, so you have a description of the device, the peripherals, the registers, and the bitfields, and for each of these elements you have all the details, as if you were reading a reference manual. Here are some examples. I'm not going to spend a lot of time on them, but you have the name of the device, the name of the vendor, whether it is a 32-bit or 64-bit architecture, and many other details: that's the description of the device. Now if we look at the description of a peripheral, here an RNG, a random number generator, an IP inside the SoC, we have its base address, the number of registers, and then the descriptions of the registers follow. It is very easy because you can parse this trivially: Python has everything for you. And then you can start thinking: hey, wait, maybe I can now access my registers no longer by address with hex values, but by name. I can also read and alter a given bitfield by name rather than by address, so I don't have to add a printk trace to get the hex value of a register and dig into the documentation to understand what it means. It's direct.

Now I'll let Alex describe the architecture in a bit more detail.

First, we have the board itself. The board must obviously have JTAG for us to use it. Usually, JTAG is used for one thing: to stop the CPU, do step-by-step debugging, things like that, and to inspect memory. Here, we are not going to use it that way.
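To make the "parse by name" idea concrete, here is a minimal sketch of turning an SVD document into name-based register addresses using only the Python standard library. The XML below is a hypothetical fragment in the CMSIS-SVD layout (device → peripherals → registers), not taken from any real vendor file.

```python
# Build a {'PERIPH.REG': absolute_address} map from an SVD document.
import xml.etree.ElementTree as ET

SVD_SNIPPET = """
<device>
  <name>DEMO_SOC</name>
  <peripherals>
    <peripheral>
      <name>RNG</name>
      <baseAddress>0x48021800</baseAddress>
      <registers>
        <register><name>CR</name><addressOffset>0x0</addressOffset></register>
        <register><name>SR</name><addressOffset>0x4</addressOffset></register>
        <register><name>DR</name><addressOffset>0x8</addressOffset></register>
      </registers>
    </peripheral>
  </peripherals>
</device>
"""

def register_map(svd_xml):
    """Return {'PERIPH.REG': absolute address} from an SVD document."""
    root = ET.fromstring(svd_xml)
    regs = {}
    for periph in root.iter("peripheral"):
        pname = periph.findtext("name")
        base = int(periph.findtext("baseAddress"), 0)
        for reg in periph.iter("register"):
            rname = reg.findtext("name")
            offset = int(reg.findtext("addressOffset"), 0)
            regs[f"{pname}.{rname}"] = base + offset
    return regs

print(hex(register_map(SVD_SNIPPET)["RNG.DR"]))  # → 0x48021808
```

Real SVD files also carry access rights, reset values, and bitfield descriptions per register; a full parser would keep those around as well.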
We just want to access the memory, without stopping the CPU. So instead of connecting JTAG to the CPU access port, we explicitly ask JTAG to connect to the memory access port. That way, we are able to access registers and memory without needing to stop the CPU to perform the access.

To connect JTAG to the computer, you need a JTAG probe. We don't need a specific probe here: any probe supported by OpenOCD will work. For the demo we are using a J-Link, but actually any other probe, even an inexpensive one, would work. We use OpenOCD as a telnet server, between the JTAG probe and the target. Most of the time, OpenOCD is used as a GDB server: you run it, stop the CPU, and start doing things with the CPU. Here again, we use OpenOCD differently: we use it to connect to the memory access port, and over telnet we send basic commands to read or write memory or registers. We use a library written in Python that abstracts the telnet protocol.

On top of that, there is Libsoca. The goal of Libsoca is to abstract all of that as much as possible. We are currently using JTAG, but we don't want to be locked into JTAG: by adapting the JTAG abstraction layer, we could eventually support other mediums. We still prefer JTAG because it is the only one that is non-intrusive, but JTAG is not always available, so it may sometimes be useful to use something else. That is what this layer is for.

On top of that layer, we have the device register access layer. Its goal is to provide an easy way to access registers. I'm quite lazy as a developer, and I don't want to have to remember register offsets or do bit shifting every time; I don't want to manipulate bits by hand.
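As a sketch of the kind of exchange this lower layer has with OpenOCD's telnet interface: `mdw` and `mww` are real OpenOCD commands for 32-bit memory reads and writes, but the helper names below and the assumption that replies arrive as `"address: value"` text are illustrative, not the framework's actual code.

```python
# Build OpenOCD telnet commands and parse the textual replies.

def read_word_cmd(addr):
    """OpenOCD command reading one 32-bit word at addr."""
    return f"mdw 0x{addr:08x}"

def write_word_cmd(addr, value):
    """OpenOCD command writing one 32-bit word at addr."""
    return f"mww 0x{addr:08x} 0x{value:08x}"

def parse_mdw_reply(reply):
    """Parse an OpenOCD 'mdw' reply such as '0xc8834400: 0000000f'."""
    _, value = reply.strip().split(": ")
    return int(value, 16)

print(read_word_cmd(0xC8834400))                 # → mdw 0xc8834400
print(parse_mdw_reply("0xc8834400: 0000000f"))   # → 15
```

In the real setup these strings would be sent to OpenOCD's telnet port (4444 by default) by the Python telnet wrapper; this text-to-value conversion is precisely the processing overhead Alex mentions later.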
So what I did is build an abstraction that lets me write directly to a bitfield. If I just want to write one bit of a register, I take the name of the register, dot, the field name, and then I write the value; and the same to read a value back from the register. The layer takes care of all of that and translates everything into JTAG commands, so on top of that there is nothing else to do.

All of that relies on the SVD files. As Patrick explained, SVD files describe the SoC: they provide the list of peripherals, the list of registers, the fields. Using all of that information, we can build a tree of peripherals, registers, and fields, and automatically compute the bit offsets when we want to set a bit, and so on. All of that is possible thanks to SVD files.

On top of that, we provide subsystem and SoC files. For the subsystems, we currently have only two: clock and PMU. The goal is to provide something generic enough to be used by applications without having to deal with anything specific to one SoC or another; the goal of a subsystem is really to abstract that. Then we have the architecture files, which are where we write some kind of drivers for a specific SoC. For example, for the PMU, I wrote PMU drivers that work on both ARMv7 and ARMv8, and for a SoC I want to debug using this PMU, I just have to instantiate the ARM PMU driver. On top of that there are other layers: if you just want the CPU load, we have a perf subsystem that abstracts all the ARM details. And at the top, obviously, there are the applications we want to write.

As I said, we currently only support the clock and PMU subsystems, but we are planning to add many others, as we have many crazy ideas of things we'd like to do. As for architectures, the project is quite young; we only started it a couple of months ago.
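The register/field abstraction just described can be sketched like this. The class names and the `read32`/`write32` backend interface are illustrative guesses, not the framework's actual API; in the real stack the backend would go through OpenOCD over JTAG, and the shift/width values would come from the SVD bitfield descriptions.

```python
# Name-based bitfield access via read-modify-write on 32-bit registers.

class Field:
    def __init__(self, reg, shift, width):
        self.reg, self.shift = reg, shift
        self.mask = ((1 << width) - 1) << shift

    def write(self, value):
        # Read-modify-write so the register's other bits are preserved.
        word = self.reg.read()
        word = (word & ~self.mask) | ((value << self.shift) & self.mask)
        self.reg.write(word)

    def read(self):
        return (self.reg.read() & self.mask) >> self.shift

class Register:
    def __init__(self, backend, addr):
        self.backend, self.addr = backend, addr

    def read(self):
        return self.backend.read32(self.addr)

    def write(self, value):
        self.backend.write32(self.addr, value)

# Fake in-memory backend, just to exercise the logic without a board.
class FakeBackend:
    def __init__(self):
        self.mem = {}
    def read32(self, addr):
        return self.mem.get(addr, 0)
    def write32(self, addr, value):
        self.mem[addr] = value & 0xFFFFFFFF

backend = FakeBackend()
ctrl = Register(backend, 0x48021800)
enable = Field(ctrl, shift=2, width=1)  # hypothetical enable bit

enable.write(1)
print(hex(ctrl.read()))  # → 0x4
print(enable.read())     # → 1
```

With Python's attribute machinery the same logic can be surfaced as `periph.CTRL.ENABLE = 1`, which is the "register name, dot, field name" style the talk describes.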
There are not a lot of architectures yet: we only support the PMU for ARMv7 and ARMv8. As for SoCs, we support one Amlogic, one NXP, and one STM32; actually, those are the only ones I tried. It would be quite easy to add support for other SoCs from the same families: as everything is described as Python classes, we would just have to derive a class and inherit, customizing only the things that change from one chip to another. So it would be quite easy to add many more SoCs.

We have started to write applications; this is still in progress. You have already seen PMU Graph: it is a tool to visualize the CPU load, but the idea is more generally to have a small tool displaying one or more PMU or perf events. It supports any event exposed by the ARM PMU subsystem. It is real-time, and it is non-intrusive.

About overhead: as I said, for the CPU there is no overhead, because we don't stop the CPU. I don't know if you were able to see the screen when I showed it, but we were able to run a 4K video decode while displaying the graph, and playback stayed smooth, with no glitch on the screen; so no impact on the CPU. On the interconnect, the overhead is negligible: it is about 500 bytes per second at a 10 Hz sampling rate. Increasing the sampling rate a little does not use much more bandwidth: at 100 Hz it would be about 5 kilobytes per second, which is negligible compared to the capacity of the interconnect, or even compared to the speed of JTAG. With the default speed we use on this board, 4 MHz, we can transfer up to 500 kilobytes of data per second.

Everything is already online, on GitHub and on GitLab; we recently migrated, so now it's on GitLab.

We also wrote a simple tool to just read and write memory or registers. What is nice with this tool is, again, that we can use register names.
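A quick back-of-the-envelope check of the bandwidth figures quoted above. The per-sample size is derived from the quoted 500 B/s at 10 Hz (roughly 50 bytes per sample, i.e. a handful of 32-bit counters per core plus protocol overhead); that assumption is mine, not a figure from the talk.

```python
# Sanity-check the quoted JTAG bandwidth usage at different sample rates.

bytes_per_sample = 500 / 10   # implied by the quoted 500 B/s at 10 Hz

def sampling_bandwidth(rate_hz):
    """Bytes per second consumed on the debug link at a given rate."""
    return rate_hz * bytes_per_sample

jtag_capacity = 500_000       # ~500 KB/s at a 4 MHz TCK, as quoted

for rate in (10, 100):
    bw = sampling_bandwidth(rate)
    print(f"{rate} Hz -> {bw:.0f} B/s "
          f"({100 * bw / jtag_capacity:.2f}% of JTAG capacity)")
```

Even at 100 Hz the sampling traffic sits around 1% of what the JTAG link itself can carry, which is why the talk calls it negligible.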
So I don't have to remember the offset of the register I want to read; I just need to know its name, which is nice. And in addition to the register value itself, we can also get the bitfields of the register, so we can see the details, which is really nice and makes things easier. Going further, we are planning to add some features using watchpoints to monitor register changes: if you want to see which values a register takes as it changes, we could set up a monitor and see the value every time it changes, things like that. Again, this is also available on GitLab; still in development, but most of the basic features are there.

We are also working on a clock tool. The main feature of this tool is to display a clock summary, actually quite similar to the one we can get from the kernel through debugfs. The goal is not to replace the kernel's clock summary; it is more to do things we could not do with the kernel. For example, if the CPU goes into some low-power state, we cannot run any command on the command line anymore, so we cannot get the state of the clocks in that low-power state. With JTAG we don't have this issue: as long as JTAG is still up in the low-power state, we can read the clock registers and show the new state of the clock tree. In addition, this also works on something other than Linux: we can use it with Zephyr. Actually, the first time I used it, it was on an STM32, and I was able to get the clock summary there, which is not possible with Zephyr itself, as it doesn't have such a feature, unlike the kernel. This is still in development; it is not yet on GitLab, but it should be there soon.
One last application, or example of an application, and this is really something I can't wait to have and use: some sort of application that would show us, in one place, everything: the power states of the various peripherals inside the SoC, but also the power consumption, the CPU loads, everything. We may have such tools available today, but they are separate tools: one tool to measure power, one to trace execution, one to visualize CPU loads, one for the clock tree. We don't want that, because we need to see everything in sync to understand the causes and the consequences of the different changes. So the idea of this profiling tool, which we haven't named yet, is to integrate into a single tool the collection of SoC power states, clock states, and power measurements, using an external power measurement tool like ACME, which BayLibre is developing, and put all of that in a single window, a single application. Think of an oscilloscope application, where each of the various charts is one stream of data from the SoC. It would have the basic, I would say regular, features we all want: being able to start, stop, freeze, and resume data collection; zoom in and out in the various charts; save and reload traces; but also export to different formats. Finally, we would also like some sort of command line interface, so that we could further integrate the tool into other existing tools, for instance a continuous integration loop. Obviously, this development hasn't started yet, but we will begin as soon as possible.

And now Alex will show you how easy it can be to develop an application with Libsoca.

Yes, so here is just an example: it's the way to enable the ARM PMU counters. Everything in Libsoca starts from the device: you have a device, which is the root of all the classes we have, and in this device you have all the peripherals,
and in the peripherals you can access the registers, and then you can write to the registers. Here I don't use the bitfield, because it was more convenient to do the bit shift: since I want to dynamically enable one counter, I have to do it that way. But if I wanted to specifically enable counter zero, I would just have added .P0, which is the name of the bit for counter zero. Everything is done that way; all the drivers try to do everything like that, which makes them quite easy to read and write. And it is quite generic: in the case of the ARM PMU, there are some differences between ARMv7 and ARMv8, some registers are 32 bits and some are 64 bits, but here we don't really care: the SVD says this one is 32 bits, so we send JTAG a 32-bit read or write, and on ARMv8, where it is 64 bits, the same code detects this and sends the right size.

At a higher level, here is another example of how to use the PMU: again we get the PMU from the device, we get the counter we want to use (this one counts CPU cycles), then we enable the counter, we enable the PMU, and we can read the value from it. And another example, this one more about how to add the ARM PMU to your SoC: again you get the PMU from the device, and then we instantiate the ARM CPU load class. This class abstracts the way the CPU load is computed: we just use this counter, take the frequency of the CPU, and compute the CPU load from that. Everything is abstracted, and we just have to call get_value to get the CPU load. At an even higher level, that is what we use in the PMU Graph application: if we just want the CPU load, we get some perf events, one for each CPU, and then again we just call get_value on each CPU load event. So there are many ways to do it. Python is awesome: you can do
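The flow Alex walks through (device → PMU → counter → enable → get_value) can be sketched as follows. The class and method names are illustrative guesses modeled on the talk, not the framework's actual API; a tiny fake counter stands in for the JTAG-backed PMU, and the load formula is the one described: cycles counted over a period versus cycles available at the CPU frequency.

```python
# Sketch of deriving CPU load from a PMU cycle counter.

CPU_CYCLES = 0x11  # ARM PMU event number for CPU cycles

class FakeCounter:
    """Stands in for a JTAG-backed PMU event counter."""
    def __init__(self, event):
        self.event, self.enabled, self.value = event, False, 0

    def enable(self):
        self.enabled = True

    def read(self):
        # A real counter would be read over JTAG; here we just pretend
        # a fixed number of cycles elapsed since the last read.
        self.value += 123_456
        return self.value

class CpuLoad:
    """Cycle delta over a period, as a percentage of cycles available."""
    def __init__(self, counter, cpu_hz):
        self.counter, self.cpu_hz = counter, cpu_hz
        self.last = counter.read()

    def get_value(self, period_s):
        now = self.counter.read()
        delta, self.last = now - self.last, now
        return 100.0 * delta / (self.cpu_hz * period_s)

counter = FakeCounter(CPU_CYCLES)
counter.enable()
load = CpuLoad(counter, cpu_hz=1_200_000_000)
print(f"{load.get_value(period_s=0.1):.2f}%")  # cycles used over 100 ms
```

PMU Graph would keep one such load object per core, polling get_value at the chosen refresh rate and feeding each result into its chart.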
things in many ways: some easy, some less easy, some more generic. That was just a short example of what we can do.

While I was working on these tools, I ran into a couple of issues. First with JTAG itself: it is easy to find documentation on how to stop the CPU and debug it, but when it comes to using the memory access port to do something other than debugging the CPU, there is not a lot of documentation. The CoreSight documentation is quite complete, but there is a lot of information, too much, and again it is hard to find what you want. Another thing is that SoCs don't all have the same debug features: some features are implementation defined, so they may change from one SoC to another. As an example, PMU Graph works fine on the Amlogic SoC, but I could not use it on OMAP3, because OMAP3 does not expose the PMU on the memory bus, so I cannot access it over JTAG: to get data from the PMU, I would have to stop the CPU and use the CP15 coprocessor.

Another issue is with the SVD files: they are not always easy to find. Some SoC vendors provide them with no issue, some provide files in another format that has to be converted to SVD, which is not a big issue, and some don't provide anything; in that case we have to write one ourselves, which is quite painful and not very interesting, but if we want to use the tools, sometimes we have to do it. Also, there is no way to include one SVD from another, which becomes an issue once we start writing our own SVDs. To take an example: the SVD from a SoC vendor does not describe the ARM PMU, so if we want the PMU in the SVD, we have to add its description. Since including is not supported, I wrote a tool that merges a small SVD file describing only the PMU with the SVD we got from the SoC vendor; it merges them, and then we can use the resulting SVD and access the PMU.

I also ran into many issues with OpenOCD. It is quite a good tool, but it is not always easy to set up: sometimes the chip is supported and there is nothing to do, and sometimes you get many
warnings and errors, things like that, and there is not always a lot of documentation explaining how to fix them and make a chip work. I also had some issues moving from ARMv7 to ARMv8: I first worked on ARMv7 to build the PMU Graph tool, and when I tried to do the same thing on the Amlogic SoC, which is ARMv8, I got crashes from OpenOCD. I had to dig into the OpenOCD source code to understand why, and finally figured out I was not doing things the way it expected, but that was not documented. Also, on top of OpenOCD, I am using a Python library that controls OpenOCD over telnet, which is good, except that it requires a lot of processing to convert the text into data and values, things like that. And in the case of watchpoints or breakpoints, we receive asynchronous events, which are not correctly managed by this library. I tried to handle this by inheriting from the library, but it is still a big job, so I am planning to write a new library that will use the server API introduced quite recently in OpenOCD, instead of telnet.

So, what's next? This is where we are; we have described almost everything we have. What do we plan next? These are the many things we have in mind. They are not prioritized, so the order is not an order of priority, but this is what we could think of and what we want. First, we want to enable watchpoint and breakpoint support: this is very important for profiling applications, because we want to avoid having to poll for data; you know that when polling you may miss some transitions, and this is not really acceptable. We want to integrate with CI frameworks, to start doing regression testing: for instance, we could track some key settings and key performance indicators that would be collected by some Libsoca app and saved by the CI framework. We also want to make Libsoca re-entrant, to enable concurrent use of the library. As usual, we want to start writing
documentation for Libsoca. We want to support more SoCs, more subsystems, more IPs, and obviously also build all the craziest and smartest debugging apps we can think of.

It is time for the conclusion, so here are a few things to keep in mind when we end this talk and return home. JTAG really offers a unique solution for non-intrusive, real-time monitoring tools. It is extremely powerful, and it is actually underused today: as Alex explained, we use it today just for regular step-by-step debugging, whereas there are other purposes, like the one we just explained. Another thing, and maybe this is new to you, is that SVD files are very important to enable writing multi-architecture, generic software. Libsoca is a very innovative software framework that helps develop generic debugging apps on top of JTAG and SVD files, but also other mediums if JTAG is not available. PMU Graph is a first demo app, pretty basic, but here to demonstrate the potential, and we count on you to develop the smartest apps possible. Thank you very much. Maybe we have time for some questions, if we are not too late. Any questions?

[Question about how the tool knows the number of cores to display.] Well, it's already the case; if you can toggle back to the demo, if it's still on... yes, you have the four cores. There is some kind of discovery mechanism: from the SVD files we know how many ARM clusters and how many CPU cores you have, so the application can just get that information and create as many charts as needed. Yes, in the SVD file there are four PMU instances described, so for this SoC we instantiate four ARM PMU drivers, and then we can get the PMU for each CPU.

[Question: can we use this framework with CPUs that don't have a PMU, like a Cortex-M0, for example an STM32 L0? You said you support the F4, but what if I want to go lower?] I haven't written that support yet; the PMU there is quite different, but yes, it is something that could be done. The only thing with the counters on the Cortex-M is that they
overflow quite quickly, so we would have to poll quite often to avoid an overflow; but yes, it could use the same mechanism. Really, the idea is that it's a framework, and it gives you access to all the registers. Here we developed first code for the Cortex-A PMU; for the M0 it's a different one, so that would be new code, that's all. Another thing for the Cortex-M family: I think it would be more appropriate to use a watchpoint for that, because we could program the watchpoint to generate an event when there is an overflow on the counter, and then we don't have to poll: we could just wait for the event and increment the counter in software on the host. Thank you. Any other question? Please. Thank you very much for coming.