So before I begin, maybe a quick show of hands: who here has had any experience with hardware tracing? Okay, so pretty much a good portion of the room. And who has had any experience with software tracing, such as ftrace or LTTng? Okay, pretty much the whole room as well, so that's good.

So the title of the presentation is Bridging the Gap Between Hardware and Software Tracing. My name is Christian Babeux. When I was preparing this talk, I stumbled upon this picture. It's an ARM processor, the Cortex-A9. What I found quite interesting is that you can see all the logical blocks: the core, the floating-point unit, the NEON SIMD, the data cache, the snoop control unit and so on. And you can see the PTM, the Program Trace Macrocell. What I found really interesting is that this debug and tracing block is, I would say, about half the size of the NEON SIMD, and yet in open source software we're not really making much use of the PTM in Linux, for example. So that's what I thought was interesting.

Like I said, I work at EfficiOS. We're a consulting company and the maintainership behind the LTTng project. I have a background in embedded systems, I contribute to the LTTng projects, and I also do open source stuff in my own free time.

The content of this talk: what is hardware tracing exactly, and why should we care about it? Why is it useful for software developers? I'll do a quick presentation of the ARM CoreSight and ETM hardware infrastructure, and also of the Freescale QorIQ and Nexus tracing. And finally we'll do a bit of an overview of the LTTng project and what we plan to do with hardware tracing.

So what is hardware tracing exactly? We have blocks in the microprocessor, or hardware components, that are used to trace the instructions and data movement of a processing device. I say processing device because we could trace a processor, but also data accelerators, the system bus, and that kind of thing. What is really important is that hardware tracing permits real-time observation of a system, for example through an external trace port with sufficient bandwidth. And one of the key aspects of hardware tracing is its low intrusiveness.

I want to distinguish two types of hardware tracing. I think most of you have used what we call external trace: basically, your processing device's instruction bus and system bus feed into a tracing device, and the trace data is output on a trace port. Normally you need special hardware to use that, and most of the time proprietary software to make sense of the output. What is great with this setup is that you can accommodate a high data bandwidth with minimal impact on the system. The cons are that a trace port is not always available (on mobile devices, say, you can't really plug one in), and most of the time you need custom hardware, which is really expensive.

What I really want to focus on is self-hosted trace. You have the same system, and the instruction bus and the system bus go into the tracing device, but the trace output is either saved in an internal trace buffer or in shared system memory. What is nice with that is that the tracing is self-contained, and the facilities can be used by the host operating system itself.
And there's no need for special hardware, as in the trace port case. The cons: on most processors you have really limited internal trace buffer space. You can configure it to use shared memory instead, but that might impact system performance, because in the shared memory scenario you may be contending for the system bus.

Here's a list of the vendors offering hardware tracing support. ARM has a bunch of macrocells that SoC designers can use in their designs with tracing in mind; we'll talk a bit more about the ETM. On the PowerPC side there is the Freescale QorIQ with its Nexus tracing facilities, and we'll talk more about that later on. IBM also introduced, in the POWER8 architecture, the Branch History Rolling Buffer: basically a small buffer holding the branch history. Intel announced just last month an architecture extension, Intel Processor Trace, which is basically program flow tracing. Big kudos to Intel: they released the decoder, the documentation, the tests and all that as open source. We're still waiting for the hardware, but it's going to be pretty nice, and we're looking into it for the LTTng project. Before that, they had the Last Branch Record, which has really limited space for branch tracing information. And there's a whole bunch of other embedded devices providing hardware tracing facilities, such as MIPS. And as we saw on Wednesday, the Kalray folks also have system trace facilities.

So what is the difference between hardware tracing and software tracing? In software tracing, most of the time you need to statically instrument your code or do dynamic code patching, which can be intrusive and slow. The granularity of the information is the tracepoint: even on a Linux system, if you trace, say, the whole block I/O layer, you get a lot of data. In hardware tracing, the tracing is done by the hardware and no instrumentation is required. That means you can run programs that were never instrumented and still get information about their behavior and performance characteristics. But you also get instruction-level granularity, so you get a lot of data; sometimes you need to filter it because there's simply too much.

So why should we care as software developers, and how could hardware tracing be useful? Take the profiling use case. With hardware tracing we can get very fine-grained profiling data, versus the statistical profiler approach where, if a function call takes so many cycles nine times out of ten but takes a really long time the tenth time, that outlier gets lost in the statistical average. With hardware profiling you get really fine-grained information: this call spent exactly that many cycles. You can also do performance measurement. Code coverage is really interesting too, because you already have all the branch information: normally code coverage works by instrumenting the binary to record the branching info, but the hardware trace already contains it, so we could get really fast code coverage (there's a small sketch of the idea below). We could also think about monitoring use cases: what are the statistics on the application I'm running right now, how many interrupts am I taking, how many branches am I executing?
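To make the code coverage idea a bit more concrete, here is a minimal sketch in C. It assumes a hypothetical, already-decoded branch record format (real ETM, Nexus or Intel PT streams look nothing this clean) and fixed 4-byte instructions:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical decoded branch record; real formats differ. */
struct branch_record {
    uint64_t from;  /* address of the taken branch */
    uint64_t to;    /* branch target address       */
};

/* Between one branch target and the next taken branch, execution
 * is linear, so every instruction in that range was covered.
 * One bit per 4-byte instruction slot, relative to text_base. */
static void mark_coverage(const struct branch_record *recs, size_t n,
                          uint8_t *covered, uint64_t text_base)
{
    for (size_t i = 0; i + 1 < n; i++) {
        for (uint64_t a = recs[i].to; a <= recs[i + 1].from; a += 4) {
            uint64_t slot = (a - text_base) / 4;
            covered[slot / 8] |= 1u << (slot % 8);
        }
    }
}

int main(void)
{
    struct branch_record recs[] = {
        { 0x1008, 0x1020 },  /* branch at 0x1008 went to 0x1020 */
        { 0x1030, 0x1000 },  /* branch at 0x1030 went to 0x1000 */
    };
    uint8_t covered[16] = { 0 };

    mark_coverage(recs, 2, covered, 0x1000);
    for (int i = 0; i < 16; i++)
        printf("%02x ", covered[i]);
    printf("\n");
    return 0;
}
```

The point is that no instrumentation ran in the measured program; the coverage map falls out of trace data the hardware produced anyway.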
Some of this overlaps with the performance monitoring unit, but we could do that kind of monitoring with hardware tracing too.

One thing that is also really nice is that we can take snapshots on crashes or anomalies. Some kernel developers, I think it was at ARM, investigated kernel crashes this way: when the system rebooted, the trace buffers were still in place, so they could recover information about the crash. That works because the trace buffer keeps overwriting its data until an anomaly is detected, and then you simply read the buffer back. We can also trigger a trace from an event: say I want to start tracing after this interrupt, or on some other condition or filter. This is possible with most hardware tracing facilities. One thing I'm really excited about is hardware-assisted software tracing: instead of using the ring buffer of a software tracing solution, you leverage the hardware facilities and simply pipe your data from the software side into the hardware, so you don't pay the overhead of the software ring buffer.

I'll do a quick overview of ARM CoreSight and the Embedded Trace Macrocell. CoreSight is really a collection of hardware components whose goal is to trace and debug a whole SoC, and it's an open architecture: some manufacturers provide CoreSight-compatible IPs. In CoreSight you have trace sources. Trace sources can be processing elements such as CPUs and DSPs, but also buses, and system trace, which is generated from software. The Embedded Trace Macrocell monitors the core's internal bus, which gives you instruction and data trace. You can set up quite complex hardware filters and triggers, so you can say: filter these events out, and trigger the trace on that one. The ETM also does a bit of trace stream compression: you don't need to output a trace event for every instruction, you output one only for the branches, say, because you know the instructions in between will be executed (the sketch after this section illustrates the idea). The trace data can be saved in an internal buffer, called the ETB, or in shared system memory.

We have an overview here of, I would say, a good example of what CoreSight can do. You have multiple ARM processors and DSPs, each with their own Embedded Trace Macrocell, and all of that is funneled either to the trace port, to an on-chip buffer, or to shared system memory. It's really interesting, and you can also see the external port interfaces I was talking about earlier, which most of you have already used.

So what is the state of CoreSight in Linux right now? There is an ETM tracer implementation upstream, but it seems to work only on specific hardware. I think it was contributed back in 2009, and there hasn't been much work on it since. Recently, well, last year, some people posted a framework patch set for the CoreSight debugging infrastructure. It did not attract much attention at the time, but I think it has been revived recently by some folks at Linaro, to support a CoreSight framework within the Linux kernel. So right now, using CoreSight and ETM from the Linux kernel is kind of painful. And even if you manage to get the trace data out, you still need to decode it, and I don't think an open source trace decoder is readily available. If you want more information about that, there will be a BoF by Pawel at 4:30 today in Pentland, so go there, it will be nice.
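To illustrate the compression scheme just mentioned: an ETM-style tracer emits roughly one "atom" per branch (taken or not taken) and nothing for the instructions in between, and the decoder replays the gaps from the program image. A toy sketch, with a made-up instruction table standing in for a real disassembler:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy program image: one entry per instruction.  A real decoder
 * gets this by disassembling the ELF binary being traced. */
struct insn {
    int is_branch;    /* does this instruction branch?        */
    uint32_t target;  /* branch target (insn index), if taken */
};

/* Replay an atom stream: 'E' = branch taken, 'N' = not taken.
 * Non-branch instructions never appear in the trace; that is
 * the compression. */
static void replay(const struct insn *img, uint32_t pc, const char *atoms)
{
    while (*atoms) {
        printf("executed insn %u\n", pc);
        if (img[pc].is_branch)
            pc = (*atoms++ == 'E') ? img[pc].target : pc + 1;
        else
            pc++;  /* implicit: reconstructed, not traced */
    }
}

int main(void)
{
    /* Four instructions; insn 2 branches back to insn 0. */
    struct insn img[] = { {0, 0}, {0, 0}, {1, 0}, {0, 0} };

    /* Three atoms are enough to describe nine executed insns. */
    replay(img, 0, "EEN");
    return 0;
}
```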
Now we'll talk more about the Freescale QorIQ and the Nexus tracing infrastructure. The Freescale QorIQ is a PowerPC-based platform targeted at high-performance communication systems. It supports multiple e500mc processors, and it has what they call the Data Path Acceleration Architecture (DPAA), which is basically packet processing offloaded into accelerators. Something nice with that is that tracing events can be generated by a particular accelerator and funneled back to the software side. It also supports the Nexus debugging and tracing standard, which we'll go into in more detail.

The Nexus standard is an IEEE-ISTO standard created a while back for debugging embedded systems. It was really designed for low pin counts and a standard set of connectors, for either JTAG or trace-port debugging. A hardware device can have multiple levels of Nexus compliance. The basic level, Class 1, is debugging support only: running, stopping, setting breakpoints, inspecting memory; tracing is not supported at that level. At Class 2 we get ownership trace and program trace. Ownership tracing works like this: when the operating system schedules a task, it writes to a specific register, and this generates an ownership trace message, so you can follow which process is switched in. Program trace is basically branch tracing, the program flow. At Class 3 we get data write trace: you can monitor a certain region of memory and generate a message for each data write to that address range. And finally, at Class 4 you have multiple optional features, such as memory substitution, and the one I find quite nice, trace triggering via watchpoint: you set a watchpoint and then use it to trigger a whole set of traces.

The Nexus output format is a packet-based format. The standard defines public messages you must comply with, but vendors can define their own extensions, like what I said about the DPAA: it's a Freescale extension, and they can output vendor-specific Nexus messages for it. Messages have a fixed number of packets, and the last packets can be variable-length, so you can carry variable-length data. A message can also have an optional timestamp. It's optional because the timestamps are always 32 bits, so generating a timestamp for every message adds overhead to your tracing bandwidth; you can disable it if you want.

Here are two examples of Nexus messages (there's a small decoding sketch after this section). The first one is the ownership trace message I mentioned earlier: the operating system switched in a task with this PID value, and we see that the message carries a timestamp. The second message is really interesting: it's a resource full message. What happened is that the timestamp counter overflowed, and for a software program to do any synchronization with the timestamps, it needs to know when the counter overflowed. When the counter overflows, the hardware will either stall the messages in a queue or simply drop them, depending on the policy you have set.

So what is the state of Nexus in Linux? There is a QorIQ debug kernel module available in the Freescale Yocto layer. This module implements a debugfs interface with memory-mapped access to each core's Nexus control registers.
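As a hedged sketch of what consuming those messages looks like: the framing below is simplified (real Nexus output is a bit-packed stream of variable-length packets, not neat structs), and the TCODE values are the IEEE-ISTO 5001 ones as I recall them, so treat them as illustrative rather than authoritative.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative TCODE values; check IEEE-ISTO 5001 before use. */
enum {
    TCODE_OWNERSHIP_TRACE = 2,  /* task switch, payload = PID */
    TCODE_DATA_ACQ        = 7,  /* software-generated data    */
};

/* Simplified, pre-aligned message; real messages are packet
 * sequences with a variable-length tail. */
struct nexus_msg {
    uint8_t  tcode;
    uint8_t  has_timestamp;  /* timestamps are optional    */
    uint32_t payload;
    uint32_t timestamp;      /* 32-bit, so it can overflow */
};

static void decode(const struct nexus_msg *msgs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        switch (msgs[i].tcode) {
        case TCODE_OWNERSHIP_TRACE:
            /* Emitted when the OS wrote the incoming task's
             * PID to the process ID register. */
            printf("task switch, pid=%u", msgs[i].payload);
            break;
        case TCODE_DATA_ACQ:
            printf("data acquisition, value=%u", msgs[i].payload);
            break;
        default:
            printf("tcode=%u", msgs[i].tcode);
            break;
        }
        if (msgs[i].has_timestamp)
            printf(" @ ts=%u", msgs[i].timestamp);
        printf("\n");
    }
}

int main(void)
{
    struct nexus_msg demo[] = {
        { TCODE_OWNERSHIP_TRACE, 1, 1234, 100 },
        { TCODE_DATA_ACQ,        0,  512,   0 },
    };
    decode(demo, 2);
    return 0;
}
```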
Getting the trace is then as simple as cat-ing the trace buffer into a text file; that's the software side. Here is a simple listing of the QorIQ debug debugfs. We see each CPU, plus, say, the DPAA controller and the NPC and NXC, which are Nexus-related controllers. If we go into CPU 0, we see the registers that matter for enabling a specific Nexus tracing mode, and the DDAM register, which is write-only: writing data to it generates tracing messages.

On Nexus decoder availability: we will release, as part of the Babeltrace project, a converter and decoder for the Nexus tracing format. One thing to note is that even once we have decoded the traces, we still need proper open source software to reconstruct the whole program flow from the generated trace. Perhaps we could put that in an IDE, or perhaps perf would have the right infrastructure for it. This is ongoing work.

Now let's talk about what we are doing with LTTng and hardware tracing. A quick recap of what LTTng is: we have tracers, utilities and viewers. On the tracer side, we have both a kernel tracer and a user space tracer. The kernel tracer is implemented as kernel modules and supports Linux 2.6.38 to 3.11; the user space tracer is an in-process library that can be used to generate user space events. On the utility side, we have the lttng command line tool, which is the program responsible for interacting with the daemons, enabling events in your programs, enabling specific kernel events and so on. The session daemon is responsible for the tracing registry, and the consumer daemon is responsible for consuming the trace data. We also have a relay daemon, which is responsible for streaming tracing data to another host.

Like I mentioned, we have viewers. Babeltrace is a command line text viewer and trace converter, and that's where the Nexus conversion comes in. We also have LTTngTop, an ncurses top-like viewer where you can see in real time what is happening with your tracing data and your processes. And finally we have an Eclipse plugin that can show custom views of LTTng traces; it's highly extensible, and eventually it would be nice to support hardware tracing in that viewer.

Our initial attempt at supporting hardware tracing is to convert the Nexus format to the Common Trace Format used by the LTTng tool suite. The goal is to reuse the existing infrastructure for trace visualization and command line viewing. While doing that, we ran into some issues. Nexus traces are not self-contained: you need sideband information from the OS, such as the processor frequencies and how many processors there are. The internal trace buffer is also quite limited: on the platform I worked on we had 32 KiB, and it filled up quite fast, though we could use external DDR memory to store the trace data instead. And to synchronize with kernel and user space traces, the timestamps generated by the hardware need to be correlated with the other traces, which use the monotonic clock and the epoch time. That synchronization can be quite tricky; the sketch below shows its general shape.
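The correlation roughly amounts to this, a minimal sketch assuming the timebase frequency and the offset to the monotonic clock are supplied as sideband data by the OS (which is exactly the non-self-contained part):

```c
#include <stdint.h>
#include <stdio.h>

struct hw_clock {
    uint64_t wraps;      /* resource full messages seen so far */
    uint64_t freq_hz;    /* timebase frequency, from the OS    */
    int64_t  offset_ns;  /* hw time zero -> CLOCK_MONOTONIC    */
};

/* Call once per resource full (timestamp overflow) message. */
static void hw_clock_wrap(struct hw_clock *c)
{
    c->wraps++;
}

/* Unwrap a raw 32-bit hardware timestamp into a 64-bit cycle
 * count, then map it onto the kernel's monotonic timeline.
 * (A very long trace would need 128-bit math here.) */
static uint64_t hw_clock_to_ns(const struct hw_clock *c, uint32_t ts)
{
    uint64_t cycles = (c->wraps << 32) | ts;
    return cycles * 1000000000ull / c->freq_hz + (uint64_t)c->offset_ns;
}

int main(void)
{
    struct hw_clock c = { 0, 1200000000ull, 5000 };

    printf("%llu\n", (unsigned long long)hw_clock_to_ns(&c, 4096));
    hw_clock_wrap(&c);  /* decoder saw a resource full message */
    printf("%llu\n", (unsigned long long)hw_clock_to_ns(&c, 4096));
    return 0;
}
```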
So I'll do a quick demo of the converter and of a crash handler that we have. Okay, the first thing I will show you is the raw data generated from the Nexus tracing port: basically just extremely simple values. But if we run the converter, we then get proper messages: a bunch of data acquisition messages, which carry custom data we piped into the messages. Here I had a simple counter; it counts to 512, I think. So this is the Nexus data, decoded.

Also, one of the things I did is hook a crash handler into the core dump path of the Linux kernel. What we can do with that is, when a program core dumps, generate a snapshot of the user space buffers of the currently running application, and also take a snapshot of the hardware trace buffer. So we have the core dump, the hardware traces and the user space traces, all in one. We can then view both the hardware traces and the snapshot with the user space tracing information: I have the data acquisition messages, and then my user space events generated through LTTng-UST. So this is a small demo of what we can do.

Okay, future work. We intend to work more on the ARM Embedded Trace Macrocell side, maybe do a decoder or converter to CTF like we did for the Nexus trace format; it would be nice to support that too. Another thing we're looking into is controlling the hardware tracing facilities with the existing lttng command line tool. And finally, custom views for hardware traces in the Eclipse plugin would be quite nice, to be able to visualize them properly.

In conclusion, what I want to say is that hardware tracing facilities are already available, and they are quite useful for debugging certain kinds of problems. We have done initial work to support self-hosted hardware tracing, such as the Nexus converter. And since manufacturers are all doing some kind of hardware tracing, perhaps there is a common abstraction we could aim for in the Linux kernel. So, I don't have any more slides. Any questions?

Yes? The question is whether that would be open sourced. That's great. Any other questions? Yes.

The question is how I gathered the snapshot on the machine. The core dump handler in the kernel calls the snapshot command of LTTng, so we get the UST buffers, and in that same script I just dumped the hardware tracing data that was available at that moment.
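As a rough user-space approximation of that hook (the version in the demo lives in the kernel's core dump path, and the debugfs path here is invented for the example), one could register a core_pattern pipe handler that records an LTTng snapshot and saves the hardware buffer:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical core_pattern pipe handler; install with
 *   echo "|/usr/local/bin/crash-snap" > /proc/sys/kernel/core_pattern
 * The kernel pipes the core file to our stdin on every crash. */
int main(void)
{
    /* Snapshot the in-memory buffers of the current LTTng
     * snapshot-mode session. */
    if (system("lttng snapshot record") != 0)
        fprintf(stderr, "lttng snapshot failed\n");

    /* Dump the self-hosted hardware trace buffer as well; this
     * debugfs path is made up for the sketch. */
    system("cat /sys/kernel/debug/qoriq-debug/cpu0/trace_buffer"
           " > /tmp/hw-trace.bin");

    /* Drain the core file from stdin so the crashing process
     * can finish dumping. */
    char buf[4096];
    while (fread(buf, 1, sizeof(buf), stdin) > 0)
        ;
    return 0;
}
```

Whether the hardware buffer still holds the interesting window by the time the handler runs depends on the overwrite policy discussed earlier.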
Yes, any other questions? On dropped events: well, it depends on how many events you have enabled. Some solutions will simply drop events when the ring buffer is full, and it depends on the configuration. With Nexus, you can either stall for a couple of cycles and then drop the event, or simply drop it right away when the buffer is full. So it really depends on your configuration. All right, no other question? Yes. Wow, well, I'm actually paid to work on LTTng, so that's why I'm working on LTTng, but the other one is a nice project too, and I think they fill different use cases. No other questions? Yes, you're asking what the status of CoreSight is on the TI OMAP. On the OMAP3, the ETM and ETB are connected and nothing else, so it should work, but it may not have been tried on that platform. Well, I've tried CoreSight on many platforms, but usually with my own custom way of setting up the infrastructure, so I cannot answer whether it just works or not. I think the driver was originally contributed by Nokia, and they were using the OMAP3 in their phones at the time, so you might be okay. It also depends on which ETM version it is, okay. So, yeah. All right, any other questions or comments? I guess that's it. All right, thank you.