Hi, everyone. Thanks for coming to my talk. Today I'll be talking about the coredump subsystem. My name is Eric Johnson. I'm a firmware solutions engineer with Memfault, and what I'd like to talk about is the coredump subsystem in Zephyr and how you can add it to your debugging workflow to augment your ability to diagnose faults and errors in your system.

So a little bit about myself. Again, my name is Eric Johnson. I'm a firmware solutions engineer at Memfault. I help our customers integrate our SDK, and I also spend time maintaining our MCU SDK. Previously, I worked at Walgreens Health, Athos, and Acuity Brands. While I was at Athos, I attended the ZDS component of Open Source Summit in 2018. That's where I got my first exposure to Zephyr; my main goal in attending that conference was to evaluate Zephyr and see if it was ready for us to use, and I've brought it to every single place that I've worked since then. I met one of our founders, Tyler, at ZDS in 2022. As for my contributions to Zephyr: a handful of humble bug fixes to the BLE, kernel, and shell subsystems for issues I've encountered while developing devices.

So, a brief agenda for what I want to talk about today. First, I'll give an overview of what a core dump is and what's contained in it. Then I'll move on to how core dumps fit in with Zephyr, and go over how they are triggered through the assert handling and fault handling that you get out of the box with Zephyr. I'll cover some of the components of the coredump subsystem, then move on to the host tooling that's used and some scripting you can do to start building up tooling around processing core dumps. I'll finish with a demo and some comments on future work that I hope to contribute.

So, Zephyr comes with a really great logging subsystem built into it, but logging can only take you so far in some cases. We've all been there: you've got a component that produces a really spammy log message, or you've collected a bunch of logs over a long period of time and it's hard to sift through them. The built-in panic message in Zephyr is also great; most RTOSes that I've worked with don't give you one, and being able to understand that a fatal error occurred, and get some insight into it, versus just seeing your device reboot, is really valuable. But the problem is that it's a static picture. It gives you a lot of the details, but you have to piece the picture together yourself.

This is where core dumps come in. These are triggered by faults, kernel panics, and assertions in your code. What a core dump does is capture information from your device about what specifically caused the crash. There are different backends that you can use, and I'll go over those in a little bit. The key thing is that we either want to stream the dump out immediately or store it to some non-volatile memory, because in this state the device needs to reboot.

So, some basic data components of a core dump itself. What we're doing is capturing different regions of RAM and saving them into a binary format. In order to decode your threads, you're going to want the kernel structure that your application is running; this gives you insight into the state of each thread when the fault happened, and from this data structure we can also determine the stack usage of each thread.
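As a reference point, turning all of this on is mostly a Kconfig exercise. Here's a minimal sketch of what a prj.conf might look like with the logging backend; these option names match Zephyr's coredump Kconfig, but double-check them against your tree:

    # Enable the coredump subsystem
    CONFIG_DEBUG_COREDUMP=y
    # Stream the dump out through the logging subsystem as ASCII text
    CONFIG_DEBUG_COREDUMP_BACKEND_LOGGING=y
    # Capture the minimum memory needed to walk threads and stacks
    CONFIG_DEBUG_COREDUMP_MEMORY_DUMP_MIN=y

Swapping the logging backend for CONFIG_DEBUG_COREDUMP_BACKEND_FLASH_PARTITION selects the flash backend I'll mention later.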
You can also capture any other data that's important to you. Maybe you have a sensor running and you've got some sort of ring buffer, or you're using memory slabs and you want to capture the state of that data structure; you can include that in your core dump for later analysis. Maybe you want to know how exhausted the BLE stack's memory pools were, so you can understand whether you need to increase the size of those buffers. Maybe you've got some dynamic memory allocated on the heap for lists or other data structures. Those can all be included as part of your core dump.

So how do we get started? How do we actually collect a core dump? I'm starting off here showing the call graph that Zephyr uses when an assertion is triggered. There are assert macros, which all start with a double underscore, __ASSERT. When one fires, it calls into the assert_post_action function, which eventually calls some arch-specific code that triggers the fault handler on your device. The example I'm showing here is for a Cortex-M device; Zephyr uses the SVC interrupt to trigger this. One thing to note is that when we look at the backtrace generated from an assertion, assert_post_action will be at the top of the call stack. So frame zero will be assert_post_action, and frame one will actually be where the assert was called from.

Faults are a little simpler. Essentially, the fault happens, it triggers the fault handler, and then several functions within Zephyr run: some arch-specific steps, eventually calling into the coredump subsystem. The key thing here is that frame zero will be the site of the fault in your code.

So this diagram outlines a few components of the coredump subsystem. You can think of the front end as the component responsible for formatting the memory regions into a core dump. The front end then drives the back end, which does the I/O: either streaming the core dump out, in the case of the logging backend, or writing it to your storage medium with the flash backend or a custom one. The backend is selected through the DEBUG_COREDUMP backend Kconfig options. As I said, there's a logging backend that can stream out of your logging console, the flash backend can use partitions defined in devicetree, and there's also the option to implement your own.

In terms of what data sources we're capturing: we capture arch-specific regions, things like the core registers of your MCU, the fault handling registers, anything that's really specific to your architecture. We capture the stacks of each thread. Then there are some newer features where you can define nodes in devicetree to capture device-specific data, and you can also define, at runtime, different memory regions that you would like collected as part of your core dump.

The example I'm showing here is a devicetree node that I've defined to capture the GPIO block of the QEMU Cortex-M3 target. This target is based on a TI Stellaris chip, so I pulled the memory region information from that data sheet, and actually from the devicetree board file for that target. There are two things to note here. The coredump-type property says memcpy; this is because we're capturing a region that is memory mapped, so it's safe for us to perform a straight copy, starting at the first address shown in the memory-regions property and continuing for the second piece, the length. There's a different type of device memory region you can capture where you implement a callback, and that's for cases where you can't simply copy RAM.
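The node looks something like this; a minimal sketch based on Zephyr's zephyr,coredump binding, with the address and length taken from the Stellaris GPIO Port A block (double-check both against your own target's memory map):

    / {
        coredump-devmem {
            compatible = "zephyr,coredump";
            coredump-type = "COREDUMP_TYPE_MEMCPY";
            /* start address and length of the region to copy */
            memory-regions = <0x40004000 0x1000>;
        };
    };

For regions that can't be read with a straight copy, the same binding supports COREDUMP_TYPE_CALLBACK, where a callback you register fills in the data instead.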
So this is an example of the output collected using the coredump logging backend. You can see it's encoded as ASCII text. We still get the panic print, so that's printed first, and we're going to take this output, save it into a file, and that becomes the input when we start up our GDB server to load the core dump itself.

As I mentioned, we take that text output from the logging subsystem and use a log converter that Zephyr provides as part of its built-in scripts to convert it into a binary format. Zephyr's GDB server uses a couple of different parsers. One is the ELF parser: it uses the symbol file from your application's build to load your text and read-only memory regions, and this is used to correlate PC values to specific function calls; it's what lets you move up and down the call stack. The log parser then takes all of the memory we collected from SRAM and builds up memory regions that GDB can query, whether from our scripting or from the normal GDB commands you're used to.

The GDB server provided by Zephyr implements the GDB remote protocol. That allows a client to connect to the server, and the remote protocol queries those memory regions that we've built up from our core dump and our ELF file. The GDB server uses what's called a GDB stub, which implements some basic commands that are part of the GDB remote protocol; some arch-specific code is required, and Zephyr supports different architectures through that. The final piece in this chain is the GDB client, which is what you typically load up as part of your normal debugging workflow.

GDB has Python extension support: you can create Python scripts that run commands in GDB, and it really helps you automate a lot of the tasks you might want to do as you're examining a core dump. A few things to note. The Zephyr toolchain's GDB executable defaults to the no-Python version, so when you're examining your core dumps and using your scripting extensions, you'll want to add the -py suffix. The other thing to note is that when GDB is built, it hardcodes an expected system Python location, so when you're first starting out you may run into errors saying it can't find the Python it expects on the system, and you'll want to install Python into that typical system location.

One thing we really recommend is using a virtual environment for all of your GDB scripts. This helps keep the rest of your system isolated from any dependencies those scripts might have. We have a package called GDB bundle that automatically detects GDB scripts you've written in Python in your virtual environment and imports them, so that when you're running GDB they're just picked up automatically. There are a lot of annoying and tricky things to set up to get this process started, and GDB bundle helps you do it in a much more manageable fashion.
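To make the scripting idea concrete, here's a minimal sketch of the kind of Python extension GDB can load. The command name zthreads is made up for this example, and it assumes a Zephyr build with CONFIG_THREAD_MONITOR and CONFIG_THREAD_NAME enabled so the _kernel thread list and thread names exist in the captured memory; field names can vary between Zephyr versions:

    import gdb

    class ZephyrThreads(gdb.Command):
        """Walk Zephyr's kernel thread list and print each thread."""

        def __init__(self):
            super().__init__("zthreads", gdb.COMMAND_USER)

        def invoke(self, arg, from_tty):
            # _kernel.threads is the head of the thread monitor list;
            # next_thread links the rest (CONFIG_THREAD_MONITOR).
            thread = gdb.parse_and_eval("_kernel.threads")
            while int(thread) != 0:
                name = thread["name"].string()  # CONFIG_THREAD_NAME
                gdb.write("0x%08x %s\n" % (int(thread), name))
                thread = thread["next_thread"]

    ZephyrThreads()

Once a script like this is on GDB's Python path, which is exactly the part GDB bundle automates, running zthreads at the (gdb) prompt prints one line per thread out of the memory captured in the dump.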
So I'll show a brief demo; let me switch over. In this first tab, I've collected that output I showed on a previous slide into a log file, and I'm going to use the built-in log parser that Zephyr provides to convert it into a binary file that the GDB server can load. So I'll go ahead and run that. What I ran is the Python script with two arguments: the first is the log itself that we collected, and the second is just where the output goes. So now we have our core.bin.

In the second window, I'm going to start up the GDB server, and the two arguments I'm providing are the symbol file that we built, zephyr.elf, and the core.bin file that we just generated in the previous step. When I run this, the script sits there and waits until a GDB client connects to it.

Now, before I run the GDB client, I'm going to show a GDB init script that I've written. The first half uses a snippet of code to initialize GDB bundle. The commands in yellow are GDB commands: we start off with python and then write some Python code. This imports GDB bundle and initializes it, which sets up our Python path and collects all of the GDB extension scripts that we've created. Finally, I've added a command to automatically connect to that GDB server I started in the other tab, which is waiting for the client to connect. So I'll exit out of here.

Here I'm using the Zephyr toolchain's GDB, again with that -py suffix. I'm passing a command option to specify where my GDB init script is, and finally we load the ELF file that we built with. ... Excellent, I know what the issue is, at least: I did not have my virtual environment activated. Okay, great. At this point, what we've done behind the scenes is load our build, load the core dump, and GDB has jumped to the location where we hit the error in our application.

You can see, and I'll make this a little bit bigger, at the bottom that we've stopped at calculate_transformed_reading. This is a sort of toy application that I created for the session today, and we've paused here. The great part now is that we can run most of the GDB commands that you know and love from regular step debugging. We can backtrace to see where in our code we came from; in this case I have a thread called processing_thread, and we hit an error at calculate_transformed_reading.

To show off those extensions I was talking about, we have a couple of little demos. The first one I created parses the GPIO peripheral on this target, and it's just called gpio. It goes through one of the GPIO peripherals, walks through each of the pins, and spits out some processed output: it takes the memory that we collected in our core dump and parses it to understand whether the pins were enabled, what direction they're set to, their current values, and the state of their interrupts.

Another simple script I'll demo prints out the state of the other threads in our system. Here we can see our processing_thread, which, as we saw, is where the core dump occurred, but we can also see the state of the rest of the system: the system work queue, the shell UART, what our logging system was doing. These are very simple examples, but my hope is that they give you ideas on how you can really expand your debugging capabilities.
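Stitched together, the demo flow is just a few commands. This sketch assumes the scripts live at their usual paths in the Zephyr tree and that the server listens on its default port; adjust paths and the toolchain prefix for your environment:

    # 1. Convert the ASCII log capture into the binary core dump format
    python3 zephyr/scripts/coredump/coredump_serial_log_parser.py coredump.log core.bin

    # 2. Start the GDB server with the build's symbol file and the dump
    python3 zephyr/scripts/coredump/coredump_gdbserver.py build/zephyr/zephyr.elf core.bin

    # 3. In another terminal, connect a Python-enabled GDB
    arm-zephyr-eabi-gdb-py build/zephyr/zephyr.elf -ex "target remote localhost:1234"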
So I'm going to jump back to the presentation. What is some future work that we can do on the coredump subsystem in Zephyr? As part of my research for this presentation, I found a fairly serious issue with the flash backend: it is not totally ISR-safe to use. This matters because when we take a core dump, we're in a fault handling context. We can't use interrupts, there's no context switching; the coredump subsystem just runs until the system reboots. So that will need to be reworked, because the flash backend uses Zephyr's built-in flash API, and there are semaphores used in there. As soon as we take a semaphore, if we have assertions enabled, which is a powerful feature to pair with core dumps, we go into a lockup, so this isn't really a suitable backend at the moment.

A few other improvements I hope to make: allowing the application to override the linker memory region that is specified as part of the coredump subsystem. Right now you can enable a Kconfig and it will use some linker-defined symbols to capture essentially all of RAM, and a lot of the time that's not practical; you don't have enough space in your storage medium for it. So allowing a more application-defined region would, I think, be pretty useful.

One thing we can do to help triage and diagnose the issues that users report to Zephyr is to teach users to submit core dumps with their bugs and issues. Then a maintainer can load up the core dump and get additional context on what caused the fault or error.

I'd also like to add an example using the flash simulator as a backend. What this can give you is a simple way to collect a core dump from a device that isn't connected at your desk. As I said, the flash backend has a few issues that need to be worked out, especially when assertions are enabled, so we could instead use the flash simulator to store the dump in a no-init region of RAM; that way the device can reboot and the core dump is still sitting in that section of RAM. And finally, there are some West extensions I'd like to write to automate a few of those manual steps I was doing.

A little bit on where to find me: LinkedIn, GitHub. Memfault, the company I work for, writes a blog called Interrupt, where we put out posts on a lot of different embedded systems topics. I highly encourage you to check those out; we have a lot of information on scripting and GDB, and I just put up a post on the ring buffer API in Zephyr, how best to use it, and how it's used in practice.

A little bit about what we do at Memfault. Core dumps are obviously one of our key features, and this is what a core dump looks like in Memfault. We do a lot of additional work to add in different details, things like the state of threads and detecting stack overflows, and this is just a peek into the different memory regions that we collect and how we process them.

Just a few quick acknowledgments. I'd like to thank everybody who maintains the coredump subsystem; I hope to start contributing a lot more to it in the near future. Thanks also to Daniel Leung, who gave a talk a few years ago going deeper into the guts and internals of the coredump subsystem. I've also included with these slides the links to GDB bundle, please go check that out, we'd love feedback on it, and to the gpio command that I implemented as part of the demo. Thanks.
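As a sketch of that no-init idea: Zephyr lets you place a buffer in a section that isn't zeroed by the startup code, which is the property that lets a dump survive the reboot. The buffer name and size here are made up for illustration, and a real backend would also want a header and checksum to tell a valid dump from leftover garbage:

    #include <zephyr/kernel.h>
    #include <zephyr/toolchain.h>

    /* Placed in the .noinit section: not zeroed at boot, so whatever
     * the coredump backend wrote here before the reboot is still
     * readable on the next boot. (Hypothetical buffer; size it to
     * your expected dump.) */
    static uint8_t coredump_storage[8192] __noinit;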
Amazing stuff, thank you very much. It was very interesting, and fun stuff to be doing. I can guess a few answers to this question, but I'd like to hear yours: why did you go with the GDB server approach instead of writing an ELF-based core dump file and then loading that in GDB through normal means?

So in this case, for the demo, I was using the logging subsystem output, and that kind of requires that intermediate step. But you certainly could collect the standard ELF format and then have that same GDB server provided by Zephyr load it in.

But, I mean, on host Linux I would just take a core dump file, say "gdb elf coredump", and GDB would load it. I was asking about that approach, without the GDB server.

In that case, I'll be honest, I'm not completely sure of the technical differences there. If I had to guess, it's because the typical GDB client might not know about some of the memory regions that we're loading as part of the Zephyr-specific GDB server script.

Okay, okay, fair enough. Thanks a lot.

There was a talk yesterday about tracing in Zephyr, and they used retained RAM for storing the trace data. Could you imagine also storing your core dump in retained RAM? I think Linux can also store, I don't know if it stores core dumps, but it can store logs into ECC-protected memory that you can read out on the second boot. Did you consider doing something like that?

Yeah, definitely. I might have touched on that indirectly in the future work slide. I think the way to do that right now in Zephyr would be to use the flash simulator support that's built in now. I was perusing GitHub to see if anybody else had done this in the past, and it was actually suggested; a feature request was closed because you can use the flash simulator to mark off that section of RAM. So as long as you define the address of the RAM region that you're referring to, you should be able to use it there.

But you would still go through the flash layer and have the problem with the semaphores, or why else would it be called a flash simulator?

Yes, that is true. So the second piece there, and I didn't mention this, so this is great, thank you for pointing it out, is that the other piece of work I'd like to do is adding another backend specifically for that.

Okay, thank you. Any last questions? Great, thank you, everyone.