Greetings, and welcome to my talk about debug probes. My name is Johann Fischer. I am an embedded systems developer at Phytec, and my job is to design development boards, radio modules, and also software components and drivers. One of my still active projects is the design of the reel board, and this is also the motivation for this project. I will describe what a debug probe is, what we need to implement in Zephyr for it, and what I have done to get it working in Zephyr.

So, what is a debug probe? It is a device which serves as a bridge from an interface like USB or Ethernet to the debug port of a microcontroller. There are a few well-known debug probes: for example, the SEGGER J-Link or J-Link OB, which you can find on Nordic boards, the ST-Link, the Keil ULINK as an external device, and, last but not least, DAPLink-based on-board debug probes.

Here is a board with an on-board DAPLink-based debug probe. In the top middle of the board you can see the debug probe itself. It is based on the K20DX SoC from NXP. It is a relatively small circuit, and the firmware is based on the DAPLink project.

Next, what is the DAPLink project? It is a great open source project under the Apache license. It is well supported by OpenOCD and it offers at least three interfaces. The first is USB CDC ACM for logging, for example for tracing and debug messages; we also use it for the shell in Zephyr. The next one is a USB human interface device (HID) for the debug channel, which is based on the CMSIS-DAP standard. The third is USB mass storage device support for drag-and-drop programming. It also offers a bootloader mode for updates over the USB mass storage device, so we can update the debug probe itself.

There are a few drawbacks with the project itself. We need proprietary tools to build it, and that is only possible on Windows. There is a way to do it on Linux, but you still need the ARM compiler and a license for it.
And the CMSIS-DAP code itself is not well abstracted; it is more of an example of how to do it. But CMSIS-DAP itself is a vendor standard from ARM, and it describes how to access the CoreSight components via USB.

From the USB view, or from the user's view, what features does a debug probe provide? The features that we use every day for our daily work, for example for make flash, make attach or make debug, are the human interface device and CDC ACM for debugging. The human interface device provides the debug channel, and this channel can be used with every target. There is no vendor-dependent code involved in this communication, so I can use the same probe to flash or debug a Nordic SoC, an NXP SoC or an ST SoC.

The optional feature is USB mass storage device support for drag-and-drop programming, which is very handy for beginners. You don't need deep knowledge about the build process or the flashing process: just attach your board to USB and copy the firmware over. In this case it does depend on the vendor. You cannot just take the probe and use it with another SoC, another target, because there is a flash algorithm on the debug probe. There is also a project to build this algorithm, and it is always made for a specific target.

So what is my motivation to build a debug probe with Zephyr RTOS? It should not force me to use proprietary tools. I am a Zephyr developer, I use this operating system, and I would like to have a 100% Zephyr RTOS-based development board. It is the second best operating system in the world, because the first one is Linux. We could also have different interfaces to the host; for example, we could use Ethernet or Bluetooth instead of USB. We could also implement a GDB server on the probe itself and connect to it over Ethernet. And it should also be possible to use different SWD drivers, for example a faster one.
And we have a few other USB classes in Zephyr, so we could combine it with other human interface devices, or the Ethernet class, or something else. So my approach is to use the CMSIS-DAP reference and adapt this debug channel to Zephyr RTOS.

So what do we need to implement in Zephyr? The first thing is the connection to the debug interface of the target. SWD is the most used interface on the SoCs that we use in Zephyr, so I limited my work to an SWD driver only. There is no JTAG support. We need an SWD API for the driver. We need a controller or handler which mediates between the host interface, like USB, and the driver itself, and it should not depend on the host interface. And last, we need a host interface, for example over USB. In this case it is a human interface device, and we already have CDC ACM support. What we still need, and there is a PR in review for it, is support for the serial number.

And of course we need hardware. In the DAPLink project it is called the hardware interface circuit, because it builds an interface to the target. These circuits are very cost efficient, and they are always based on ARM, because the targets are ARM targets. There are a few compatible circuits: NXP or Freescale boards use Kinetis K20 or K22 based hardware interface circuits, Nordic uses Atmel, and NXP also uses LPC. The connection to the target is usually over serial wire, which is a two-line connection, plus a UART for debug messages and tracing. Just a small note: the hardware interface controller on the reel board cannot be used for development, because of the RAM boundary issue on Kinetis, and we do not have enough RAM for it. For Kinetis, it is half of 16 KB.
So this is a simplified circuit, and you can see that we need several GPIOs on the debug probe itself to emulate this interface, because there is no native implementation of SWD in the controller. This is what is used for debugging on the board.

But what is the DAP itself, what does it mean? The debug access port is an implementation of the ARM Debug Interface, and it consists of at least two main blocks. The debug port (DP) provides the connection to the debugger, so everything in the direction of the debugger, and it provides an interface to the access port. The access port (AP) provides access to the debug resources. There are a few variants; for example, for serial wire the implementation of the debug port is the serial wire DP, or SW-DP.

This is a simplified overview of the entire system. On the right side you can see a Zephyr developer using nice and free open source tools like GDB and pyOCD to access the debug resources. What I am talking about in this project is just a small part of this entire system: the debug probe. It is connected over serial wire to the debug access port. The debug access port is a part of the SoC and it is connected to the target. You can see a few things like the ROM table, debug registers, system memory, access ports and debug ports; maybe that is familiar to you. If you call pyOCD with the verbose option, you will see how pyOCD explores the SoC to find all the debug resources.

So what are serial wire and the SWD protocol? Serial wire is a serial interface with a clock and a bidirectional data signal. The data is sampled on the rising edge of the clock, and the data line can be driven by the host and by the target. A transfer consists of two or three phases. The first is the packet request phase, one octet, followed by the acknowledge phase and, if the response is OK, by the data phase. Here we can see two samples, for a write and a read. The first one is a write transfer.
It is a write to the access port, and the other one is a read of the ID register from the debug port. One thing about the SWD protocol is that an entire transfer needs 46 clock periods. That is not a multiple of eight, so we cannot use, for example, a serial peripheral interface (SPI) to emulate SWD. We need to implement it as bit-banging, for example. In the request phase the line is driven by the probe, and in the acknowledge phase it is driven by the target, by the access port or by the DP. So we always need a turnaround cycle between the phases to switch the direction, for example to switch the pin from output to input, and the same again after the acknowledge if it is a write request.

There are three possible responses: OK, WAIT and FAULT. There is no explicit response for a protocol error. In that case the data line is not driven and is held high; there are pull-ups integrated, mostly in the access port. So a protocol error can also be detected: we then just have all ones in the acknowledge.

The other thing is that SWD can buffer access port writes. So if you write to the access port, an OK acknowledge does not mean that the write was successful; we need to check it again. Access port reads are always posted, so we need another read access to the access port to get the data. Or, if it was the last request, we can get the data from the read buffer of the debug port.

These are just a few details that we need to know before we can implement the controller and the driver itself. Because there is no dedicated interface for SWD, the driver needs to be implemented as a bit-bang driver. That requires a fast GPIO API, or fast GPIO drivers, in the system. But our GPIO API in Zephyr is very slow and cannot be used for that. The workaround is to write directly to the GPIO registers in the bit-bang driver itself. There is also a workaround to detect which port is used. The logic of the driver itself is from the reference implementation.
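To make the request phase and the bit-banging a bit more concrete, here is a minimal sketch in plain C. It is not the code of the driver from this talk. The bit layout of the request octet and the acknowledge values follow the ARM Debug Interface specification; the function names and the two pin helpers are hypothetical, and the pin helpers only record the data line into an array instead of touching real GPIO registers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ACK values as they read when the three ACK bits are received LSB first. */
enum swd_ack {
	SWD_ACK_OK = 0x1,
	SWD_ACK_WAIT = 0x2,
	SWD_ACK_FAULT = 0x4,
	SWD_ACK_NONE = 0x7, /* line not driven, pull-up gives all ones */
};

/* True for the three defined responses, false for a protocol error. */
static bool swd_ack_valid(uint8_t ack)
{
	return ack == SWD_ACK_OK || ack == SWD_ACK_WAIT || ack == SWD_ACK_FAULT;
}

/* Request octet, LSB first: Start, APnDP, RnW, A[2], A[3], Parity, Stop, Park. */
static uint8_t swd_request(bool ap, bool read, uint8_t addr)
{
	uint8_t apndp = ap ? 1u : 0u;
	uint8_t rnw = read ? 1u : 0u;
	uint8_t a2 = (addr >> 2) & 1u;
	uint8_t a3 = (addr >> 3) & 1u;
	uint8_t parity = (uint8_t)(apndp ^ rnw ^ a2 ^ a3); /* even parity */

	return (uint8_t)(1u | (apndp << 1) | (rnw << 2) | (a2 << 3) |
			 (a3 << 4) | (parity << 5) | (1u << 7));
}

/* Clock periods of one write transfer. */
static unsigned int swd_write_transfer_clocks(void)
{
	return 8u   /* request */
	     + 1u   /* turnaround, probe releases the line */
	     + 3u   /* acknowledge */
	     + 1u   /* turnaround, probe drives the line again */
	     + 32u  /* data */
	     + 1u;  /* data parity */
}

/* Hypothetical stand-ins for the GPIO registers (assumption, for illustration):
 * the data line level is recorded at every clock cycle. */
static uint8_t trace[64];
static size_t trace_len;
static bool swdio_level;

static void swdio_set(bool level)
{
	swdio_level = level;
}

static void swclk_cycle(void)
{
	if (trace_len < sizeof(trace)) {
		trace[trace_len++] = swdio_level ? 1u : 0u;
	}
}

/* Shift out 'count' bits of 'value', least significant bit first. */
static void swd_write_bits(uint32_t value, unsigned int count)
{
	while (count--) {
		swdio_set((value & 1u) != 0u);
		swclk_cycle();
		value >>= 1;
	}
}
```

With this layout, a read of the ID register from the debug port (address 0) gives the request octet 0xA5, and the clock count of one write transfer adds up to the 46 periods mentioned above. A real driver would replace the two pin helpers with direct GPIO register writes.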
It is tested on nRF52 and Kinetis K64. There are a few problems with the driver. The clock frequency cannot be set accurately; there is a small difference. And the driver itself supports only a circuit based on two tri-state buffers. The SWD API in Zephyr is experimental and tested only with the bit-bang driver. But theoretically it should be possible to change the driver and use another implementation, for example in an FPGA.

Now the performance of the driver. With the current implementation and an nRF52 CPU running at 64 MHz, the maximum SWD serial wire clock is about 4 MHz. But it is only reliable up to 3 MHz, because there are some problems at higher frequencies with how the signal state of the data line is read back. It is plain C. The delay between setting the edges is implemented in inline assembler, but everything else is just plain C, so we can say it is portable, something like that.

It can be optimized with inline assembler, and most of the performance can be gained for the write transfer. For the write transfer I got it down to 8 CPU cycles per bit using conditional execution on Cortex-M4, so it is possible to get the clock frequency up to 8 MHz. But that does not mean that the overall performance will be better. There are some drawbacks, because the inline assembler code is less understandable and less portable. I will show just an example. This writes 8 or 32 bits to a pin. It is not really readable, but it needs just 8 clocks. So if someone gets it shorter, I will give them a reel board.

The other part that we need is the DAP controller itself. It is a handler between the host interface and the SWD driver. This implementation also follows the CMSIS-DAP reference implementation: it expects a request buffer from the host interface and passes a response buffer back.
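The request buffer in, response buffer out idea can be sketched roughly like this. It is only an illustration, not the controller from this project. The command IDs, the port values and the 0xFF answer for an unknown command are taken from the CMSIS-DAP specification; the function name dap_handle_request, its signature, and the decision to accept only SWD are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Command IDs and values as defined by CMSIS-DAP. */
#define ID_DAP_CONNECT     0x02u
#define ID_DAP_DISCONNECT  0x03u
#define ID_DAP_INVALID     0xFFu
#define DAP_OK             0x00u
#define DAP_PORT_DEFAULT   0x00u /* in a request: use the default port */
#define DAP_PORT_FAILED    0x00u /* in a response: initialization failed */
#define DAP_PORT_SWD       0x01u

/*
 * Hypothetical handler: parses one request and fills the response.
 * Returns the number of response bytes, or 0 if the buffers are too small.
 * It does not know whether the buffers come from USB, Ethernet or a shell.
 */
static size_t dap_handle_request(const uint8_t *req, size_t req_len,
				 uint8_t *rsp, size_t rsp_size)
{
	if (req_len < 1 || rsp_size < 2) {
		return 0;
	}

	rsp[0] = req[0]; /* a response starts with the command ID */

	switch (req[0]) {
	case ID_DAP_CONNECT: {
		uint8_t port = (req_len > 1) ? req[1] : DAP_PORT_DEFAULT;

		/* Only SWD is supported in this sketch, JTAG is refused. */
		if (port == DAP_PORT_DEFAULT || port == DAP_PORT_SWD) {
			rsp[1] = DAP_PORT_SWD;
		} else {
			rsp[1] = DAP_PORT_FAILED;
		}
		return 2;
	}
	case ID_DAP_DISCONNECT:
		rsp[1] = DAP_OK;
		return 2;
	default:
		rsp[0] = ID_DAP_INVALID;
		return 1;
	}
}
```

Because the handler only sees two byte buffers, the same function could be fed from a HID report, from an Ethernet packet or from a shell command, which is the independence from the host interface that the controller is aiming for.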
It just forwards simple requests to the driver. It is no rocket science. This implementation of the DAP controller should also be independent of the host interface. That is almost done; the problem is the registration and initialization of the DAP controller. The current state is that only SWD is supported and the DAP index is not used. It is also not used in pyOCD and DAPLink; it is a reserved feature. As I understand it, this should make it possible to have more than one SoC connected to a probe. The implementation works well with pyOCD and ARM Debug Interface v5, which we can find on Nordic SoCs and on Freescale SoCs, and I tested it on Microchip's SAM D21.

This is an overview of the entire implementation. The USB human interface device class is the implementation from Zephyr, from the USB stack. The human interface device in the middle could be removed or exchanged with an Ethernet interface implementation or something else, for example also a shell in the future. The DAP controller or handler itself uses a mailbox and memory pools. I think it is the only user of them in the subsystems; no one else uses them, I don't know why. They are very, very handy. And there is the SWD driver with the pins.

Now the host packet structure. There are two important request types coming from the host interface which influence the performance of the entire probe. One is that the host may submit several requests packed in a block, that is, a block of requests. The other is that the host requests a block of data, for example a burst read from or a burst write to the target.

The next part is the host interface via the human interface device. It is very easy to implement, because it uses the human interface device class from Zephyr. It provides a CMSIS-DAP compatible interface and can be combined with other functions like CDC ACM, mass storage device, Ethernet, and anything else that we have in Zephyr.
But it still needs a few more improvements, like runtime registration and deregistration, and coexistence with another host interface, for example a DAP shell or an interface via Ethernet. It should be something like hot-plugging of the interface: if you disconnect USB and connect Ethernet, it should still work. And I would love to have a shell implementation.

Now the interface performance via USB, which sets the performance of the entire system. For full-speed devices and an endpoint size of 64 bytes, we can have about 1200 bytes per frame, and there are 1000 frames per second at full speed. But what SWD provides in terms of performance is this: the default clock period is about one microsecond, and one transfer of 32 bits takes about 60 microseconds. So the burst read of a block of 64 bytes takes about one millisecond. So there is a lot that we can improve for the throughput; it is far below the possible USB full-speed performance. The typical throughput is about 23 kilobytes per second. Yeah, it is not that good. But it's the same with the capacity. It's not the same.

A few details about the tools for debugging that I used: sigrok and PulseView, which are open source tools for logic analyzers. These tools have an integrated SWD protocol decoder, and also other decoders for debug resources like ETM, ITM and TPIU. The next steps I would like to do are to implement SWO support and debug access port host support in Zephyr, or alternatively, maybe later, an alternative host interface over Ethernet.

A few samples from the logic analyzer. The first one, on the top, is from DAPLink with the default clock of one megahertz. You can see there is a lot of jitter on the clock line: for the write transfer the clock is about 800 kilohertz, and it drops to 600. The other one is the Zephyr-based probe with the default frequency of one megahertz. And it is a little better.
But it is still not the clock that the user sets: for one megahertz we get 960 or 900 kilohertz. And this is captured from the J-Link on the Freedom board. It is much better than the DAPLink and Zephyr-based probes; there is no jitter at all on the clock line.

And this is just an overview of the file structure in Zephyr. Questions?

You mean that one? Why is the compiler not able to optimize it? The compiler is not that bad, which is why I say the overall performance is unknown. But as a human you know what you want to have on the clock line. You know the needs; for the compiler it is not possible to know that. But I don't know if it will be better. As I said, the overall performance is just unknown. I think it is possible to make it faster with inline assembler, but then it is not good for people; I mean it is not readable.

At the end of the slides there are a few references to the implementation of DAPLink itself, as embedded hyperlinks in the PDF, and also references to the tools and the specs from ARM.
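As a closing worked example, the interface performance figures quoted in the talk can be checked with a little arithmetic. This is only a back-of-the-envelope sketch in C; the function names are made up for the illustration, and the model assumes one SWD transfer per 32-bit word and nothing else.

```c
#include <stdint.h>

/* Upper bound of the USB side in bytes per second. */
static uint32_t usb_bytes_per_second(uint32_t bytes_per_frame,
				     uint32_t frames_per_second)
{
	return bytes_per_frame * frames_per_second;
}

/* Time in microseconds to move a block, one SWD transfer per 32-bit word. */
static uint32_t swd_block_time_us(uint32_t block_bytes, uint32_t transfer_us)
{
	return (block_bytes / 4u) * transfer_us;
}

/* Resulting SWD throughput in bytes per second, ignoring all other overhead. */
static uint32_t swd_bytes_per_second(uint32_t block_bytes, uint32_t transfer_us)
{
	uint64_t scaled = (uint64_t)block_bytes * 1000000u;

	return (uint32_t)(scaled / swd_block_time_us(block_bytes, transfer_us));
}
```

With 1200 bytes per frame and 1000 frames per second the USB side allows about 1.2 megabytes per second. A 64-byte block at 60 microseconds per transfer takes 960 microseconds, which is the "about one millisecond" from the talk, and that alone limits the SWD side to roughly 66 kilobytes per second. The typical throughput of about 23 kilobytes per second is lower still, presumably because of overhead that this simple model ignores.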