Thank you all for coming. This talk is No USB, No Problem, about Grainuum, a software-only USB stack for, in this case, specifically a 48 megahertz ARM Cortex-M0+ CPU. So that's a very specific CPU, but it turns out that it's actually relatively common. The talk is going to be three sections. The first section: why would anyone want to do this thing? Why would you want to bit-bang a USB stack on a 48 megahertz Cortex-M0+ CPU? Part two is going to be how it's made, and this part is where we include things like scope traces and code snippets and things like that. This is the real meat of the talk. And then we're going to go on and talk about what now. Now that we have this bit-banged USB stack, what can we actually do with a low-speed USB stack?

First, a little bit about me. I first presented at CCC a couple of years ago talking about SD cards, getting code execution on them. Very embedded system. It's a little 8051. You can't do a whole lot on these except SD cards. I'm also known for working on the software on the Novena open-source laptop. Thank you. And the laptop had this battery board that's actually closer to what I'll be talking about today. It has its own CPU on it, and it's a Cortex-M0 as well. So this really is applicable to the USB bit-bang stack that we talk about.

Now, the takeaways I'd like you to have from this talk: I want you to have a better understanding of low-level USB. A lot of how-tos and books and things like that talk about the higher-level things, about descriptors. You have a chip now that supports USB, what do you do with it? Not many of them talk about the low-level protocol stuff. So I want to tell you a little bit about the protocols at a low level. I'm also going to have some tricks that you can use to improve your embedded programming. Embedded programming is really different from web stuff, from desktop stuff.
And there are some tricks that you use for embedded programming that I'd like to share with you. But I think the most important thing is to know what's important and what's not. When you have a new project, when you have a new protocol you're trying to implement, there are things that are very important that you absolutely cannot get wrong. And then there's this whole other section of stuff that they say is important, but it turns out it really isn't. It's important to know what's important and what you can kind of punt on and worry about later.

With that, let's get started. Why would you do this thing? Well, the theme for this event works for me. USB is really, really easy if you're a user. It can be a little bit difficult. So USB is great. You can plug it in. You don't have to worry about any sort of configuration. There's none of this anymore. No environment variables, no interrupts. You plug it in and it works. The other motivation is that USB is everywhere. A friend of mine has this USB Pokeball for some reason. It literally is everywhere. So everything supports USB, and if you want something to work for somebody else, you should probably go with USB.

A couple of years ago, we did this. This is Fernvale. This is a mobile phone CPU on a kind of breakout board. In the middle there, you can see a white box, and that is a UART. That's a serial port. It runs at about 3.3 volts. It is really easy to get going for a developer, because you write a value to an area in memory and it pops out the port. But from a user perspective, you have to go and find a TTL serial cable and then get USB drivers working, and it's just a huge pain for the normal user. Towards the end of the development cycle, we got USB support working natively. So all you had to do to develop on the CPU was plug it into the USB port. It had power. It had USB serial communication. And it was really easy to bring it up.
And this actually surprised me, because I had the serial cable set up. I mean, I thought that was relatively easy. But just having USB serial made it so easy for other people to get going and start contributing code that it really stuck with me that USB support on the thing is huge for enabling users to use your product.

Now, fast forward to last year. Bunny did a board for Burning Man, codename Orchard. If you look, you can see the antenna sticking out the side. At Burning Man, there are some special requirements when you want to do a badge. You want mesh communication type stuff, not necessarily internet of things. You don't need a gateway, you just need a mesh so the badges can talk to each other. Of course, this being Burning Man, it had lots of blinky lights. It's a really cool project. But if you flip it over, on the back you can see there is a big black box, and that's a battery. And there's a USB port. For this badge, this actually acts as a phone charger. It's a portable battery pack like you'd get at the airport when you're running out of battery. That's because the CPU doesn't actually do USB. It has a radio in it that does 900 megahertz, and it's fantastic for mesh networking, but it doesn't have USB support.

We took the Orchard core and reworked it. We did a project with MIT, and now the port has gone from type A to OTG. And Bunny did a thing where he decided to wire up the D+ and the D- pins to the CPU in case somebody decided to do a bit-bang USB stack. At the time, I thought, now, why would you want to do that? But fast forward a couple of months, and I was talking to somebody who wanted to do gaming controllers. What they'd done is they'd taken Raspberry Pis, daisy-chained them with OTG cables, and they had the Pis in a box with an arcade stick. And I thought, well, I have this wireless radio. All it needs is a USB stack. That's really all it needs. How hard could it be?
So Project Palawan was born as a question of: can it be done? Now, to be fair, there are other software USB stacks. V-USB is the most famous. It's a really impressive piece of software that gives low-end AVRs, such as Arduino, a full low-speed USB stack. It's really impressive. It does it all at 12 megahertz. But of course, the big problem with that is it's for AVR, and we want to do it for ARM. The other software stack that I'm aware of is, I'm actually not sure how to pronounce this. I think it's like MCUSB. It is very similar to what we have. It's for a Cortex-M0+, also very impressive. But they both have one problem that makes things a little bit difficult, and it was a problem that I wanted to fix.

One problem with hardware is that it's kind of a minefield. Even if you have the documentation, even if you have the reference manuals, there are all sorts of pitfalls and traps that you might miss. This is a line out of one of the reference manuals. It says port A is assigned a dedicated interrupt, and port C and port D share an interrupt. And then it goes on to talk about DMA. Now, if you read between the lines, you notice they don't actually mention port B. Port B exists, but it has no interrupt. And this is the kind of problem that you find out about two weeks after you get the board back from the fab. Hardware is just full of these things. Here's another one. It says PTB3 and PTB4 are true open-drain pins, which sounds great. It goes on to say you need to use an external pull-up, which basically in English means that PTB3 and PTB4 are input only. You can't use them for output. And again, you find this out about two weeks after you get it back from the fab, when you're actually trying to figure out: I'm sending a value out, why isn't it coming out? Oh, it's because this pin is input only. And this is just all over the place. Here's a pinout diagram for one of the chips. It's really small. But the thing to note is that pins have multiple uses.
So for example, PTA0 on a chip, you can use it as a GPIO. Sometimes you can use it as an ADC. But a lot of times PTA0 is one of the debug pins, and so it's reserved. And both of the other bit-bang USB stacks reserve PTA0 and PTB0, or pin 0 and pin 1, as their D+ and D- pins, and it's completely invariant. This makes the math really good, really easy to work with. But I wanted to have a special feature where we could have different D+ and D- pins, so it would work on one of these things if we really wanted to. It's a really tiny chip. I think ingestible computing is what they call it.

So, doing the math, USB low-speed signaling is 1.5 megahertz. On a 48 megahertz chip, that gives you 32 clock cycles per bit. And the target CPU, again, is a Cortex-M0+ at 48 megahertz. Now, the plus is actually really important. The ARM Cortex-M0 is a three-stage pipeline. With the M0+, they knocked it down to two stages. This is a screenshot from the ARM programmer's reference manual. If you look, everything on there, more or less, is one cycle. Anything mathy, if you want to add, if you want to subtract, if you want to shift, is one cycle. Anything involving the program counter is two, so most jumps are two. Anything loading or storing from RAM is two. And anything weird, like a memory barrier or a coprocessor load, is three. So when I ran the numbers, it seemed like it could be possible. With that in mind, I designed the board, Palawan, and I got to work. Spoiler alert: it actually does work.

So let's go on to how it's made. Grainuum is the name of the project, and it has roughly this architecture. At the top you have user code, which is things like descriptor callbacks and how to get buffers. If you want to use it in your project, this is the thing that you'll be writing to. Below that is the state, which keeps track of where USB is in its state machine, and it's in grainuum-state.c.
Below that you have the PHY, which handles all the low-level PHY stuff. And below that you have PHY-LL, which is the hardware-dependent implementation. One of the great things about this is that everything here, with the exception of PHY-LL, is written in C, which means that it's portable to anything you want. You could relax the separate D+ and D- pins requirement and port this to something that is slower than a 48 megahertz M0+, and it would work just fine. So what I'm going to do now is start actually slightly below this, and we're just going to work our way up, going over the basics of USB.

To start with, we've got electrical. I mean, everything's basically wires. And one of the cool things about USB is they actually standardized the wiring colors. So if you cut open any spec-compliant USB cable, you'll find these four colors. Red is five volts, ground is black, D+ and D- are white and green, and I actually forget which one's which, because largely it doesn't matter. As for the schematic, this is the schematic that I used. The thing about hardware is it's really hard to make patches, so you tend to put in a lot of extra stuff. So for example, USB low speed requires a, I think it's a pull-up on the D- pin to let the host know that it's low speed. But you also put in a pull-up on D+, just in case you get it backwards. I also put in provisions for pull-downs in case something had to be pulled down for some reason, and you put in shunt resistors and a whole bunch of ESD protection. Basically, if you remove all the extraneous fluff, your schematic could look like this. So in terms of wiring, it's not too bad to get it working. As for the traces, this is what the PCB ended up looking like. I'll go into that later. But if you plug this PCB into a Raspberry Pi, it is going to say this: new low-speed device number four, and then it's going to give a descriptor error.
And the fact that it says low-speed device means that we actually got the pull-up correct. So great, that means the hardware seems correct. Incidentally, the error is minus 32. If you look up that errno, that's EPIPE, which just means that, hey, it tried to access endpoint zero, the configuration pipe, and, well, we haven't written any code yet, so it didn't get anything back. So yeah, broken pipe, that makes a lot of sense.

So great, that's the physical stuff, that's all the electrical stuff, just the wiring. Let's talk about signaling. Let's talk about what gets sent over those wires. USB has three states. There's J, K, and SE0. K state looks like this, J state looks like this, and SE0 looks like that. I always get the K state and the J state mixed up. Largely, it doesn't matter. USB is concerned about the transition from one to the other, and in fact, between low speed and full speed, the J state and the K state are opposite. There's no SE1. If you have a single-ended one, where both wires are high, that means bad things have happened. You have a short, you screwed up your code, whatever. The host is going to disconnect you and it's going to reset the line. It thinks something has gone horribly, horribly wrong, which it has, and so you basically lose the connection.

So great, you have the signaling. Now, decoding states. USB is sent over time, it's self-clocking, and it's concerned about transitions. So if we sample at one point, then wait 32 clock cycles and sample again, we can compare the two: if it's the same sample, it's a one, and if it's a different sample, it's a zero. And it would be really great if there were an instruction for that. This is actually an exclusive NOR, which is a type of gate that we don't have an opcode for. We'll have to use two opcodes, one to XOR and one to negate. But it's not too bad. Going up the stack a little bit more, USB has this 8-bit preamble, which is great.
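As a minimal sketch of that transition-based decoding, here is a portable C version that compares each sample with the previous one using an XOR followed by a negate. The function names and the choice of representing J as 1 and K as 0 on the sampled wire are assumptions made for illustration; they are not taken from the project's assembly. Feeding it the preamble pattern K J K J K J K K (after an idle J) yields seven zeros followed by a one.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one bit from two consecutive samples (each 0 or 1) of one data
 * line. No transition decodes as 1 and a transition decodes as 0, which
 * is an exclusive NOR: XOR the samples, invert, and keep the low bit. */
static inline uint8_t nrzi_bit(uint8_t prev_sample, uint8_t cur_sample)
{
    return (uint8_t)(~(prev_sample ^ cur_sample) & 1u);
}

/* Decode `count` samples taken once per bit time. `idle` is the line
 * level before the first sample. One decoded bit is stored per byte in
 * `bits_out`, which must have room for `count` entries. */
static void nrzi_decode(const uint8_t *samples, size_t count,
                        uint8_t idle, uint8_t *bits_out)
{
    uint8_t prev = idle;
    for (size_t i = 0; i < count; i++) {
        bits_out[i] = nrzi_bit(prev, samples[i]);
        prev = samples[i];
    }
}
```

On the real hardware this comparison has to fit inside the 32-cycle bit budget, so the equivalent logic would live in hand-timed assembly rather than a C loop like this one.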
This is basically where it goes from the K state to the J state: KJ, KJ, KJ, KK. And the nice thing about this is that each bit, remember, is 32 cycles, and if you do the math, that means we actually have 256 cycles before the start of data. So that's a ton of time to get synchronized to the pulse, which again means that we're probably going to be in pretty good shape when we go to actually implement this. Following that is the data section. On low speed, we're fortunate that we're limited to 11 bytes of data at the most. And then we get this single-ended zero, which just means, hey, it's the end of the packet. That's when both of the lines go down. Normally they're opposite for the K state and the J state, but if both of the lines are at zero volts, that means it's the end of the packet.

There's just one more little niggle you have to worry about, and it's called bit stuffing. It turns out that the wiring doesn't like it if you send the same value over and over again. If you send six ones, that means the line is going to stay in the K state or the J state the whole time, so you have to add a zero. You have to flip. Here's a scope screenshot. You can see the preamble, the 8-bit preamble, on the side. And then I sent a whole bunch of ones, and you can see the state flipping there because it has added these zeros. It has stuffed in zeros that the receiver is going to ignore.

Great, so what's important? What can we ignore? And what do we actually have to get accurate? Timing and sync: that is incredibly important, because USB is self-clocking. If you get the timing wrong, if you get the sync wrong, it's not going to be able to work at all. Bit stuffing is also important. If you forget to stuff bits, then the signal on the other side is going to come out completely wrong. You have to handle error conditions as well, such as keep-alive. The host will send empty packets to you every once in a while. You have to worry about framing.
If you get into the packet after the sync, you have to recover gracefully, and you have to handle overflow. If for some reason the host sends more than 11 bytes, you have to handle that case as well. What's not important: you get six bits in which to look for the KJ pattern at the beginning, and that's ages to synchronize. And for the most part, we only need to check one wire. We don't have to check both D+ and D-. We can just check D+, see what it was 32 cycles ago, and see if it's changed now. The only time you need both wires is to detect SE0, when they're both low, but the nice thing is you just add the two up, and if it's zero, then you know it's SE0. One other feature is that the USB spec allows you to miss up to three packets, so if for some reason we missed a packet, the host will just try sending it again.

Now, one thing I haven't mentioned is signal integrity, and if there are any analog people in the house, avert your eyes. This is my trace for the USB signal pins, and if you're an analog person, this is not a matched pair at all. Normally you want differential routing, where the traces should be as close together as possible, and they should be the same length. I didn't do that here, and it turns out that's just fine. The scope that I had access to had a feature that lets you generate an eye diagram, which lets you know how in spec you are. The USB spec doesn't even mention eye diagrams in the low-speed section, but this is what it looks like. Basically you want to avoid the red bits, and we do that actually really well. So even with the awful routing on there, it works just fine.

Actually, that's a bit of a lie. When I went to go take the screenshot, this is what I ended up getting, and this is actually a text message I sent to Bunny. I was like, hey, why is it failing? If you look at the beginning, you can see there is a little bump along the top and on the bottom that isn't in the other one. You can see there's a little bump where it dips down to the red.
That's because the pins on the chip that I was using were set to a high slew rate by default. For any programmers in the house who have had to do firmware and have seen that setting, high slew rate or low slew rate, what it means is that when the pin changes, it's going to try and slam it high or slam it low as quickly as possible, which, it turns out, is wrong for 1.5 megahertz. Anything below about 15 megahertz is low slew rate. As soon as I set it to low, it started looking good, like this, and it started passing tests, which is kind of a free tip: in hardware, faster is not always better. You might think, fast slew rate, I want fast slew rate. No, for this case we wanted slow. That fixed the problem, and it passes signal integrity, passes the eye diagram, no problem. And in fact, because we have separate D+ and D- pins, the bits don't even come out at the same time. There's something like a seven nanosecond delay. This is fine. The hardware spec doesn't care about this. We're still within spec, even though there's a small delay between the bits. So USB is a very forgiving protocol in this sense.

If you're designing for a chip, you can actually use a 32.768 kilohertz crystal. This is a very common crystal, because you divide down by two to the, what, 15th, and you get a one hertz signal out of it. So these crystals are very cheap. The chip that we're using takes 32,768 hertz and multiplies it by 1464, which gets us exactly 47.972352 megahertz, which it turns out is close enough. With USB, it's not that many bits, so it works just fine.

So that's the hardware, the low-level signaling and all that. I'd like to take a moment to talk about the development setup that I used to develop this. On the top is one of the boards I was developing USB on, and on the bottom is a Raspberry Pi. Raspberry Pis are, as you know, very cheap. They're 30 to 35 US dollars and they're everywhere. Everyone has a Raspberry Pi, it seems.
And there are, in this case, three wires that run from the Raspberry Pi to the device under test. These three wires are serial wire debug data, serial wire debug clock, and reset. And actually the reset's optional. We can run just two wires over if we need to. The board is powered over USB, so that's where we get the ground from. On the Raspberry Pi we run OpenOCD. Normally you'd get a JTAG box, and I know Olimex makes them for 50 euro. This is 35 bucks and it has Ethernet. It works remotely. And the greatest thing is, if you're developing on it and you want to share your project with somebody else, if you need help, if you want to get somebody else developing on it, you're going to send them a board anyway. Just toss the Raspberry Pi in the box as well when you mail it to them, because then they're running the same software as you. When you're developing on Windows or Mac or Linux, you have to download the toolchain. You have to configure the JTAG box. You have to do all this. With the Raspberry Pi you plug in HDMI. You point GDB at localhost port 3333. You get stack traces. You get loading code over GDB. It's so easy, it's so powerful. And that's how we do development now. It's game changing, really. So with OpenOCD, you can do serial wire debug bit-banged over the GPIO pins, and it works fantastically for these small projects.

There was some other hardware used for development: a really nice scope that I had access to. Occasionally I had access to Bunny's really, really nice scope that actually did protocol decoding, and as soon as I got it to decode USB I knew I was set. I got to use an OpenVizsla, which is an open-source USB analyzer that is fantastic for decoding kind of questionable and dodgy USB packets that come across. And I had access to a USB Beagle, which is fantastic for decoding nicely formatted packets that come across, and it has a very nice UI.
So this is kind of the order in which I used them. First the nice scope, then the really nice scope, and then the OpenVizsla. And now I'm up to using a USB Beagle for decoding the protocol, and I'm going to have some screenshots of what that looks like later on.

Okay, that's enough about hardware and setup. Let's talk about the actual API that we use. The low-level API, remember, is the thing that's specifically tuned for a 48 megahertz Cortex-M0+. There are two functions, read and write. They both follow the ChibiOS convention of ending in I, because they're designed to be called from an interrupt context. They're very timing specific. You call usbPhyReadI to get data from the pins, and you call usbPhyWriteI to write data out the pins. You never really want to call these functions yourself unless you happen to be getting screenshots for a presentation. Instead you want to call grainuumCaptureI, and I'm going to go into why this is necessary later. But this will take care of sending responses when you want to respond to something that the host has sent.

Now let's get into some programming tricks. We haven't actually written much code, but one thing that should go without saying, though you don't really know it until you get into this, is that you want to run from RAM. If you run the code from flash, flash isn't consistent. With the particular MCU that we used, flash had a cache of anywhere between 48 and 64 bytes. They're not really specific as to what it is, and as soon as you exceed that, it has this tendency to add delays and stall your code, which is not what you want when you have performance-critical code.
So if you're writing assembly code, you just put a section directive up at the top and give it a section name somewhere in data. To put RAM text in data, you add something to your linker script which looks vaguely like this, and then GCC will take care of loading it into the correct section, and it'll create thunks to get into your code. It's actually really easy to run code from RAM and get cycle accuracy.

Also, it goes without saying: use registers for storing data. With the writer we had a lot of register starvation, because ARM only gives you eight, or 12, or 14 registers, depending on how you count. There was one point where we had an extra cycle that we could use, and so we ended up using the stack for storage. But if you can, use registers, even if it's one of the high registers that you can't do math with.

And finally, when you build this hardware, you did build two, right? You're building a reader and a writer, so hook one to the other to use for testing. Send bits out one, read bits in the other. This is your test bench. Test it against itself before you test with a host. And in fact you can see that for this particular demo there are two different banks of pins that we use, and it still works just fine.

Okay, let's talk about the low-level PHY API. Two functions, connect and disconnect. Connect sets the pins to be inputs, which lets the host detect that you plugged it in. Disconnect sets them to outputs and makes it seem like nothing is plugged in at all. So to simulate pulling it out and putting it back in, call disconnect and then connect. Then we have receive packet, which is defined as weak. You'd override this in your code to do something like call a function to process the data that just came in. And you have this capture packet function, which, again, will take care of the USB state machine, sending replies when you need to.
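To illustrate what a weak receive hook looks like in practice, here is a small sketch using the GCC/Clang `weak` attribute. All of the names (`example_receive_packet`, `example_capture`, `default_handler_calls`) are hypothetical stand-ins for illustration and are not the project's real symbols. The library ships a default, do-nothing-much handler; an application that defines a normal (non-weak) function with the same name in another source file replaces it at link time.

```c
#include <stdint.h>

/* Hypothetical counter so the effect of the default handler is visible. */
static int default_handler_calls;

/* Default handler. Because it is weak, a non-weak definition with the
 * same name in another object file takes precedence when linking. */
__attribute__((weak)) void example_receive_packet(const uint8_t *packet, int len)
{
    (void)packet;
    (void)len;
    default_handler_calls++;
}

/* Stand-in for the capture path: pass each captured packet to whichever
 * handler the linker selected. */
static void example_capture(const uint8_t *packet, int len)
{
    example_receive_packet(packet, len);
}
```

The appeal of this pattern on a small microcontroller is that it needs no function-pointer registration at runtime: the override is resolved entirely by the linker.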
With that, I'm going to go over some of the USB packets and why we need that capture function in the first place. I said that a USB packet is up to 11 bytes, and what those bytes are is, well, kind of important. Every packet starts out with an 8-bit PID. This is an 8-bit sequence that specifies what that packet is. It could be IN, OUT, SETUP, DATA1, DATA0, ACK, or NAK. If it's IN, OUT, or SETUP, it's an 8-bit PID followed by a 16-bit chunk that's an endpoint, an address, and a CRC5. If it's a DATA0 or DATA1, it's an 8-bit PID followed by anywhere between zero and eight data bytes, followed by a CRC16. And if it's an ACK or NAK, it's just a PID. There's no additional data.

The thing is, we need to respond really quickly. If the host sends us an IN packet, we have six and a half bit times, which is not that long at all, to either send data, because that's the host asking for data, so we actually have to send that data right away, or send a NAK to tell it we're still working on it, hang on. If we don't do that, it's just going to keep asking, because remember, we have three chances to respond, and if we don't respond it's going to assume that we're dead and it's going to reset the link. So we have to respond as quickly as possible, and if the host sends us some data, we have to send an ACK as soon as possible. Now, one trick that you can use to make this easier is some clever array misalignment.
The thing about ARM chips is that they're really, really happy when everything is 32-bit aligned, and as you can see, there's this one byte at the front that is kind of misaligning everything. What you can do is put three bytes of padding in front of it to make it work, and then you can just cast the 16-bit endpoint, address, and CRC chunk to a uint16 and shift fields off of it. It becomes so much easier to do that than to work with each of the individual bytes and then do the swapping and the masking and the ANDing. It's easier just to treat it as a uint16, which you can do with this clever misalignment, and there's actually a preprocessor macro for Grainuum buffers that takes care of this in the code, which you can use if you want to.

Now we've got these frames, and the question we have to ask is what's important in a frame, and what can we ignore? Again, the response time is really important. If you don't respond quickly, the host is going to reset the link. The frame sequence is also important. It's just as if somebody said hello and you responded with, I don't know, banana: they're going to be really confused and not know what to say next. So you have to respond with the correct response in the correct sequence. The outgoing CRC16 is also really important. That has to be correct. If it's not correct, then again the host will reset the link. And it's also important that if there's a DATA0 and a DATA1 packet, you get the toggling correct, and the spec gets a little bit vague when it comes to that. But if you don't get it correct, the host is going to think it's missing packets, and again it'll reset the link or never get the data.

Fortunately, there's a lot of stuff that's not so important that we can ignore. We can send NAKs. This is what your keyboard does, for example. The host will be asking it for data: is there any data, is there any data? If you don't press a key there's no data, so the keyboard will just respond with a NAK. The IN, OUT, and SETUP packets have a CRC5.
We never generate them, so we never have to worry about generating CRC5s. We never even have to check if they're correct, because we don't have the time to do that. So we'll just assume the CRC5 is correct. We can also ignore the CRC16 on incoming data, because we have 6.5 bit times to respond, and that's probably not enough time to generate a CRC of the packet. So we'll just assume it's correct. We'll just throw up our hands. We can also ignore the address from the IN, OUT, and SETUP packets. We'll just assume that the hub has directed the packet to us, and if we're seeing it on the pins, it's for us. So we can just ignore the address, and this works reasonably well.

Now that we have all that, let's talk about the state machine. The USB spec is kind of ugly, and the state machine is kind of ugly. This is from the USB 2.0 spec. They actually say you're not supposed to implement the state machine directly. You're supposed to use it as a guideline. It's a mess. It takes a while to debug. So there's a state machine API that takes care of all this for you. It's two functions. Send data, which you call when you want to send data that the host will ask for later. So for example, if you press a key and you want to send that as a USB packet, in Grainuum you specify endpoint zero, and at some point in the future, when the host says, hey, have you got any data for me, Grainuum will respond with the data you queued. And then there's this other function, Grainuum process, that will just turn the crank on that state machine. You get a packet from capture, you pass it to process, and the USB state machine goes and does its thing. And from this point on, we're just going to hand-wave it: that's Grainuum.

Now you get to the user code, and if you've done any sort of USB yourself, this is what the configuration looks like. This is how you define your particular application. It's a series of functions.
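A configuration made of callbacks can be sketched as a struct of function pointers. Everything below is hypothetical and written only to show the shape of such a structure: the struct name, the field names, the signatures, and the placeholder vendor and product IDs are assumptions, not the project's actual declarations. The 18-byte device descriptor layout itself follows the standard USB device descriptor format, with an 8-byte maximum packet size on endpoint zero as low speed requires.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical configuration: one function pointer per callback. */
struct example_usb_config {
    /* Return the requested descriptor and store its size, or NULL. */
    const void *(*get_descriptor)(uint8_t type, uint8_t index, size_t *size);
    /* Return a buffer where data arriving on `endpoint` may be stored. */
    void *(*get_receive_buffer)(uint8_t endpoint, size_t *size);
    /* Called after `len` bytes have arrived on `endpoint`. */
    void (*receive_data)(uint8_t endpoint, const void *data, size_t len);
    /* Optional hook; may be left NULL. */
    void (*set_config_num)(uint8_t config_num);
};

/* Standard 18-byte device descriptor with placeholder IDs. */
static const uint8_t example_device_descriptor[18] = {
    18, 1,            /* bLength, bDescriptorType (DEVICE) */
    0x00, 0x02,       /* bcdUSB 2.00, little-endian */
    0, 0, 0,          /* class, subclass, protocol: defined per interface */
    8,                /* bMaxPacketSize0: 8 bytes for low speed */
    0x09, 0x12,       /* idVendor (placeholder) */
    0x01, 0x00,       /* idProduct (placeholder) */
    0x00, 0x01,       /* bcdDevice 1.00 */
    0, 0, 0,          /* no manufacturer, product, or serial strings */
    1                 /* bNumConfigurations */
};

static uint8_t example_rx_buffer[8];
static size_t example_last_rx_len;

static const void *example_get_descriptor(uint8_t type, uint8_t index, size_t *size)
{
    if (type == 1 && index == 0) {          /* DEVICE descriptor */
        *size = sizeof(example_device_descriptor);
        return example_device_descriptor;
    }
    *size = 0;                              /* anything else: not provided */
    return NULL;
}

static void *example_get_receive_buffer(uint8_t endpoint, size_t *size)
{
    (void)endpoint;
    *size = sizeof(example_rx_buffer);
    return example_rx_buffer;
}

static void example_receive_data(uint8_t endpoint, const void *data, size_t len)
{
    (void)endpoint;
    (void)data;
    example_last_rx_len = len;              /* just record what arrived */
}

static const struct example_usb_config example_config = {
    .get_descriptor     = example_get_descriptor,
    .get_receive_buffer = example_get_receive_buffer,
    .receive_data       = example_receive_data,
    .set_config_num     = NULL,             /* optional hook left unset */
};
```

A real device would also need to return configuration, interface, and endpoint descriptors from the descriptor callback before a host would finish enumerating it.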
And because one of the themes of this talk is what's important and what's not important: from this structure, what's important and what's not? Because there's a lot of stuff that isn't important. The get descriptor function is very important in USB. It describes the device, what kind of device it is, what things it supports. So in the user code, in this configuration structure, the get descriptor function is very important. If you don't implement that, things won't work. Get receive buffer is also important, because when the host sends you data, you need a place to put it, and Grainuum will call this function. Receive data is also kind of important. When the host sends you data, you need a function to call. But the rest of it's not really important. There's a set configuration number and a couple of other things. These are mostly hooks that could be used in your code if you want to do fancier things.

And with that, if you actually set up a get descriptor function and you start describing something, it actually works. This is Beagle output from, I believe, a keyboard that I was setting up. It works. At this point, the host can talk to the device, and with the proper configuration of the functions, the setup process works. The setup process is basically just asking for a whole bunch of descriptors. It's not really useful or interesting at this point.

Now that you have this, you need to start working on the user code, the thing that you as a developer are going to write. One of the annoyances that you run into in this case is that every time you break into the debugger, the device is going to stop sending NAKs saying, hey, I don't have data. This is an example of what it looks like. You can see the host reset the device and started sending SETUP packets again. But because we were broken into the debugger, the host was dead. I mean, the device was dead.
The host correctly detected that the device was in a bad state and it reset it. Unfortunately, this means that if you continue the device execution after going into the debugger, it's going to be in kind of a weird state. So how do you get around that? Well, one tip is that global variables are really handy. Here's an example: I have a function called loop, and then I have a loop count. And if we load this into the system, we can see that it actually is keeping track of the number of loops that we did. And the nice thing about global variables is they're not actually optimized away. If we made this static, if we made it part of the function, the optimizer might say, hey, we never actually read this variable. I'm just going to optimize it away, and look, I made your program faster, which is not what we want in this sort of thing. So if you want to figure out what's going on, global variables are wonderful for that. Another tip is about debugging. For this sort of thing we have breakpoints with OpenOCD, which is wonderful, but they use hardware breakpoints, which are a finite resource. In this particular chip, there are two hardware breakpoints. If we, for example, break on this function, chSysLock, which is a ChibiOS function, we see that it has a breakpoint at 49 locations. And if we look at the breakpoint information, we'll see, yeah, it did. It went and put a breakpoint at every single place this function was inlined, which is no good when we go to actually run this, because it says, hey, you had too many breakpoints. One is too many breakpoints. We asked for one breakpoint, it gave us 49. That's not what we want. So we can actually borrow from JavaScript. There's this preprocessor define you can do where you define debugger to be an inline asm bkpt #0 instruction, which is nice. You can now use the keyword debugger in your code, and anytime it hits that, it will break into the debugger.
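Both tips fit in a few lines of C. This is a minimal sketch rather than the code from the slides: the threshold of 1000, the `volatile` qualifier, and the no-op fallback for non-ARM builds are assumptions added so the snippet also compiles on a desktop machine.

```c
#include <stdint.h>

/* Break into the debugger wherever the word `debugger` appears.
 * bkpt is an ARM/Thumb instruction, so on any other target this sketch
 * falls back to a no-op (an assumption, so it still builds on a host). */
#if defined(__arm__) || defined(__thumb__)
#define debugger __asm__ volatile("bkpt #0")
#else
#define debugger ((void)0)
#endif

/* A global counter that can be inspected from the debugger at any time.
 * volatile is an extra precaution added here so every increment is
 * really written back to memory. */
volatile uint32_t loop_count;

void loop(void)
{
    loop_count++;
    if (loop_count == 1000) {   /* arbitrary example condition */
        debugger;               /* on target, execution stops here */
    }
}
```

Because the breakpoint is an instruction in the code rather than a debugger resource, it does not count against the two hardware breakpoints.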
And this is really nice because this works across optimizations, this works for inline functions, and compiling and reloading is such a fast process that this is almost easier to use than breaking into the debugger with Control-C or breakpoints or anything like that. So that's great. We have this all done. We have a bit-banged USB stack that works, that we can debug, that we can figure out how it works. What now? What can we actually do with it? Could we maybe do a hard drive? No, unfortunately, we can't do a hard drive, because we're doing USB low speed, and low speed only lets us have two kinds of endpoints. We can have interrupt or we can have control, and USB hard drives require bulk endpoints. And if you try to make a low-speed hard drive using bulk endpoints, it might work, but you're breaking spec, so it depends on the operating system. Some operating systems will let it work, some won't. It's the same case with audio. Could we make a USB audio device? No, those use isochronous endpoints, which again are not one of the two kinds of endpoints we can have. What about a MIDI device? Again, those use bulk and they have three endpoints, and with USB low speed we're limited to two. So you can't do MIDI keyboards. You can't do IrDA if you wanted to. You can't do serial devices, because those require three endpoints and bulk. So what can we do? Well, one of the things we could do is DFU, Device Firmware Update. This is a spec that the Openmoko guys helped develop that uses only endpoint zero, and it's useful for firmware updates. We can come up with a USB HID device like a keyboard, a mouse, or a joystick. These are all under the USB Human Interface Device spec. And there's one more. We could do a vendor-defined class, which is, I don't know, a USB teapot, whatever you want. It's a vendor device. It has no class, but it's whatever you want it to be. It's like Zombo.com. Going back to human interface devices, though: human interface devices are easy.
They come with their own descriptor that tells, for example, how many buttons are on a joystick, how many keys on a keyboard, how many buttons on a mouse. They describe themselves, and you plug them in and they just work. One of the other features of USB HID is that it doesn't have to be input. You can actually do vendor-defined USB HID, and you can send data over USB HID. HID is also really easy on Windows. There are no drivers to install. It's handled by the hidusb.sys driver, and you plug it in on Windows and it just works. There is WinUSB, which kind of gives you similar features to Mac and Linux, where it doesn't require a driver and it assumes that the application will handle all that. It either involves reading a special string descriptor or you put it in the Binary Device Object Store descriptor with USB 2.1. I managed to get it working on my Windows machine, but it didn't work on somebody else's Windows machine, which kind of violates the whole making things easy. So if you do USB HID, even with a custom vendor class, it's easy even on Windows. One of the problems, though, is that USB HID isn't super fast. I mean, you're limited to doing a poll every 10 milliseconds. That's the smallest you can make it, and since we have eight data bytes per poll, if you do the math, that's 100 polls a second times eight bytes, so 800 bytes a second. That's not very fast, but that's okay. If you want to do an updater, for example, we don't have much flash, so it's fine really. 800 bytes, I'll take it. Current project status: we have reliable bidirectional transfers. That took a while. We have a common code base that works across multiple projects, and we have ports to a few different platforms. Palawan was of course the first hardware project that I worked on, and it was the one that I did most development on. It works under ChibiOS and uses a polling system timer. There is a bootloader called JoyBoot that I did that's based off of the excellent Fadecandy bootloader by Micah Scott at scanlime.
It has no real operating system and it does all the polling in a main loop, which is nice because that means we now have this software working in two completely different setups, one polled, one interrupt driven, and it works under both. There's also a port to the Chibitronics Love to Code, and there's a talk tomorrow by bunnie at 4 p.m. You should go see it, it's completely amazing. This thing had two USB pins on it, and when it was developed we didn't know if we could have a bit-banged USB stack at all. But it is special in that it gets its data loaded over audio, and it's really cool. So we load the USB stack, it's 4K, it loads over audio, it's fantastic. You should go to the talk tomorrow, 4 p.m. One other thing: what about multi-threading? Well, it turns out that that works. For the Love to Code product, we actually have USB polling running in its own thread, with another operating system, ChibiOS, running in the background. It occasionally skips a packet, but that's fine, because remember, with USB you get three tries. So it works relatively well, I think. For future work, I would like to get a bootloader working, an updater working. There are USB HID updaters in existence. Microchip makes one for their chips. There really aren't any very good ones. There's one open source one that I'm aware of. There is the DFU updater, but that requires drivers on Windows. I'd like a more fully functional updater before too long. It would be nice to get a GUI for the updater as well. And I'd like more platform testing, because right now I do a lot of development on Linux and Windows, and occasionally I have access to a Mac. It would be nice if I could test more host devices, but it would also be nice to get ports to more CPUs, more than just the two Kinetis chips that we have access to. And just kind of a personal thing.
I'd like to produce Palawan hardware myself, but the most important thing is I want to get a functioning updater working, because when you write software, you make mistakes and you want to be able to update it. And updating the updater is a bit of a challenge, so you want the updater to be as rock solid as possible, because I want to be able to send somebody a copy of the binary and know it works for them if it works for me. Thank you for listening. I really didn't know that USB could be utilized in so many different ways, so it was really great. Now we have time for some questions. A few questions, okay. Yeah. So if we have any questions from the hall, please line up at the microphones in a nice row and... Got one? Yeah. Yeah. Hello, first of all, thanks for your talk. On one of the important and not so important slides, you mentioned that you ignore the address. I can perfectly imagine that this is working on ports directly connected to a host. Did you also test this on different brands of USB hubs? Because I could imagine that to go horribly wrong. The question is about ignoring the address. Right now I mention that it's not important. It may be that it's fine. It may be that we cannot ignore the address. I actually do capture the address, because it's just another shift and another store. In my experience, it's been absolutely fine, but the provision is there to get it working if it is a problem. Awesome. Questions from microphone number two? Hello, I have a question. So I'm wondering if it's simply the number of cycles and the speed of the Cortex-M0+ processor that's keeping you from implementing this in high speed, or is high speed USB significantly more complicated? High speed, I think the question was if it's the number of cycles that's limiting implementation of, I think you mean full speed. Full speed is the 12 megabit version. High speed is 480 megabits. Yeah, I'm asking about high speed as well.
High speed as well, okay. All the way. Full speed is actually a bit of a challenge, and the reason why full speed doesn't work... I'm going to say this in the hopes that somebody proves me wrong, but full speed is impossible on this chip. I really hope somebody proves me wrong. But the problem is that the load, the sample, takes one cycle. The shift takes another cycle. The check to see if you need to exit takes a cycle, and then the add one to store to the next value takes a cycle. So that's four cycles right there, and you don't even get to any of the bit stuffing or anything like that. And 48 megahertz divided by 12 megabits per second gives you exactly that number of cycles, four per bit. If you had a DMA engine, it may be possible you could do stupid DMA tricks, but these chips are so low end that they don't even have that. You basically get a CPU and that's about it. So I don't think full speed would be possible. High speed, I think, would be right out at this point. Also because high speed has the additional problem that you need to switch voltages. So it's no longer just running the pins directly to the CPU. You now have to have level translators and things like that, so. Cool. Questions from the internet? Yeah, I got one question from the internet. Did you try to use sigrok for USB debugging, as it is open source and works with some pretty cheap hardware? Sigrok? The question is, did I try using sigrok? No, I've never actually heard of this. So I've never tried using this particular platform. The only debuggers I used were the Beagle and the OpenVizsla. I've never heard of sigrok, but I'm going to have to check it out now. I was wondering whether it's not possible... You're using polling right now, using a timer interrupt I imagine. Could you use edge detection for an interrupt vector? The question is whether you could use edge detection for an interrupt vector. Yes, you can. And that's actually how this is designed to work. Let's see if I could do this. Yes.
If you look at the API, I forget where it is. If you look at the API, it's designed so that you hit an edge, an IRQ, and then you call the capture function. But it's kind of a top half, bottom half thing. The top half deals with handling the interrupt and sending the response back. The bottom half is designed to be running in this polled loop. So yes, you do use the interrupt edge detection to capture the signal and then send a response back, but you do all of your processing in the main loop. Questions from over there? Thanks. How did you do the timing recovery? You said you had plenty of time at the preamble, but accuracy and having plenty of time kind of contradict themselves. I'm actually kind of sad, because I had a slide where I described how that happens. You basically synchronize to the pulses. Let's see, where is it? So you get these pulses at the beginning. And, right. Well, it's here somewhere. So you count the cycles between the... Not even. At the beginning of this, this up, down, up, down, what you do is you continuously, in about a four instruction loop, which is very similar to what it would take for full speed, keep comparing the sample with the previous one and looking for it to change. And as soon as it changes, you know you're synced up with the beginning of the pulse. And from then on, you assume you might drift a little bit, maybe, but it's only 11 bytes. And so it ends up working just fine in the end. Thanks. Questions from microphone number two? Can you build a USB hub with your software? Can I build a USB hub with the software? USB hubs are defined in chapter 11 of the spec, which is about as many pages as chapters one through 10. The other problem is it would be low speed only. And USB requires that a hub handle both low and full speed. So I don't think that would be possible, not without changing the chip and changing the PHY, but it would just involve handling more PID types. So it would be possible with a faster chip.
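The cycle budget that keeps coming up in these answers can be worked out directly. The sketch below is only arithmetic; the helper names are made up for illustration, and the four-cycle figure is the sample loop described in the answers (load, shift, exit check, store).

```c
#include <stdint.h>

#define CPU_HZ          48000000u   /* 48 MHz Cortex-M0+ */
#define LOW_SPEED_BPS    1500000u   /* USB low speed: 1.5 Mbit/s */
#define FULL_SPEED_BPS  12000000u   /* USB full speed: 12 Mbit/s */
#define SAMPLE_LOOP_CYCLES     4    /* load, shift, exit check, store */

/* CPU cycles available for each bit on the wire. */
uint32_t cycles_per_bit(uint32_t cpu_hz, uint32_t bits_per_second)
{
    return cpu_hz / bits_per_second;
}

/* Cycles left over per bit after the sample loop has run.
 * Zero or less means no room for bit unstuffing or anything else. */
int32_t spare_cycles(uint32_t cpu_hz, uint32_t bits_per_second)
{
    return (int32_t)cycles_per_bit(cpu_hz, bits_per_second)
           - SAMPLE_LOOP_CYCLES;
}
```

At low speed this gives 32 cycles per bit with 28 to spare, while at full speed the four-cycle loop consumes the entire budget, which is why a faster chip would be needed.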
Questions from the internet? I got two short ones. The first would be, do you think higher speed USB would be possible with an external PHY? And what would be required to port it to a Cortex-M3 or M4 processor? So what was the first question again? Do you think higher speed USB would be possible with an external PHY? PHY? PHY. I think it would be possible. I mean, they do have Cortex-M0s and M0+s that have a USB PHY built in. And so with an external PHY, of course, it's possible. I mean, with something that actually does the decoding and streams it in, I think it would be possible. The second question is, could it be ported to an M3 or an M4? The cache makes things a little bit weird. And I can't remember if M3s and M4s have caches, but they're faster. And so it definitely could be ported to an M3 or M4. You'd just have to insert wait states. In fact, with the faster chip, you might be able to do full speed even. So that would be an interesting port. Yeah. Could you maybe elaborate a bit on writing to the human interface device? Especially, I'm curious if on Windows it's required to install an INF file or something like that? No. The question was about elaborating on writing to a human interface device. In other words, using a HID class device as an output device. And there's a wonderful library that I use that is available for just about any programming language out there, called Signal 11 HIDAPI, I think it's called. And it abstracts all that away. It's one library that runs on Windows, on Mac, on Linux using hidraw, and on Linux using libusb. It handles all that for you. You just need the right kind of descriptor that basically tells it the vendor format is eight-bit frames, eight of them total. And you just write to it using the Signal 11 library. On Windows, it does not require a driver at all. And so I think that's the ideal way to do it for Windows. Yes? Hi.
You've talked about prerequisites for using this software with different operating systems. But how about using it at the BIOS stage? Can it emulate a keyboard while in the BIOS, for example? The question was, can it emulate a keyboard, or a mouse actually, in the BIOS? And that is defined in the header of the USB HID descriptor. There is a boot protocol field in there that a lot of times is set to zero. If you set it to one, it acts as a USB keyboard that the BIOS can then understand. It's a very simplified version of the USB HID interface. It doesn't really pay so much attention to the number of keys and all the descriptors. It just assumes that it's a PC 101-key keyboard that is designed for use in the BIOS. So yes, if you set this flag, then it will work. Thanks. Three questions left? Yes? Microphone angel? No? Internet? OK. So I totally forgot to ask you to give Sean a big round of applause.