Okay. Can we start? All right. Then please welcome Drew Fisher. He's telling us about reverse engineering USB devices, just like the Kinect I see standing there. Maybe you can elaborate on that later. Please welcome Drew Fisher.

Hi everybody, I'm really excited to be here today talking about reverse engineering USB devices. I was in the group that reverse engineered the Kinect protocol, and in particular I worked on the Kinect audio protocol. So I'm going to talk a bit today about how that went, some things that I think were helpful from the USB protocol that all these protocols are built upon, and then some ideas for what better tools of the future might look like to help reverse engineers like me.

So who am I? I'm Drew Fisher. I am a grad student at UC Berkeley. I maintain libfreenect, a set of open source Kinect drivers built from reverse engineering. It's at the URL you have right there.

Today I'm going to talk a bit about why we want to reverse engineer USB devices and protocols, then the basics of USB that you need to understand to get the benefit of building off of USB, then a case study of the Kinect audio reverse engineering, and then my vision for better tools and some Q&A.

So why do we want to do this at all? There are a lot of USB devices out there. Pretty much every new device, every new peripheral for computers that you're getting is USB. Some of the really cool ones are the ones that haven't had devices of their kind before. Things like depth cameras haven't really been widely available. Microphone arrays that do fancy things with noise cancellation, that's not really something that we've developed a wide spec for. So the more unique a device, the less likely it is that the vendor has provided us with a free and open source driver, and the less likely it is that we can use it with all of our hardware and operating systems. So we'd like to speak the same protocol as the device, and it bears repeating that these are all built on USB.
USB is an abstraction layer for what we're doing: we know that we're speaking the same language at that level, and so we just have to figure out the next one and then the next one and then the next one since, as we know, protocols tend to stack on top of each other.

To do this, we can use the standard black box state machine view. We know that when we plug the device in, it's in a particular state, and then the driver needs to track all of the state transitions that happen throughout the course of talking to the device. Then, to actually do anything useful with the data that we receive from the device, if it's an input device, or to know how we should format our output to the device, if it's an output device, we need to understand the structure of that data.

So the obvious thing to do here: these are devices that already have a driver working somewhere that is used to interoperate with that device. So let's just watch exactly what's said back and forth between the host and the device when they're in normal working order, and figure out what's the data, what's the state, and how we can encode our own data or state accordingly.

So, USB. It's worth noting that the host and the device are separate concepts and that the host is in charge of all communications. If the device wants to say something, it has to wait for the host to ask. You can tell which end of your cable goes to the host and which to the device, because this end plugs into the host and the one that's more blocky plugs into the device.

Each device may have multiple configurations, and each device may have multiple endpoints, which are separate data streams. They're usually unidirectional. The transfer types are important to the point that I'm going to talk about them briefly. You have control transfers, you have interrupt transfers, you have isochronous transfers and you have bulk transfers. And they're used in different situations.
Control transfers are the baseline for how USB devices talk to and from the host. It's how you get the name of the device, its configuration, vendor ID, product ID; all these things come from control transfers. And they're usually used for basic signaling information.

We have interrupt transfers, which are designed to be a way to allow the device to notify the host that something important about the world has changed. What happens is, if you have an interrupt stream, the host will ask the device every so often, "hey, do you have any new information?" and then the device will say yes or no. That way the device can tell the host when something of interest has happened. This is what's used for keyboards and mice and other human interface devices. If you have one of the car simulators, it can do force feedback, since you can also do host to device. So that's what these are usually used for.

There are isochronous transfers, which are really useful for time-sensitive data, in particular cases where you don't really care about getting every single packet; it's okay to drop a few as long as everything keeps in sync with real time. There are video class and audio class specifications that build on top of these. And the neat thing about these is that they have guaranteed bandwidth. When the host sees that the device has an isochronous endpoint, the host will allocate all of the bandwidth that that endpoint could potentially use when it schedules bandwidth on the bus.

And then you have bulk transfers, which are generally best effort: whenever you have other bandwidth that can be scheduled, it can be allocated toward bulk transfers. These are generally used for large data transfers. This is the type of transfer that you'd see for mass storage devices, so disks, flash drives, and so on. It provides reliable delivery, with retransmission of the packet if the first submission fails, and it provides error checking with CRCs.
So we can guarantee that this type of packet will arrive reliably, eventually.

So why was all this useful? Under normal operation we have the host, and the host has a driver, and it's tracking the device's state. Everything that is changing about the world is encoded in these transfers that go back and forth. State changes require reliable delivery. Streaming real-time data does not. So you can start to guess which kind of transfer is going to be used for which kind of data or which kind of state change.

So now you've got a USB device. You want to make a new driver that works with it. What do we do? This is kind of what we have so far. You take the working system, you stare at it until you understand how it works, and then you write the driver. But this is kind of opaque. Step two could be replaced with "magic" and step three with "profit" and you'd all think the same thing.

So here's what we did for the Kinect. There are various ways to capture the transfers that go back and forth. You can get in-line USB sniffers. There's a commercial product called the Total Phase Beagle 480. It costs $1,200. I don't own one. There's an open-source project called OpenVizsla. Yeah! You guys are awesome. I'm excited for whenever I'm actually going to get mine, because it will make doing this sort of thing a lot easier for me. So that's on the hardware side. And if you can't actually run code on the host, then this is what you're left with. You have to simply sniff the transmissions on the wire by being plugged in in-line.

Alternatively, you can push that down into the OS that you're using. On Windows, there's a program on Google Code called BusDog. I think it's making fun of an older product called Snoopy. For some reason, all of these are named after dogs. I don't know why. I guess something about sniffing, bloodhounds, whatnot. But yeah, that installs a USB filter driver which allows you to log everything that is said to or from a particular device.
So be careful with this on 64-bit because of code signing issues. Just read all the details before you dive into the install. On Linux, we have usbmon, which is a built-in thing that does effectively the same thing.

So we can get data. What's next? We're trying to understand the data. We can download Total Phase Data Center and the USB dumps that the awesome folks at Adafruit Industries made for us. They posted them on everyone's favorite binary distribution site, GitHub, which was odd for large binary data. But anyway, we can open this and then we can start reading through all of the transfers. So I have this open over here. It's probably completely illegible from however far back you folks in the back row are. This is Data Center with that packet capture opened and filtered for just the Kinect audio device. It starts out with standard configuration things. And then it starts doing data transfers. This is on a bulk endpoint.

And so we start thinking, OK, how do we understand this? We want to look for patterns. There are a lot of problems that developers face as they are trying to design these things to work reliably and well and extensibly, and to be able to support them over time. So they come up against some common problems. And it turns out that there are some fairly patterned solutions to each one of these.

So, protocol versioning. You put some magic bytes in at the beginning of every transfer. And then if it's a particular set of magic bytes, that matches a certain version number. Over here, the first four bytes of this transfer are 09 20 02 06. And the same for this one. And if we scroll through, the same for this one. So this appears like it might fit the bill. And even more so, we can think of what these might mean if we flip them as little-endian. Then it reads 0602 2009, or June 2, 2009, which is the day before Project Natal was announced, which was the code name for the Kinect.
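The little-endian reading of those four bytes can be checked in a few lines of Python. This is only a minimal sketch: the byte values come from the capture described above, while the names `MAGIC_BYTES`, `magic_as_le_hex` and `has_magic` are made up for illustration and are not part of libfreenect.

```python
import struct

# First four bytes of each transfer, in the order they appear on the wire
# (values as read off the capture in the talk).
MAGIC_BYTES = bytes([0x09, 0x20, 0x02, 0x06])


def magic_as_le_hex(raw: bytes) -> str:
    """Interpret four bytes as one little-endian 32-bit integer, shown as hex."""
    (value,) = struct.unpack("<I", raw)
    return f"{value:08x}"


def has_magic(transfer: bytes) -> bool:
    """Hypothetical helper: does this transfer start with the magic bytes?"""
    return transfer[:4] == MAGIC_BYTES


if __name__ == "__main__":
    # Prints the digits that can be read as a date: 06 02 2009.
    print(magic_as_le_hex(MAGIC_BYTES))
```

Running `has_magic` over every transfer in a log is a quick way to confirm or refute the guess that the prefix really is constant.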
So we have this magic byte sequence appearing. Next, packet fragmentation and reassembly. Generally, USB won't let you do a transfer that is one megabyte large. You have to break it apart and reassemble it. How do we do this with TCP? We're going to do the same thing. We have length bytes. And then we have sequence numbers.

In this transfer, it turns out that these four bytes are a little-endian length: how many more bytes am I going to send you next? So that's hexadecimal 4000. And then sure enough, we follow up with 32 512-byte transfers, which sum to that value.

What else is going on? We have this number. It's 2 in this transfer. It was 1 in the previous one. It's 3 here. And if we look at all the data, we see that it's incrementing by 1 every time. These are sequence numbers. And in fact, in the replies, we get in the same byte the number that we sent last. So this is matching messages with replies. It could be possible that some vendors would issue multiple commands and then receive multiple replies asynchronously. And they might happen out of order. So this is a fine way to reorder things, just the way that we would with TCP.

Timestamps: for audio or video data, sometimes you'll have a clock on the device that you want to keep things synchronized with. So you may see things that are increasing at a rate that is commensurate with the increase in time. These things are fairly common. And we can look for them.

So what we come up with is that the first thing the Kinect does is receive a blob of firmware every single time you plug it in. Each command has got those four magic bytes at the beginning. It's got a tag, which is that sequence number. It's got a number of bytes that it's going to upload. It's got a command. Since with the bootloader you can both send it data that is part of the firmware and then tell it to start running from a new execution point, those are separate commands.
And then it's got the address at which it should place this new data or jump to. So it does all that, and that's one of these commands. And so you start building up this repertoire of various command structures that you find in these transfers. And hopefully, you eventually build up enough that you understand everything that's going on in the protocol, or at least enough that you can replay the bits that you don't fully understand just the way that the actual driver did.

So it turns out, if we switch over here to looking at the audio device after it's rebooted, we wind up with a bunch of transfers in. These wind up having the same magic bytes that we talked about earlier. You wind up with a bunch of 524-byte transfers, of which you've got a 12-byte header and then 512 bytes of data. And the header looks like this.

Taking this giant data set that you have from this USB log, I wound up throwing it into Python and then tinkering with ways to transform the data there, with different possible interpretations and structures. This seems essential to any understanding of data. You just tinker with it until you see what makes sense and what works. And you want to test different possible interpretations. You want to see if certain invariants hold true, like the sequence numbers, or perhaps you have a set of length bytes that tell you how long the rest of the packet is going to be. You want to see if these things hold true always, so that you can continue considering them, or if you can find an example where, no, that interpretation is wrong and you should try something else.

So, armed with all this information, you can then write a driver that does the same sort of transfers that you've observed the actual driver doing. libusb is a library that allows you to do USB communications in user space.
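Going back to the command structure and the invariant checks described above, here is a minimal sketch in Python of what that tinkering might look like. The field order (magic, tag, byte count, command, address) follows the talk; treating every field as a 32-bit little-endian integer is an assumption (the talk only confirms that for the magic and the length), and the names `Command`, `HEADER`, `parse_command`, `tags_increment` and `length_matches` are hypothetical.

```python
import struct
from typing import NamedTuple, Sequence

MAGIC = 0x06022009  # the four magic bytes read as a little-endian integer


class Command(NamedTuple):
    magic: int
    tag: int      # sequence number
    length: int   # number of payload bytes that follow
    command: int  # e.g. "write data" vs. "jump to address"
    address: int  # where to place the data, or where to jump


# Assumed layout: five consecutive 32-bit little-endian fields.
HEADER = struct.Struct("<5I")


def parse_command(raw: bytes) -> Command:
    """Decode the start of a transfer as a command header."""
    return Command(*HEADER.unpack_from(raw))


def tags_increment(commands: Sequence[Command]) -> bool:
    """Invariant: each tag is exactly one more than the previous tag."""
    return all(b.tag == a.tag + 1 for a, b in zip(commands, commands[1:]))


def length_matches(cmd: Command, payload: Sequence[bytes]) -> bool:
    """Invariant: the length field equals the total size of the follow-up transfers."""
    return cmd.length == sum(len(chunk) for chunk in payload)
```

If either check fails anywhere in the log, that is the counterexample telling you the interpretation is wrong; if both hold across every command, the guess is worth keeping.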
A user-space library like this is great for prototyping, because it's painful and annoying to write kernel drivers and reload them and hope that you don't accidentally crash your machine or something in your development cycle, particularly when you don't know whether what you're doing is going to work or not. So libusb is a handy tool for prototyping.

So now I'm going to give a brief demonstration of what works. What I did right there was I ran a program from libfreenect, the open source Kinect drivers. And what it does is it uploads the firmware as extracted from that USB log. We later found that this matches, byte for byte, firmware from an Xbox 360 system update. And what you see is indeed my voice. It's got four microphones: there's one here, and then there's three on this side.

The Kinect can actually receive what's playing out of the Xbox's speakers. It receives that stream. It does noise cancellation. It has a calibration sequence where each microphone's impulse response from each one of these speakers is computed. And so with that, it can subtract out what the speakers are playing, so it can still hear you say "Kinect" while you're playing your game. And you find all these things from the logs, interpreting different sets of bytes as, let's try, 32-bit little-endian floating point numbers. And then you see that the calibration data looks like a spike followed by a quiet period. And that's what you'd get if you had this ideal single impulse versus what actually happens to the microphone.

So given all this, what can reverse engineering tools help us to do better? The core of all this was that it's a human who's figuring out what the meaningful structure is in all this. So we want to help the human notice these sorts of patterns, particularly the common ones that I mentioned. We have reusable problems and reusable solutions for each of them. Our tools should be able to help us find those when they're being used.
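As one illustration of what such a tool could do, here is a minimal sketch in Python that scans a list of captured transfers for offsets that behave like a sequence number, meaning a little-endian field that goes up by exactly one from each transfer to the next. The function name `candidate_sequence_offsets` and its parameters are made up for this example; it is not an existing tool.

```python
import struct
from typing import List, Sequence


def candidate_sequence_offsets(transfers: Sequence[bytes], width: int = 4) -> List[int]:
    """Return byte offsets where an unsigned little-endian field of `width`
    bytes (1, 2 or 4) increases by exactly 1 between consecutive transfers."""
    if len(transfers) < 2:
        return []
    fmt = {1: "<B", 2: "<H", 4: "<I"}[width]
    shortest = min(len(t) for t in transfers)
    hits = []
    for offset in range(shortest - width + 1):
        # Read the same field position out of every transfer.
        values = [struct.unpack_from(fmt, t, offset)[0] for t in transfers]
        if all(b == a + 1 for a, b in zip(values, values[1:])):
            hits.append(offset)
    return hits


if __name__ == "__main__":
    # Synthetic example: magic, then a tag that counts 1, 2, 3, then a length.
    log = [struct.pack("<III", 0x06022009, tag, 0x4000) for tag in (1, 2, 3)]
    print(candidate_sequence_offsets(log))
```

The same brute-force idea extends to length fields: for each offset, check whether the value there equals the number of bytes that follow.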
Tools to find possible packet length bytes or sequence numbers or things like that: these are things that data analysis tools should be able to help us with.

Next, we want to help people test their hypotheses about possible interpretations against all of the data that they have available to them. Data on USB flies by way faster than we can reasonably keep in our heads. You can't think of something and check every megabyte individually. It doesn't scale. So what we need our tools to do is help us deal with the larger amount of data that we have. What would work well would be something that allows us to specify a structure and then perhaps an invariant, and then test if that invariant holds. These things would help people come to better understandings of the data that they have.

And then third, this was one guy staring at hex dumps. One guy may not always come up with all of the clever revelations that a protocol might encode. So let's help a guy and a girl or several guys work together, and let's make some collaborative tools. We still want to be able to pivot easily on all this data. So let's make a read-eval-print loop that you can use on the web with other people, with namespacing so you don't clobber each other's data. Something like that, where you could collaboratively reverse engineer these things instead of doing it all on your own.

So that was my experience. Are there any questions?

Thank you very much. I think we have time for about one question, so let's make it a good one. Who has a good question? No pressure. You? OK, I'll get the mic to you.

Thanks. So how similar is that protocol to the normal, regular audio protocol using USB?

Similar in some regards, in others, not so similar. You wind up with the artifact that you have to timestamp things back and forth between the host, which is sending whatever is playing out of the speakers at that point in time, and whatever the device is sending back, because there is some latency in all this.
So you have to calibrate for the latency. And that's something that you generally don't have to do with regular USB audio. You just spew the data, best effort.

OK, thanks, everybody. That's all the time we've got, unfortunately. Thank you again. Thank you all. Drew.