I want to answer three questions about spispy, which is my open source SPI flash emulator. The big questions are: what is the SPI flash, why do we want to emulate it, and how are we going to do it?

So the what is the easy one. SPI flashes are these small, non-volatile memory chips on servers, laptops, and most other computers, typically in 8-pin SOIC packages. Pulling up the datasheet for them, we can see that SPI is the Serial Peripheral Interface. It's a really generic term for basically a three- or four-wire connection. They tend to be pretty small, only about 16 megabytes or so. And frequently they're called the boot ROMs. That's a bit of a misnomer, since they're actually flash memory. These have replaced the 64 kilobyte ROMs in the old machines, and that's allowed us to bring a lot more complexity into the firmware of our systems. You might be familiar with the closed source UEFI firmware on most computers, which typically gives you windowing and networking and a bunch of other functionality before the OS is loaded. There are also open source firmwares like LinuxBoot and coreboot that allow people to replace the firmware in those SPI flashes with free software that they can read and modify to suit their own needs.

So why do we want to emulate the SPI flashes? Well, if you're working on coreboot or LinuxBoot, or if you're doing security research into UEFI, you end up having to reflash the chips quite frequently. And really, you end up doing this a lot. In my career doing firmware research, I've spent a lot of time hooking up flash programmers and waiting for the chips to flash. And you might think, well, it's only 16 megabytes. How long could it take to write this into the flash memory? The problem is that these chips are designed so that you have to write them four kilobytes at a time, and you have to erase each sector before you can write it. And in the worst case, it takes 120 milliseconds to erase each sector.
So if we take 120 milliseconds times 16 megabytes divided by the 4K sector size, we're talking about eight minutes to write one of these chips with new contents. And that's a really painful cycle time.

So to give you an example of what my usual day was with this sort of system before spispy: you've finally finished building coreboot. You've got to turn off the power to the machine you're going to flash, because the SPI bus doesn't allow multiple devices to drive it. Then, typically, you have to attach the flash programmer each time, because you can't leave it connected due to loading and capacitance issues. And then you can finally run the flash program to start flashing. The first section goes really fast, but that's because you haven't changed the Management Engine region at all, so it can just skip that. And then it hits the parts where it has to start erasing and rewriting. And yeah, the minutes go by. It's still 28 kilobytes a second. Five minutes in, at some point, you're ready to go off and do something else. So when it's finally done, after five and a half minutes and an average speed of 49 kilobytes a second, you're not actually done. You now have to remove the flash programmer and power the system back on. And then you've got to do it all over again when you realize that you made some stupid mistake in building the image or something. So this is really painful.

And as they say in the infomercials, there's got to be a better way. And that's what the spispy device is for firmware developers. You attach it once when you start doing firmware development, and then you leave it attached. When you go to upload the new coreboot image, and we're going to play this in real time as we upload the image, it's able to copy the new coreboot into the DRAM on the spispy, limited basically by the speed of the USB. So 12 seconds versus five and a half minutes, that's not too bad.
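
The worst-case erase estimate from a moment ago can be checked with a minimal sketch. It assumes binary megabytes and that every sector needs the full worst-case erase time; the constant and function names are illustrative and do not come from the talk.

```python
# Worst-case erase time for a 16 MB SPI flash with 4 KB sectors.
# Assumption: every sector takes the full 120 ms worst-case erase.
SECTOR_ERASE_S = 0.120            # worst-case erase time per sector
FLASH_BYTES = 16 * 1024 * 1024    # 16 MB chip (binary megabytes assumed)
SECTOR_BYTES = 4 * 1024           # 4 KB erase sector


def worst_case_erase_seconds(flash_bytes=FLASH_BYTES,
                             sector_bytes=SECTOR_BYTES,
                             erase_s=SECTOR_ERASE_S):
    """Return the time to erase every sector once, in seconds."""
    sectors = flash_bytes // sector_bytes
    return sectors * erase_s


sectors = FLASH_BYTES // SECTOR_BYTES
total_s = worst_case_erase_seconds()
print(f"{sectors} sectors, {total_s:.0f} s, about {total_s / 60:.1f} minutes")
```

This only counts erase time; page programming and transfer overhead come on top, while unchanged regions that the programmer skips bring the real figure down.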
And we can actually do a little bit better, because right now we're limited by the USB serial, the abstract control model or ACM. But this is still just such an improvement. The other advantage is we can now do a soft reboot into that new firmware. We don't have to get out of our chair to remove the programmer or to turn the power supply on and off. So it's a huge, huge improvement.

The other thing spispy gives us is insight into what happens when the computer boots. If you've just come to camp from the 1970s, you might think x86 still boots in real mode, reading from the top of the flash memory. But it turns out that is not what happens on modern CPUs at all. Since spispy is watching the bus, it's able to give us a sort of tcpdump-style list of everything that is read from the flash. We can see that the Platform Controller Hub reads something from the Intel Flash Descriptor first, from offset 0x10. We can see the Intel Management Engine read and validate its firmware. We can see that something in the x86, maybe some boot code or some boot ROM or perhaps microcode, reads and parses the FIT, the Firmware Interface Table, which contains pointers to microcode updates that the x86 then loads. And then it's still not time for the reset vector. It jumps into Boot Guard or some of the other secure boot methods. And if those validate the signature, finally, the reset vector gets called, and then that does a jump into the BIOS or coreboot or whatever. This is a lot of really useful insight.

And we can also plot this data. So if we make a plot of the addresses versus the order in which they're read, and color them blue for the first time an address is read, this is typically when something like Boot Guard or the Management Engine is doing a signature check. So we would call this time of check. And then we can also color any addresses that are reread from the flash. And we can see that there are quite a few other time-of-use reads.
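
The coloring used in that plot can be sketched as a small classifier over a read log. The log format here, a list of `(address, length)` tuples in bus order, is an assumption for illustration and is not the actual spispy output format; `classify_reads` is a hypothetical helper.

```python
# Label each flash read as a first read ("time of check" candidate)
# or a reread ("time of use" candidate).
# Assumption: the log is a list of (address, length) tuples in bus order.

def classify_reads(reads):
    """Return (address, length, label) tuples, label is 'first' or 'reread'."""
    seen = set()      # every byte address that has already been read
    labeled = []
    for addr, length in reads:
        span = range(addr, addr + length)
        # A read counts as a reread if any byte in it was fetched before.
        label = "reread" if any(a in seen for a in span) else "first"
        labeled.append((addr, length, label))
        seen.update(span)
    return labeled


example_log = [(0x10, 4), (0x1000, 16), (0x10, 4), (0x1008, 4)]
labels = [label for _, _, label in classify_reads(example_log)]
print(labels)
```

Tracking individual byte addresses in a set is fine for a short log; for a full boot trace an interval structure or a bitmap sized to the flash would be a more practical choice.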
This TOCTOU issue can turn into a security vulnerability, which is what Peter Bosch and I demonstrated at Hack in the Box earlier this year. We were able to bypass Intel's Boot Guard using technology similar to spispy.

So that's the what and the why. And because this is a technical conference, let's get deep into how it actually works. I want to point out that I say it's my project, but it's built on a lot of really wonderful other open source projects, including Yosys, Project Trellis, and nextpnr, which were just talked about in the previous talk here. This group has produced an open source toolchain for FPGAs that's created an entire new ecosystem of programmable hardware. I also had some collaborators from RevSpace, Alyssa Milburn and Peter Bourne, who worked with me on the initial FPGA implementation, which is what we used for the Hack in the Box demo. That was built on a much smaller FPGA, the iCE40 UP5K, which was also mentioned in the previous talk. This one has one megabit of block RAM. That's not enough to store an entire flash image, but it stores enough for the time-of-check, time-of-use vulnerability.

But we knew we were going to need a lot more memory. Luckily, there is a really neat open source hardware project out of the Radiona hackerspace in Croatia. Emard has built this one on the Lattice ECP5, and it includes a 32 megabyte SDRAM that we can read and write at around 250 megabytes a second. So we're able to benefit from the fact that this has the schematics published and it works with the open source toolchain. And we can use this as the building block for our system.

The next hurdle is that SDRAM is really complex. It's filled with a lot of dark magic inside the state transitions that you have to maintain. That's not exactly the state diagram, but it looks pretty similar to the real one. And I don't know about you all, but I don't want to have to understand DRAM at that sort of level. And luckily, I don't.
Stefan Christensen published, under a very permissive license, an SDRAM controller that we were able to very quickly adapt to the open hardware, the ULX3S, and build with the open source FPGA toolchain. And that gave us a huge jumpstart. The other thing is that most open source projects build on things that other people have published. In this case, scanlime had already done a similar project for emulating Nintendo DS save games, and she published all of her source code. So that was really helpful to learn about how the SPI bus works and how to interface with it.

So let's do a quick dive into how the SPI bus works. As I mentioned, there are these 8-pin SOIC chips. And we can typically find in the datasheet the pinout and a timing diagram. There's a red dot that indicates pin 1. And we can then find the power and ground. We definitely don't want to mix those up. The chip select line is the next one that's important. This is used by the Platform Controller Hub or the x86 to tell the chip that it wants to talk to it. And it goes low for the duration of the transaction. So we would call this an active low signal, and typically designate it with either a bang or a hash in the name. The clock is also generated by the x86 and fed into the SPI flash. On the falling edge of the clock, the values change. And then on the rising edges, they need to be stable. So we would call this a rising edge clock signal. The serial in pin comes from the x86 into the chip and contains the command bytes that are going to be processed by the chip. And at the end of the command, the flash will write its output onto the serial out pin. So these four pins are the ones that we need to control.

And we need to understand what the command bytes look like. The datasheet lists a lot of commands, and the one that we most care about is the normal read. So that's command three.
And it tells us that there are then going to be three address bytes that come after it, and that the chip will output data bytes until CS goes high. So we pull up the timing diagram from the datasheet. We can see there's the command three on the serial in pin, followed by 24 address bits, most significant bit first, followed then by typically up to about 256 response bytes.

In the TOCTOU version on the iCE40, we basically said, when we've received all 24 bits, read something from the block RAM and send it out the SPI port. But that worked there because block RAM is available in a single cycle. When we tried this with the DRAM, it didn't work. The real SPI flash is in blue. And you can see that the first bit is delayed by about 50 nanoseconds before it takes on the correct value. The reason for that is SDRAM isn't just an array that you can read from. You have to activate a row. Then you have to wait some number of clock cycles. Then you can send the column that you want to read from that row. And you have to wait some additional clock cycles. This total latency can be five to seven clock cycles, typically 50 to 80 nanoseconds. It's really fascinating that even if we went to faster memory, this random read time doesn't change, because the CAS latency goes up or the row activation latency goes up. Even with 2400 or 2.4 gigahertz RAM, it's still about 50 to 100 nanoseconds. And the problem is that we need to have that result ready roughly half a SPI clock later. The SPI clock is around 20 megahertz, so that's about 25 nanoseconds in which we need to produce a result.

What's wonderful about working with FPGAs is we're not limited by things like byte boundaries or the limitations of traditional programming languages. So we can start the row activation as soon as we've received 14 bits of the address. We can then start the column activation once we've received another nine bits of the address. We then get 16 bits back from the RAM.
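
The address split described here can be sketched in software. This is only a model of the bit slicing, not the FPGA logic itself: it assumes the first 14 address bits received form the row, the next 9 the column, and that the one remaining low bit picks a byte out of the 16-bit SDRAM word. The function names are hypothetical.

```python
# Model of slicing a 24-bit SPI read address into SDRAM row/column parts.
# Assumptions: 14 row bits arrive first (MSB first), then 9 column bits,
# and the final bit selects one byte of the 16-bit word that comes back.

def parse_read_command(cmd: bytes) -> int:
    """Decode a normal read: opcode 0x03 followed by a 24-bit address."""
    if len(cmd) != 4 or cmd[0] != 0x03:
        raise ValueError("expected 0x03 plus three address bytes")
    return int.from_bytes(cmd[1:4], "big")   # most significant byte first


def split_read_address(addr24: int):
    """Return (row, column, byte_select) for a 24-bit flash address."""
    row = (addr24 >> 10) & 0x3FFF    # first 14 bits received
    column = (addr24 >> 1) & 0x1FF   # next 9 bits
    byte_select = addr24 & 0x1       # last bit, known only at the very end
    return row, column, byte_select


def half_spi_clock_ns(spi_hz: float) -> float:
    """Time budget between the last address bit and the first data bit."""
    return 1e9 / spi_hz / 2


addr = parse_read_command(bytes([0x03, 0x00, 0x00, 0x10]))
print(split_read_address(addr), half_spi_clock_ns(20e6))
```

Because the row is known ten SPI clocks before the address is complete, and the column one clock before, the slow SDRAM steps can begin early and only the final byte selection has to fit inside the 25 nanosecond budget.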
And we've been able to overlap all of this with the SPI transaction. So we can use that last bit to select either the upper or the lower byte of the 16. And this actually works. We're able to produce the first bit only a few nanoseconds slower than the real flash. At 20 megahertz, we meet timing, and we can convince the PCH, the Platform Controller Hub, that the flash is good when it sees this 5A A5 data, which is at offset 0x10 in our flash. Xeno and John Butterworth tell us in their advanced BIOS training that this is the signature that the PCH is looking for to identify the flash. And if you don't have it, the system won't start up at all. There's just no sign of life whatsoever from the machine.

So we can convince the PCH to start up, but sometimes the Linux kernel or Boot Guard or other things fail, typically with some sort of page read error. The reason for that is another complexity in the DRAM. DRAM is actually built out of lots and lots of capacitors, and those capacitors are slowly discharging. So it's necessary for the SDRAM controller to periodically refresh each row. This means that every 7.8 microseconds, the SDRAM controller will prevent any reads or writes from happening and start a row refresh, and then nothing can happen for about 60 nanoseconds, which is going to completely blow our timing. Luckily, our SDRAM controller is open source. So we're able to modify it to add something that will inhibit the refresh when we're in a timing-critical section during a SPI read. And with all of those hacks together, this actually works: we're able to boot all the way into Linux.

So you've seen in a lot of the photos that we're using these solderless chip clips from Pomona. And you might be asking, how do we prevent the real flash chip from responding to these requests?
It turns out that on most mainboards, but not all, so definitely check yours before you try this, the CS input on the SPI flash goes through a small series resistor, so that it's sort of buffered from the PCH's CS output. This resistor means we can build essentially an OR gate, where either the PCH or our device can drive that line high. So schematically, it looks something like this: when a SPI transaction starts, the PCH or the x86 drives CS low, which wakes up the SPI flash, and it will start to send data back on serial out. If the FPGA wants to take over this transaction, it can drive the line high, which will turn off the SPI flash. So the SPI flash will turn its serial output line driver off, and then the FPGA can drive the serial output line to send the data in reply to the PCH. The drawback to this is that we now don't know when the transaction is over. So we've added another hack, which is that the FPGA is watching the clock line, and when it sees that some number of nanoseconds have gone by with no clock transitions, it goes ahead and tri-states its CS output, which then returns the bus so that the PCH can take it over again next time.

So it's a huge pile of hacks, but it all works. We can boot quite a few laptops. We've done it on a lot of servers. And on pretty much every system we've tried it on, we've found interesting TOCTOU vulnerabilities. We've also been able to accelerate the firmware development for the groups that are using these machines.

One area where we really want to do some research is supporting other architectures. On that server, for instance, there's another SPI flash right next to the x86's that stores the firmware for the ARM BMC, the board management controller. And if you want some more information about that, you can watch my CCC talk from last year about the Supermicro BMC hacks.
Unfortunately, the ARM uses some of the read commands that we don't currently support, which brings us to this: we would love for you all to get involved in the project. We could add some of the things that ARM needs, like these different read commands, and we can definitely improve the UI. It's very programmer-centric right now. We'd also love to switch to either USB mass storage, which was mentioned in the last talk, or perhaps USB Ethernet. And there are also a lot of other buses in the system that we can apply this to: the LPC, where the TPM lives; eSPI, which is being used by things like the Apple T2 coprocessor; and also things like the MMC for doing firmware loads on embedded devices.

So hopefully, this has answered your questions about the what, why, and how of spispy. If you want to get involved, you can check out all the source from our GitHub tree. We have a fairly active Slack channel as part of the Open Source Firmware Slack, and you can also find me on Mastodon or Twitter. And with that, I'd love to take any questions that you all might have about the project.

Apologies about that. There's about 15 minutes left for Q&A, so go ahead.

RAM chips that will work, are they all too small?

Most of the SRAM chips tend to be pretty small, and it's also very expensive to build 32 megabytes of SRAM these days. For scanlime's project, she used SRAM because she only needed to emulate, I think, 128 kilobytes of storage.

Hi, thanks for the talk, very interesting. Usually there is a command to erase the whole flash, which is much faster than sector erase. Was that an option, or is it still too slow?

It's still pretty slow because, well, a few reasons. Okay, so the whole chip erase command here is 80 seconds. Typically that is not an optimization, because a lot of the chip hasn't changed. That would erase the Management Engine section, for instance, which, unless you're working on the security of the ME, is probably going to be the same between flashes.
So a full chip erase would require reprogramming all of that. And the page program time is 1.5 milliseconds per 256-byte page, so I think overall it's a loss.

And the second question: serial NAND flashes, is that a thing in these controllers?

Typically not, and I'm not sure why. There is a lot of interesting innovation happening in some of the other systems. For instance, Apple is using eSPI to boot their systems now, and they have a security coprocessor that is acting much like the spispy, as an interposer between the x86 and the firmware. So the T2 is able to do all the signature validation prior to releasing x86 reset and allowing it to read that data.

Okay, next question.

Hi, you had an iCE40-based design before you moved to this design. What type of performance were you able to get out of that older version, and what precipitated the move to the ECP5?

So the iCE40 was able to keep up just fine with the 20 megahertz SPI bus. It seems that the ECP5 hardware has a much easier time of meeting the timing requirements. We need to run at at least six times the SPI clock speed to interface with the SDRAM, because we need to be able to overlap all of the SDRAM row activation and CAS latency. We couldn't get the iCE40 to ever time even close to that. It was a challenge getting the iCE40 to time at 48 megahertz, which was just enough to keep up, due to, I think, the two-clock clock-domain-crossing delay.

Is the access pattern predictable, so that you could maybe use caching?

So we looked into trying to do that, and it is predictable in the case of a given firmware, but not necessarily if you're doing firmware development, where you're changing that quite frequently. The other thing that we looked into was using the iCE40 UP5K with one megabit of block RAM. We could cache every first bit for an eight megabyte ROM.
Unfortunately, that's not quite big enough for a 16 meg ROM, and a lot of the systems that we're looking at now have 32 meg SPI flash chips. And we really didn't want to move to any of the proprietary FPGAs. I'm really appreciative of the work on Yosys, Project Trellis, and nextpnr. It's really made FPGA development so much nicer.

Oh, all right. So you might've talked about it, I might've missed it, but you talked about graphing the existence of TOCTOU bugs. Were you ever able to implement them in the firmware and do differential swapping out of the firmware in the toolchain?

Yes. We were able to defeat Intel Boot Guard through that. Peter Bosch and I gave a talk at Hack in the Box about it that goes into a lot of the details. That particular vulnerability had to do with the fact that there's a brief window of opportunity when the CPU transitions from cache-as-RAM mode to executing from DRAM: they have to disable caches, which means the next instruction gets fetched from the flash, and that instruction then turns the caches back on. So we had a one-instruction window that gave us a total Boot Guard bypass.

Okay, any further questions?

I'm sorry, I might have missed it, but does the project support other instruction sets other than the SPI flash which you demonstrated, like from Winbond or any other manufacturers?

It supports a lowest common denominator of SPI flash commands. Basically the JEDEC read ID command, the 0x03 normal speed read, and the Serial Flash Discoverable Parameters, SFDP, which modern Intel chipsets require to be able to boot from the SPI flash. It also has preliminary support for writes and page erases, although right now we're kind of winging that. Patches are always welcome.

Okay, next question.

You've described that some mainboards have the series resistor, which allows you to override the chip select. What would you do if a mainboard doesn't have one?
If the mainboard doesn't have one, it's necessary to desolder the chip select pin, pin number one, from the flash on the board, bend it out of the way, and then put a jumper underneath it, so that you can essentially break the connection between the flash chip and the mainboard.

Any additional questions? Signal Angel, none? Okay. Well, in that case, we're done. Thank you all.

All right, thank you all.