Hello and welcome to this talk on ChaosZone TV here at rC3 2020. This talk will be about FPGAs, field-programmable gate arrays, and it will be given by Pepijn de Vos about his experience in documenting the Gowin FPGAs. Have lots of fun and enjoy the talk.

Hello everyone. I will be giving a presentation about how to fuzz an FPGA, but the short answer is: it depends. So I will mostly be talking about my particular experience, and hopefully someone gets something interesting out of it and tries to apply it to their own interests, whether that is contributing to this particular project, starting their own, or contributing to another one.

I am Pepijn de Vos. I am a software developer and IC designer, and my mission is to make better open source software for chip designers. I have listed my website and a link to the project that I will mostly be talking about, which is about reverse engineering the software of the Gowin FPGAs, and I have also linked my Patreon and GitHub Sponsors page, which help me do this. So I want to thank all the people who are already supporting me, which is mainly Symbiotic EDA, where I did my internship on this project, and also these other kind people.

I will be talking first about some background on how an FPGA actually works on the inside, a bit about the open source tools that we will be working with, then about documenting these FPGAs through fuzzing and decoding, why I say documenting here and not reverse engineering, and then some results.
So an FPGA, well, programmable logic devices in general: you start with some hardware description language, mostly VHDL or Verilog, and then you do synthesis, which produces a netlist of logic blocks such as AND and OR gates and other things, and flip-flops, which are sort of the memory, the state of the program. The way this is implemented on an FPGA is through lookup tables and multiplexers. The lookup table is basically a piece of memory that stores the truth table of a certain logic operation, then there is this flip-flop for the memory, and multiplexers which are used to route these connections to neighboring tiles.

Inside the FPGA is a whole grid of similar tiles, with some exceptions for special purpose blocks for DSP and memory and those kinds of things. But the core is really logic tiles, which are generally a collection of lookup tables and flip-flops, which all together are a slice, and then there are these routing multiplexers that connect their inputs and outputs together. The way they connect is mostly by inter-tile wires, so this tile has connections to tiles one, two, four, eight tiles away in any direction. You can connect almost anything to anything, but not anything to everything. They try to optimize how much you can connect, because if you would connect everything to everything, it's a lot of multiplexers and it would take a lot of area. Already these multiplexers take up the majority of the area, so the FPGA designers really try to optimize how few multiplexers they can get away with.
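To make the "lookup table stores a truth table" idea concrete, here is a minimal sketch in Python. It models a 4-input LUT as a 16-bit integer and reads one bit out of it. The choice of `i0` as the least significant index bit is an assumption for illustration, not something stated in the talk, and the helper names are made up.

```python
def lut4(init: int, i0: int, i1: int, i2: int, i3: int) -> int:
    """Evaluate a 4-input lookup table whose truth table is the 16-bit `init`.

    Assumption: i0 is the least significant bit of the table index.
    """
    index = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3)
    return (init >> index) & 1


def truth_table_to_init(func) -> int:
    """Build the 16-bit table for any function of four 0/1 inputs."""
    init = 0
    for index in range(16):
        bits = [(index >> k) & 1 for k in range(4)]
        if func(*bits):
            init |= 1 << index
    return init


# The same "hardware" becomes an AND gate or an OR gate purely by changing
# the stored table.
AND4_INIT = truth_table_to_init(lambda a, b, c, d: a & b & c & d)
OR4_INIT = truth_table_to_init(lambda a, b, c, d: a | b | c | d)
```

Changing a gate from AND to OR is then only a matter of changing the 16 stored bits, which is why these bits are a natural first target when fuzzing.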
You also generally see that the more advanced FPGAs actually have fewer multiplexers, because they have better software. So let's now take a look at the software, at how the process works from VHDL to the bitstream that's loaded on your FPGA.

Of course the first step is the programming language, and then there are these synthesis tools. For the commercial ones it's Quartus, Vivado, ISE, and then there are the open source ones. The popular one is Yosys, which uses ABC for optimizations. This outputs a netlist in the various formats that are supported by the tools. Then there is a place and route step, where you map these logic elements to a specific location on the FPGA and connect them. In the open source tools this is separate from the step where you take the sort of assembly language of the FPGA and turn it into an actual bitstream, but in the commercial tools this is generally one step, where you just input your netlist and you get a bitstream out. So these are the things that need to happen if you want to have an open source tool flow for your FPGA.

The first step is synthesis, which in this case is Yosys. This is work, but it's just software. You can open the documentation from the vendor, you see all their primitives, you write synthesis for them, and it works. There's no digging into the vendor software involved, really. I don't want to get too much into it, but it's work, just writing software. For place and route, though, the vendors generally don't really tell you how exactly the FPGAs work inside, so it's a bit more difficult. That's why you'll see that Yosys has support for a ton of FPGAs and also other things like ASICs. So Yosys supports Xilinx, Altera and others, a lot of FPGAs.
Well, nextpnr is much harder. It started with Claire Wolf and Project IceStorm for the iCE40 FPGAs, which then expanded to the ECP5 FPGAs, also from Lattice, and these are the two main architectures currently supported by nextpnr. But there's work going on to expand this to, well, Gowin in my case, and there are some other projects that are also working on this. What nextpnr does not do is generate the actual bitstream, so we need this extra step where you take the sort of assembly language from nextpnr and turn it into an actual bitstream that you can program onto your FPGA. What this talk is mainly about is figuring out the bits you need to do the nextpnr part, where you place and route all these multiplexers and lookup tables and flip-flops that are inside the FPGA.

So to get started, step one: get the license. This is an unfortunate part of commercial FPGA development, as you have to get the commercial software. How this works can vary, but for Gowin you have to fill in a form, and this can be a slow process. From what I've heard it is preventing some people from using it, for example at a university, when you have your whole classroom wanting to work on FPGAs and you have to wait a week for your licenses to arrive. It's not a very nice process. So this is also where open source can be really advantageous.

The next step is to get an FPGA, or maybe you have one already. The nice thing about Gowin is that Sipeed released a really, really, really cheap FPGA board, so you can just spend five dollars and you have an FPGA. This is second because you don't immediately need it.
The first part is just getting the software, because only once you get the software working can you actually use the FPGA. Then you actually install the software, which is also not trivial in some cases, because there's software for Windows and Linux, but in reality Linux in the EDA industry means Red Hat Enterprise Linux. So if you are on Ubuntu or Arch Linux or whatever you have, you are probably out of luck. A trick that often works is to delete bundled libraries: the software ships fairly old C++ library files that don't work on modern Linux, so you just delete them from the bundle, then it uses your system libraries, and it generally kind of works.

The next step is really boring: just read the manual, try to use the tool as it's supposed to be used, and try to get as much information from it as you can, because everything that's already there you don't need to document yourself. So it's really a time saver to spend some time here.

The goal of this step is basically to find a way to take the lowest level input that you can generate and automate it, and then get as much information out of the generated reports as you can. For Gowin this means they have a Tcl shell, for which you can write scripts to input a netlist, Verilog, whatever, and then synthesize and place and route it. The synthesis tool produces a Verilog netlist, so the lowest level you can go in this particular tool is to place and route this netlist and constrain every single cell you put into it to be in a specific location, so you have a sort of deterministic output.

The positive side of this particular FPGA is that the bitstream output that goes into the programmer is in an ASCII format. I mean, it's not human readable, it's still binary, it's really weird, but the framing is already done for you, so you don't need to look at it in a hex editor.
The downside of this particular tool is that the other outputs are kind of useless. There's not much info on timing or routing or anything that you could extract useful information from. We'll get to that later.

Now on to fuzzing. Fuzzing an FPGA basically means you generate a bitstream, you change the tiniest, tiniest, tiniest bit of configuration, generate another bitstream, and then you compare them. Then you know: okay, this minimal change that I made resulted in this one bit flip, or whatever. So you can start to make connections about what the meaning of all these bits is, and then you repeat it, and repeat it, and repeat it. That's the basic idea.

So step one is you write a netlist, and here I wrote this Verilog netlist in the form that the Gowin IDE uses. It has the module and then this LUT4 primitive, which is a lookup table with four inputs. That is what Gowin uses, it's a LUT4 architecture. It has 16 bits, so two to the fourth, of memory where you put the truth table in, which is this INIT parameter that you can tweak. Then you just write it to a file. In the beginning I just wrote this really stupid bash script system, where you run sed on the file to replace this INIT parameter with an actual value, and you loop over all the combinations and see which bit changes.

The other thing you need is a constraint file, because you want to place this LUT in one particular location, so you know: okay, this location corresponds to these bits in the bitstream. Then you also update this location to whatever you want it to be, because in the end you want to find all the bits, of course. Then you make a Tcl script that reads all these files and runs the PnR, you get your bitstream and whatever other output you want, and then you look at the bitstream and see if you can make sense of anything.
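The sed-in-a-loop idea described above can be sketched in Python as a generator of netlist and constraint file variants, one per LUT bit. This only produces the text of the input files and does not run any vendor tool. The module text, the `LUT4` port names, the instance name `lut0` and the `INS_LOC` constraint syntax are assumptions for illustration and should be checked against the vendor documentation.

```python
# Hypothetical templates: port names and constraint syntax are assumptions.
NETLIST_TEMPLATE = """\
module top(input a, input b, input c, input d, output f);
  LUT4 #(.INIT(16'h{init:04X})) lut0 (.F(f), .I0(a), .I1(b), .I2(c), .I3(d));
endmodule
"""

CONSTRAINT_TEMPLATE = 'INS_LOC "lut0" R{row}C{col}[0][A];\n'


def make_variants(row: int, col: int) -> list:
    """Return 16 input-file variants, each with exactly one truth-table bit set.

    Every variant pins the LUT to the same location, so any difference
    between the resulting bitstreams should come from the INIT value alone.
    """
    variants = []
    for bit in range(16):
        init = 1 << bit  # only this one truth-table bit is set
        variants.append({
            "bit": bit,
            "netlist": NETLIST_TEMPLATE.format(init=init),
            "constraints": CONSTRAINT_TEMPLATE.format(row=row, col=col),
        })
    return variants
```

Each variant would then be written to disk and fed to the place and route step by a script, and the resulting bitstreams compared against a baseline.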
As I said, for Gowin it's relatively easy, because they have this ASCII grid of bits. There is this header, which is fairly constant. I just ignored it and hoped for the best in the beginning. It has some checksum in there. Then there is this giant block of bits that just get mapped to the FPGA. You see that on the left there is some padding, and then it starts with the actual bits. Then I found, or actually someone who looked at this before me found, that at the end it has a CRC check to ensure it is transmitted correctly. Then there is more padding at the end, and then there is a footer, which I left off here.

It's worth noting that for the bigger FPGAs, I've heard it's more like a command stream, so it's not as easy to change one bit and see the resulting bit, because once you start moving stuff around, the command stream also changes. So you need to dig somewhat deeper to understand the command stream before you can actually map bits to each other. But that's the basic idea, and for Gowin it was relatively easy. As I said in the beginning, it depends on the FPGA. There is no one guide where you do these steps and end up with a free open source toolchain. It's a lot of discovery and trial and error.

I wrote this little Python script that takes these bitstreams and makes a nice image out of them. Well, nice, it makes an image out of them, where you can see all these little lookup tables as the squares, with the flip-flops next to them. From there, this was basically a big NumPy array, and then you can just XOR two bitstreams and you get the difference bits, which is how I figured out the differences. So congratulations, this was your first fuzzing of a bit.
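As a rough sketch of the comparison step: the talk describes an ASCII grid of bits that is loaded into a NumPy array and XORed. The version below uses plain Python lists instead, to stay dependency-free, and assumes a simplified, hypothetical file layout in which every line made up only of `0` and `1` characters is a data row and everything else is header or footer. The real format has padding and a CRC that this ignores.

```python
def parse_bits(text: str) -> list:
    """Turn an ASCII bitstream into rows of 0/1 integers.

    Simplifying assumption: only lines consisting purely of '0'/'1' are data.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line and set(line) <= {"0", "1"}:
            rows.append([int(ch) for ch in line])
    return rows


def diff_bits(a: list, b: list) -> list:
    """Return (row, column) coordinates where the two bit grids differ (XOR)."""
    changed = []
    for r, (row_a, row_b) in enumerate(zip(a, b)):
        for c, (bit_a, bit_b) in enumerate(zip(row_a, row_b)):
            if bit_a ^ bit_b:
                changed.append((r, c))
    return changed


# Tiny made-up example: one bit differs between the two runs.
BASE = "// header line, ignored\n0000\n0100\n0000\n"
TWEAKED = "// header line, ignored\n0000\n0110\n0000\n"
CHANGED = diff_bits(parse_bits(BASE), parse_bits(TWEAKED))
```

Here `CHANGED` holds the coordinates of the single flipped bit, which is the piece of information one fuzzing run is meant to produce.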
Yay, only a million more to go. And this is kind of a problem: each PnR run takes a couple of seconds up to a minute, depending on what you're doing, or even longer if you're not constraining everything, of course. And as you can imagine, this FPGA grid is a lot of bits, millions maybe, or maybe less, as this is a relatively small FPGA. So if you take a minute per bit, you will be there a while. Even if you have a beefy multicore computer, it'll be slow. So you need to be a bit smarter about it to make some real progress.

I mean, with this you can already do some fun things. For example, you can tell the vendor tool to generate a bitstream, and then you can say: okay, this was an AND gate and now I make it an OR gate, just by manually flipping some bits, and then you can test it on your FPGA and see if it works. But if you want to get something really practical, you of course need to understand a lot more bits.

To make this more practical, there is the binary trick. It doesn't really have a name or anything, but the idea is this. Imagine you have these LUT bits. There were 16, but here only 8 fit on a slide. Let's say you want to find the location of each one. Normally you would flip one bit, run, flip another bit, run, and so on. Very slow: it would be 8 runs in this case. But what you can also do is assign a binary number to each bit, and then in each run flip the bits according to this binary number. So you can see A is off in all runs, then B is one, then C is two, and so on. This way you only need log2 of the number of bits as your number of runs, which is of course much more efficient.
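The binary trick can be sketched as follows. Since the vendor tool cannot be run here, `fake_vendor_tool` is a stand-in that maps each enabled feature to a hidden bit position; that function, the `HIDDEN` table and the position values are all invented for illustration.

```python
import math


def binary_schedule(n_features: int) -> list:
    """For each run r, list the features whose binary index has bit r set."""
    n_runs = max(1, math.ceil(math.log2(n_features)))
    return [[f for f in range(n_features) if (f >> r) & 1] for r in range(n_runs)]


def fake_vendor_tool(enabled: list, hidden_map: dict) -> set:
    """Stand-in for place and route: returns the set of bitstream bits set."""
    return {hidden_map[f] for f in enabled}


def recover(n_features: int, hidden_map: dict):
    """Recover feature -> bit position from only log2(n) runs."""
    schedule = binary_schedule(n_features)
    observations = [fake_vendor_tool(enabled, hidden_map) for enabled in schedule]
    seen = set().union(*observations)
    recovered = {}
    for position in seen:
        # The runs in which this position was set spell out the feature index.
        code = sum(1 << r for r, obs in enumerate(observations) if position in obs)
        recovered[code] = position
    return recovered, len(schedule)


# Eight features (A..H on the slide) with made-up hidden bit positions.
HIDDEN = {0: 42, 1: 7, 2: 19, 3: 3, 4: 88, 5: 61, 6: 15, 7: 30}
RECOVERED, RUNS = recover(8, HIDDEN)
```

With eight features this needs three runs instead of eight. Note that feature 0 (A on the slide) is never enabled, so its position is not recovered: it looks the same as any bit that is simply never touched.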
The problem with this approach is that not all combinations are unique. For example, there are bits that you're not testing, which will be zero all the time, so you can't tell these apart from A here. There are also some bits that are always on no matter what you're doing, and in this case H is also one all the time. So that's a bit of a problem.

The other problem is that some bits have weird relations to the single feature you are tweaking. It might not always be a one-to-one relation between the thing you are tweaking in your code and the thing that shows up in your bitstream. A simple example is a bit that is B or C, which now conflicts with D, so you wouldn't be able to tell D apart from a bit that depends on B or C. That's a bit of a problem. For example, it shows up in I/O banks. Every side of the FPGA has an I/O bank which can turn on and off, so if you're using any input/output buffer, any pin on a side, it enables this whole bank. If you use this binary trick, this bank will basically always be on, and you will never figure out what the bank bits are.

So you need to be slightly smarter, which is the balanced, constant-weight code. It's sort of a Hamming distance related thing, but the simple explanation is that for each number, the number of one and zero bits is always the same. In this case there are always two zeros and three ones in each number. You can imagine that if you take the AND of two of these numbers, or the OR, it will have fewer than three or more than three ones in it. So already you don't have a number with all zeros or all ones, and also logical combinations, there's some technical term for this, most logical combinations will have a unique code as well, so that they don't conflict with the other bits that you're testing. This is always unique, but it takes a few more runs than the binary trick, which is a fair trade, I think.
I'm not exactly sure what the complexity of this one is, but it's still better than straight one-bit runs. The only problem is this: imagine you are now fuzzing these input/output buffers. You're going through all of them, but you're always enabling at least one of them, for example, and this would mean you never see the bank going off. So you need what we call meta fuzzers, which sort of fuzz a collection of other fuzzers. Basically, you add extra runs that deliberately trigger these more complex relations.

You can write a check that says: okay, I found a combination that I don't understand, for example the AND of two bits, and then you give an error. Then you think: okay, what is this? You form a hypothesis and write a meta fuzzer that says: I expect this bit to be the AND or the OR of these other bits, so I expect it to have this pattern. Then you have this meta fuzzer that specifically triggers this pattern, so that you turn on these combinations of bits and more complex relations. This you also do with these constant-weight codes, but there's only one here, so you don't see it. So that's meta fuzzers. They are kind of tricky.

So that's an overview of fuzzing. A roadblock that I ran into with Gowin is that there's no control over or insight into routing. From what I've heard about iCE40, ECP5 and other FPGAs, usually you can either control where you want the wire to go, or at least you can inspect which routes the vendor tools chose, and either way you can then correlate these things. But it appears, as far as I know, that Gowin doesn't offer this kind of control. This makes it almost impossible to fuzz the routing, because you can't make a minimal change that changes the wire it uses. You just have to rely on the router to pick a different one and sort of push it a bit in the right direction, which is really a pain. So the solution is to
also look at the files provided by the vendor. So this vendor has this tool, and of course they also have data files that describe their FPGAs. They're binary files, not documented at all. The favorable thing about the Gowin IDE is that it doesn't have an end-user license agreement like most of the Xilinx, Altera kind of tools, which prohibit you from messing with them. I'm not legally saying anything about this, but this makes it kind of okay, maybe, to have a peek at them. So I complemented this fuzzing with looking at the files the vendor provided.

Basically the goal was to reverse engineer this file structure and write a parser for it, so you can extract the data from it. You can do this in several ways. You can just stare at the hex dump and try to make sense of it. My internship supervisor was actually a superstar at this, who could just stare at a hex dump and immediately see all sorts of things, and I was like, how did you do this? But I'm not such a superstar at this, so I went for other approaches first. You can run the program in GDB, set breakpoints on things, and see if you can extract some information from the memory of the program. You can also decompile the program and look at the assembly in something like Ghidra. I don't know how you say this.

So, an example in GDB. One thing I did was: you run the Tcl shell, you let it start up so you don't get all the startup noise, then you set a breakpoint on the fopen call and continue. You run the place and route, and then you continue, continue, continue a few times, and eventually you find this interesting function call where it reads the route node table. So there's some debugging information left in this library, which is convenient, and it points to this data file of the Gowin IDE. So I opened Ghidra and looked at this route node table class, and it turns out that it reads this file straight into a struct. No encoding, no decoding: struct to disk and
struct from disk. But it had all these getter and setter methods that are exported as symbols, so I could just take the setter name and the address it was pointing at, and this would directly correspond to data in this data file. It was a bit laborious going through the data file, getting all the setters and making it into a little Python script, but in the end you get a Python script that can extract these things and at least give some names to them.

The other fun thing I did was to write automated GDB scripts. You normally use GDB interactively, typing into it, but you can also just tell it to load scripts. This script in particular breaks on another function that reads another type of data file, and then it breaks at every read from this file and tells you the offset into the file, how many bytes were read, and the function they were read from. So you can match these functions to the addresses, and so on. This particular file was more like an archive with tables and different things in it, more like an actual file structure, so that was a bit more involved. But then you write a parser for it and you can extract this data.

Unfortunately, data is not equal to meaning. You have this bunch of binary numbers and you don't know what's going on. They also used some interesting techniques, for example data encoded as the decimal digits of a binary number. Interesting, but okay. You can look at this data, and you have some names from the exported symbols. The simple ones are the LUT bits: you know, bit one is this. The bigger challenge was routing, which was the key thing that I wanted to do this for. In the end I managed to figure out and extract the routing information from this FSE file, but it wasn't a completely brilliant solution. For example, the I/O buffers tend to be very complex in FPGAs. They do a lot of different voltage levels, different modes, different everything. And then I start making
sense of all these random fuses and things, and it turns out they're also different per FPGA, and I was like: oh my god, okay, this doesn't make any sense. So then I went back to fuzzing.

The nice thing is that once you have the chip data, the fuzzing becomes a lot easier, because you kind of know what you're looking at. From these data files you can extract the tiles, and you know: okay, all these tiles are the same type, and you know their boundaries and their size. So you don't need to duplicate all the fuzzing work for each and every tile. You can fuzz each tile type separately. This leads to some simplification and speedup, where you can fuzz per tile, basically per tile type. So in this case the LUT bits and the basic routing muxes come from the vendor data directly, no fuzzing involved, but the IOBs and the D flip-flops are fuzzed.

This was my third fuzzer. First there was the bash script, then I did the binary tricks, and then I had this tiled fuzzer after decoding the chip database. What I did here is: you take all the modes that you want your flip-flop to be in, for example. So you say: okay, I have tile type 12, and for each flip-flop type you put one in every tile, run the PnR, and then you go back to the tiles and see all the different modes that this particular flip-flop can be in. You do this per tile type, and you can go much faster. You don't have to do all these binary tricks.

What you do have to do is write logic for combining as many different fuzzers into as few runs as possible, to optimize for speed. This can be quite confusing and complicated. I think the limiting factor here was the I/O buffers, because as you can see, in the middle there are a few tiles that are only of one specific type. So you need all the types of IOB that you want, you need to do it for this tile type, and there's
only one of them, so it's kind of slow. But it's still more efficient than all these binary tricks, because you only have to do it once per tile type, basically.

The hard part was the clock routing. I didn't talk too much about this yet, but in an FPGA you have the inter-tile routing, which is these wires to neighboring tiles, but there's also global routing, which is generally used for clock trees and resets and other high-fanout signals that you want to reach across the whole FPGA. In the Gowin there are eight of them, and their muxes are in the chip DB, so in this sense you have the basic data. But for the inter-tile routing the names are kind of obvious: there's north, two tiles, number three, so it's the third wire. The information is sort of encoded in the name. For the clock routing it's not like that. It's also very irregular.

If you look here, these horizontal lines are spines. They go from the center, where all the global signals come in from the PLLs and input pins, and then they spread out across the spine columns. In each spine column there is one multiplexer that connects the spine to a tap, which is just a vertically running wire. But there's only one tap per column, so you have to figure out which column is connected to which spine. You can see here it's sort of one, two, three, four, one, two, three, four, one, two, three, four, but you don't know that up front. You have to write fuzzers that basically take up the whole FPGA, where you scan flip-flops across a row to see which spine they get connected to.

Then there are these local horizontal branches, as they're called. They spread out a few tiles from the tap, but it's kind of irregular how far they spread and which tap they connect to. So again this fuzzer has to sweep flip-flops across in several hundreds of runs to see which row connects to which branch, and then to which tap and which spine. And in
this picture I drew the four primary ones, and there are also four secondary ones, and it's a big mess. So this was some recent work that I did to improve the global clocking, and I'm now working on the nextpnr part to incorporate that.

After you've done all this fuzzing and file decoding, you can figure out the tile format. You can see here at the bottom, eight rows, or four maybe, I think four actually, well, a few rows, there are the LUTs and flip-flops, and then everything above that is just multiplexers. It's like 80% multiplexers. Even if you knew that there are lots of multiplexers, this was still kind of an "oh wow, okay" moment, to see how much it is. This is just a picture I generated from this fuse file, where you color everything differently and put the labels on top. Pretty picture, not that insightful.

Then you can start to generate placements and routing for the stuff that you decoded, and you have a fully open source FPGA toolchain. This particular one is running a RISC-V core that is calculating primes, and it's running on this Gowin board from Trenz Electronic that's also lying behind me here. And that's my story. This is Project Apicula, it's on GitHub. You can check it out, contribute, start your own FPGA reverse engineering project, or join some other one. Well, I hope you enjoyed it, and thank you for listening. I think there will be a Q&A after this.

So this concludes the talk. Thanks very, very much, Pepijn. Very, very interesting, and of course there are questions. One question that comes from the IRC channel is: does the Gowin software provide a simulator from which useful data, like timing, could be extracted by just observing the simulation process?

Yeah, so, thank you. They do not provide their own simulator. They do provide behavioral models for Verilog, so you can
take your Verilog models from them and simulate them, but that's behavioral simulation, which doesn't include timing data. There are some encrypted models floating around somewhere, I think, but those are of course encrypted, so you can't easily extract timing data from them. So then it's easier to just decode the timing database that they also have.

Right. So, because I didn't see another question: thanks again for your great talk. I see that you have a talk right now as well on the same channel. And for the people who still want to ask questions to Pepijn: just go to the rC3 ChaosZone channel and ask your questions right away, and Pepijn will try to answer them as well. Thanks again and see you soon. Bye bye.