Okay. Hi, I'm David. I'm a student at Imperial College London and an engineer at Symbiotic EDA. I'm going to talk today about open source FPGA tools. Just to get a quick idea of the audience: how many people have used an FPGA before? Oh, cool. More than I was expecting. Awesome. So for people who haven't: what is an FPGA, a field programmable gate array? It's programmable digital logic. If you look inside a modern commercial FPGA, there are two really fundamental elements. The first is the lookup table, which has maybe four or six inputs and one output, and you just tell it what the binary value of the output should be for every possible combination of inputs. The other fundamental element is the D flip-flop. That's one bit of storage, and it's used to build sequential circuits that do more than just combinational logic. All those elements are connected together by user-programmable switches. So effectively you have a big device full of programmable logic on which you can build any logic circuit you want. It's a much cheaper way to do your own logic than taping out your own ASIC, while being a lot neater than putting a load of logic chips on a breadboard. This is all configured by something we call a bitstream. It is literally that: a load of bits that set up all this functionality, set up all the wiring, fill the lookup tables, et cetera. Unfortunately, and this is the sad bit, most FPGA development uses the officially provided closed source tools, and the companies that sell FPGAs don't tell you how these bitstreams work. It's not like a microcontroller, where you get a nice big document telling you what all the registers do. There's nothing like that for an FPGA. So I'm going to talk about one particular FPGA, the Lattice ECP5, which is what I've been working on recently. That has up to 85,000 logic cells.
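To make the lookup-table idea concrete, here is a minimal Python sketch (the function name and init encoding are my own illustration, not any real tool's API) of a 4-input LUT: the 16-bit init value is just the output column of the truth table, indexed by the four inputs.

```python
def lut4(init, a, b, c, d):
    """Evaluate a 4-input lookup table.

    init: 16-bit integer; bit i is the output value when the four
    inputs, read as the 4-bit number d<<3 | c<<2 | b<<1 | a, equal i.
    """
    index = (d << 3) | (c << 2) | (b << 1) | a
    return (init >> index) & 1

# Example: a 4-input AND only outputs 1 when all inputs are 1,
# i.e. only bit 15 of the init value is set.
AND4 = 0x8000
print(lut4(AND4, 1, 1, 1, 1))  # -> 1
print(lut4(AND4, 1, 0, 1, 1))  # -> 0
```

Any 4-input boolean function is just a different 16-bit init value, which is exactly why the bitstream only needs to store those 16 bits per LUT.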
A logic cell is the standard element of an FPGA: one four-input lookup table, one D-type flip-flop, and some dedicated carry logic for building fast adders. As well as those basic elements, the chip has 3.7 megabits of block RAM; those are bigger, 18-kilobit blocks of RAM that you can use to build larger memories for processors, for caches, all that kind of stuff. It also has 156 18×18 multipliers, which are really useful for things like signal processing and video effects. I think the vendor's intended market for this FPGA is things like small 4G base stations. And finally, it has what we call SerDes, at three gigabits per second or five gigabits per second, for doing fast interfacing: PCI Express, USB 3.0, those kinds of things. And it's cheap as well: pricing starts from $5 for the nominally 12K logic element device. That's quite a bit of FPGA for your money. It's bigger and cheaper than the iCE40 FPGAs that open source tools have supported in the past. Looking at the architecture of this FPGA in a bit more detail: the chip is split up into what we call tiles, and each of those tiles is split up further into four slices. A slice is two lookup tables and two flip-flops; it can also be configured as a 16×2 RAM plus two flip-flops. And there are some two-input multiplexers for connecting those lookup tables together to build bigger lookup tables. This is quite similar to slices in lots of other modern FPGAs; the Xilinx FPGAs have a very similar structure, they just use six-input LUTs rather than four-input ones. Then you have a large number of fixed interconnect wires inside the chip, and those are connected together by what Lattice calls arcs, but which in the open source world and in the Xilinx world tend to be called pips, standing for programmable interconnect point. And unlike some earlier FPGAs, all the wires and arcs are unidirectional, so signals only ever go in one direction.
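The trick of using a two-input mux to combine two 4-LUTs into a bigger lookup table can be sketched as follows (a toy model under my own encoding assumptions, not vendor documentation): the fifth input drives the mux select, choosing which 4-LUT's output passes through.

```python
def lut4(init, index):
    """Evaluate a 4-input LUT: bit `index` of the 16-bit init word."""
    return (init >> index) & 1

def lut5(init_lo, init_hi, index):
    """A 5-input LUT built from two 4-input LUTs plus a 2:1 mux.

    Bit 4 of `index` (the fifth input) is the mux select: it picks
    whether the low-half or high-half 4-LUT drives the output.
    """
    low_index = index & 0xF
    if index & 0x10:
        return lut4(init_hi, low_index)
    return lut4(init_lo, low_index)

# A 5-input AND: only index 31 (all inputs 1) gives 1, so the high
# half holds a 4-input AND (0x8000) and the low half is constant 0.
assert lut5(0x0000, 0x8000, 31) == 1
assert lut5(0x0000, 0x8000, 30) == 0
```

The same cascading idea extends to 6-LUTs and beyond, which is why slices group LUT pairs with dedicated muxes rather than leaving them fully independent.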
So if you actually look at how the connections work, all the programmable connections are basically just multiplexers between wires. And finally, you have a dedicated global clock network: the clocks of all the registers have special wiring going to them, because your clock needs to go to almost every point in the device and you don't want to be using general routing for that. If you do a plot of all the tiles, and you can find this inside the open source bitstream documentation, this is what it looks like. This is the smallest ECP5; FPGAs become quite big quite quickly. Obviously this is not much use at this scale, but you can start to see some rows of RAM, a big sea of logic tiles, and IO around the edges. You've got the DSP in yellow, the IO at the edges, and finally the high-speed SerDes at the bottom. Zooming into a much smaller part of that: there's the IO around the edges, and there are the logic tiles, which contain both the logic functionality and the interconnect, the programmable connections between wires. There's actually a bit of an anomaly here, because when you look at, for example, a RAM tile, a RAM tile doesn't have any interconnect in it. It just has the functionality for the RAM, and then you have a separate tile called a CIB which contains the interconnect. The other interesting anomaly is that they split the device up into a kind of grid system, but they don't follow that grid very well, because in some cases you end up with four or five tiles all at a single location. These tap-drive tiles are part of the global clock network. So I mentioned that it's a unidirectional architecture made up of multiplexers. Once you've looked at the bitstream, you can start to get an idea of what the circuits on the silicon would look like. This is basically what programmable interconnect in an FPGA looks like.
These bits zero, one, two, three, four are bits in the bitstream. They select one of these six signals coming in to drive the signal going out; that might connect, for example, to the input of a logic function, and these would be your general routing throughout the device. You always have this cascade of two multiplexers, which works out to be quite an efficient way of doing this in the silicon. You can, in fact, create a short circuit: for example, if you enabled bits one, three, and four, you would actually end up with an internal short circuit. In theory, if you enabled enough of these, you could possibly get the chip to de-solder itself, as legend goes. It's not something I've tried, and it's hopefully something that will never happen, but it could be a fun experiment. So, the open source tools: where are they at the moment? We have bitstream and routing documentation for almost the entire functionality of this chip. The only things we're missing at the moment are some of the more obscure modes of the digital signal processing blocks, but that's something to look at a bit later on. We have documentation of all the internal delays, the timing for the core fabric, the logic cells, input and output, and RAM, and we can use that in a timing-driven nextpnr flow supporting the majority of functionality, going fully open source all the way from Verilog to bitstream. As for the open source bitstream documentation: this is part of the documentation for a logic tile, an overview of what all the bits do. For example, you have the LUT bits at the bottom, routing bits for the D, B, C and A inputs, vertical routing bits, and some miscellaneous input bits. If you scroll down a bit in the database, you start to see individual configuration bits for particular functionality. This is a routing multiplexer connecting from some long-distance wires and the LUT outputs onto another long-distance wire.
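As a rough illustration of that two-level mux, and of how enabling the wrong pair of bits shorts two signals onto one wire, here is a toy Python model. The grouping of bits into stages is my own hypothetical simplification, not the actual ECP5 topology from the Trellis database.

```python
def route6(config_bits, inputs):
    """Simplified 6:1 routing mux as a cascade of two stages.

    Bits 0-2 pick a "column", shared by two 3-input first-stage
    muxes; bits 3 and 4 each connect one first-stage output onto
    the final wire. config_bits is the set of enabled bits (0-4),
    inputs the six candidate signals.
    """
    col_bits = [b for b in config_bits if b in (0, 1, 2)]
    grp_bits = [b for b in config_bits if b in (3, 4)]
    # Enabling two bits in the same stage connects two different
    # signals to one physical wire: an internal short circuit.
    if len(col_bits) > 1 or len(grp_bits) > 1:
        raise ValueError("short circuit: two drivers on one wire")
    if col_bits and grp_bits:
        return inputs[(grp_bits[0] - 3) * 3 + col_bits[0]]
    return None  # output wire undriven

signals = ["n0", "n1", "n2", "n3", "n4", "n5"]
print(route6({1, 3}, signals))   # -> n1 (two bits set per connection)
print(route6({1, 4}, signals))   # -> n4
```

Enabling bits one, three and four together makes both stage-two pass gates fight over the output wire, which is the short-circuit case described above.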
These show which bits you have to set to enable the connection from one signal onto another. If you remember the two-level multiplexer I showed a couple of slides ago, you can see how that's represented in the bitstream: you always have two bits enabled at a time for any connection. Then, looking at something like the bits that configure a LUT inside the database: you have 16 bits to configure a four-input lookup table, and we just map each one to say which bit in the bitstream it corresponds to. One interesting quirk of this FPGA is that, for reasons totally unknown to me, those bits are actually inverted, and that's represented with the little exclamation mark in the database. No idea why they did that, but I'm sure there's a very good reason. Then you have a few more settings. For example, the output widths of a block RAM just have a list of the possible values and the bits you'd set to enable each width. To play about a bit more, I decided to make a textual configuration format, because a bitstream is just a series of bits and not very useful to a human. The idea is: how can we represent this in a way that's still very low level, not like Verilog source or anything, but where it's easy to see what's going on? This was built to test the fuzzing results. So I built some tools to convert bitstreams to and from this textual config format. You can use that to check that you're getting sensible results for simple designs, and to look for unknown bits in bigger designs, to check that you've worked everything out. I also ended up using this format as an intermediate format for post-place-and-route designs. This is what the format looks like: it's split up into tiles, and you have the arcs (the connections), configuration words like LUT initialization, and enums, which are textual settings. This is a slice configured in carry mode.
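That inversion quirk is easy to illustrate (a sketch of the idea, not actual Trellis database code): converting between a LUT's logical truth table and the bits as stored in the bitstream is just a bitwise complement of the 16-bit init value.

```python
def truth_table_to_bitstream(init):
    """ECP5 LUT init bits are stored inverted, so a logical 1 in the
    truth table becomes a 0 in the bitstream (hence the '!' marks in
    the Trellis database). Conversion is a 16-bit complement."""
    return ~init & 0xFFFF

def bitstream_to_truth_table(bits):
    """The inverse mapping is the same complement."""
    return ~bits & 0xFFFF

AND4 = 0x8000                       # 4-input AND truth table
stored = truth_table_to_bitstream(AND4)
print(hex(stored))                  # -> 0x7fff
assert bitstream_to_truth_table(stored) == AND4
```

Because the operation is its own inverse, a round trip through the bitstream representation always recovers the original truth table, which is what the conversion tools rely on.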
As well as bitstream documentation, to end up with a useful FPGA flow you also need to document the internal timing, because you need to know whether or not a design can work at a given frequency. For that, you need to know how big the delays are for the routing and the functionality. And again, like the bitstream documentation, the vendor provides only some very high-level documentation of timing, for example how slow a 16-bit adder might be, but nothing useful enough to build tools with. So again we had to document that ourselves. The vendor tools can create something called an SDF file from a design, which gives you all the delays through the cells, so getting the delays of things like LUTs was easy enough to extract from that. Routing was a bit harder, because they don't tell you the delay of a particular connection; they tell you the delay of a whole net, which contains multiple connections. In the end I had to build up a rough hypothesis of how the model worked and then throw it into a least squares solver, comparing what I thought my model would predict, with a load of unknown parameters, against what the vendor tools say the delays actually are, to work out all the routing delays for the individual switches inside the FPGA. So that's the documentation side of this talk done. Now, from an end user point of view, what's the actual FPGA flow that makes use of this documentation to do useful stuff? The first part of the flow is Yosys. Yosys is an open source Verilog synthesis framework. It now supports multiple FPGA families: ECP5, of course iCE40, the traditional open source FPGA, and synthesis for Xilinx FPGAs. There's some very experimental support for Intel and, I think, Gowin FPGAs, and Miodrag has been working on support for Anlogic FPGAs. So it supports a pretty good range of FPGAs, but it's not just an FPGA synthesis tool. It can do ASIC synthesis too.
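That routing-delay extraction can be sketched as an ordinary least squares problem (a toy reconstruction of the approach with made-up numbers, not the actual Trellis code): assume each net's total delay is the sum of per-class delays of the pips it uses, which gives a linear system A·x ≈ b over the unknown class delays.

```python
import numpy as np

# Hypothetical measurements: each row of A counts how many pips of
# each delay class a net uses; b holds the vendor-reported total
# delay of that net (picoseconds). Real data would have thousands
# of nets and many more classes.
A = np.array([
    [2, 0, 1],   # net 1: two class-0 pips, one class-2 pip
    [1, 1, 0],
    [0, 2, 1],
    [1, 0, 2],
], dtype=float)
b = np.array([520.0, 330.0, 580.0, 590.0])

# Solve for the per-class pip delays in the least squares sense.
class_delays, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print(class_delays)  # -> [150. 180. 220.]
```

With real, noisy vendor data the system is overdetermined and inconsistent, and the least squares fit gives the per-switch delays that best explain all the observed net delays at once.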
It uses Berkeley ABC as its primary route for logic optimization, although there's some work going on at the moment looking at other ways of doing that too. As well as synthesis, it can also do formal equivalence checking and assertion-based formal verification, for which, as far as I know, there is very little else out there in the open source world, and this can be a very powerful way to verify designs. And it's got all kinds of other really obscure things: I think it's got a SPICE back end, it can do simple simulations, it can do transformations, all sorts. So that gets you from your Verilog to a netlist of FPGA primitives, so LUTs and flip-flops, but it doesn't actually tell you how they fit together on the device. That's for the place and route tool to deal with. For that we have nextpnr. This is a new open source multi-architecture FPGA place and route tool. We started developing it in early May as a replacement for arachne-pnr, the existing tool for the iCE40, which was very much a tool for iCE40 FPGAs only; it wasn't really portable to any other FPGAs. The other alternative we looked at was VPR, Versatile Place and Route. That's quite a well-known academic tool, but it's really not very useful for doing real place and route for real FPGAs, generating bitstreams. So the design of nextpnr is very much for real FPGA bitstream generation, unlike VPR, which is more for academic architecture research. And unlike arachne-pnr, the older open source tool, it's fully timing-driven. It's multi-architecture, but unlike previous multi-architecture place and route tools, a nextpnr architecture implements an API; it doesn't just provide, say, a set of fixed XML or JSON files. That gives you a lot of choice in how you store the device database, and you can also implement things like a custom packer to combine logic together. This turns out to be a very architecture-specific task.
You often need architecture-specific logic that's quite hard to describe in a flat file but very easy to describe when you're implementing an API. Looking very briefly at what an architecture has to provide: it has to provide some black-box ID types. We don't even mandate what these are, because some architectures will just have a flat ID, while in others it's easier to represent an ID as a location and an index. Then you have functions like getBels, which returns the list of bels, the basic logic blocks inside the FPGA; getPips, the list of connections; and getWires, the list of wires. And then you have things like getPipsUphill on a wire, giving the pips that can drive that wire, for example. These are specified to return some kind of range of a bel ID, of a pip ID, et cetera. Again, we don't actually mandate how that range is implemented. We just say a range, like a typical C++ range, has to implement begin and end; those return iterators; and the iterators have to implement increment, dereference, and not-equals. This is, I think, quite a bit more liberal than the C++ specification, really; it's all we require. So if you're doing an architecture that's quite small and performance isn't a priority, just return a reference to a std::vector. If you're doing a big architecture, big ECP5 or Xilinx FPGAs, these iterators can be custom walkers over a complicated, deduplicated database structure. So there's a lot of flexibility here. Each architecture has its own folder in the nextpnr source tree, and we build a different binary for each architecture. I know this seems like quite an old-fashioned way of doing polymorphism, but it has a lot of advantages.
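The shape of that API can be illustrated in Python (a loose analogue of the C++ interface, with made-up class names and data, not nextpnr source): the caller only ever iterates, so a small architecture can return a plain list while a big one returns a lazy walker, and the two are interchangeable.

```python
class TinyArch:
    """Small architecture: performance doesn't matter, so just
    materialise everything in ordinary lists."""
    def __init__(self):
        self.bels = ["SLICE_0", "SLICE_1"]
        self.pips_uphill = {"W1": ["P3", "P7"]}

    def get_bels(self):
        return self.bels                      # a plain list is a valid "range"

    def get_pips_uphill(self, wire):
        return self.pips_uphill.get(wire, [])

class BigArch:
    """Large architecture: generate IDs lazily instead of holding
    the whole database in memory; an ID is (location, index)."""
    def get_bels(self):
        for tile in range(10000):
            for z in range(4):
                yield (tile, z)

# Callers never see the difference, they just iterate:
def count_bels(arch):
    return sum(1 for _ in arch.get_bels())

print(count_bels(TinyArch()))  # -> 2
print(count_bels(BigArch()))   # -> 40000
```

In the real C++ tool the equivalent flexibility comes from the begin/end/increment/dereference contract on the returned ranges, which lets the ECP5 backend walk a deduplicated database without ever expanding it.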
We can do heavy compile-time optimization, lots of inlining can go on, and architectures can provide their own data types, but unlike doing this with C++ templates, you don't have the big build cost of all your code ending up in header files and being re-parsed everywhere. So nextpnr has good support for iCE40 and ECP5 FPGAs, and there's some more experimental work going on for other FPGAs. There's very experimental support for Xilinx 7-series, not using a full open source flow but using Torc, an academic project, to get the device database, and XDL going through ISE to do the bitstream generation. But this is very experimental; it's mostly for doing research on very big FPGAs. It literally just supports lots of flip-flops, no RAM, nothing fancy, but we're hoping to develop this Xilinx support further this year. In the future, we're also looking at a so-called generic architecture, where you can build up the FPGA programmatically using a Python API, maybe even specify the list of wires in a CSV file, for example; we're looking at things like that. Over the summer, we started with a very basic set of the traditional FPGA place and route algorithms: simulated annealing placement and a kind of A*-based router. Now that those are working well, we can look at more advanced algorithms. I've been looking at ways of improving the placer, such as path-driven detail placement. Just two weeks ago, I started working on an analytical placer, which will give us much better performance on bigger FPGAs than the existing simulated annealing placer. Meanwhile, others are looking at SAT-based placement and packing, which will give very good quality results but will be quite slow. So that might be good for the small iCE40s when you're really pushing them to the limits. As well as being extendable by writing C++, we've also got a Python API that's usable for writing extensions.
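The A*-based routing mentioned above can be sketched as a search over a routing graph (the graph and delay numbers here are entirely my own toy example, not nextpnr code): wires are nodes, pips are directed edges with delays, and the heuristic is an optimistic estimate of the remaining delay to the sink.

```python
import heapq

def a_star_route(pips, source, sink, estimate):
    """Find the lowest-delay chain of pips from source wire to sink.

    pips: dict mapping a wire to a list of (next_wire, delay) edges.
    estimate: optimistic (never overestimating) guess of delay from
    a wire to the sink; estimate = lambda w: 0 degenerates to Dijkstra.
    Returns (total_delay, wire_path) or None if unroutable.
    """
    frontier = [(estimate(source), 0, source, [source])]
    best = {}
    while frontier:
        _, cost, wire, path = heapq.heappop(frontier)
        if wire == sink:
            return cost, path
        if best.get(wire, float("inf")) <= cost:
            continue  # already reached this wire more cheaply
        best[wire] = cost
        for next_wire, delay in pips.get(wire, []):
            heapq.heappush(frontier, (cost + delay + estimate(next_wire),
                                      cost + delay, next_wire,
                                      path + [next_wire]))
    return None

# Toy routing graph: two candidate paths from wire A to wire D.
pips = {"A": [("B", 100), ("C", 300)], "B": [("D", 250)], "C": [("D", 20)]}
cost, path = a_star_route(pips, "A", "D", estimate=lambda w: 0)
print(cost, path)  # -> 320 ['A', 'C', 'D']
```

A real timing-driven router layers congestion costs and per-net criticality on top of this, but the core search is the same.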
We also use it in place of the Tcl API that vendor FPGA tools tend to have, for implementing timing constraints, doing small manipulations, even prototyping new algorithms. And as well as that, it's got a graphical user interface. You can see this is actually an iCE40 FPGA, with an interactive Python console, and you can explore the netlist there. So, to give some idea of what the open source tools can do: this is an OpenRISC SoC booting Linux, implemented on the ECP5 FPGA, built Verilog-to-bitstream with open source tools, and an open source design in the first place. That's the ULX3S ECP5 board from Croatia. Finally, a very brief announcement. If this stuff is of interest to you, there's going to be a workshop on open source design automation, looking at open source tools for FPGAs and ASICs. That's Friday, March the 29th this year, at DATE 2019, which is an EDA tools event in Florence, Italy. So maybe that's of interest to you. If you're interested in finding out more: Project Trellis, the bitstream documentation, is there; Yosys, the synthesis, is there; and nextpnr; all on GitHub, all ISC licensed, permissive open source. And if you want to get involved, then IRC is a great place: #yosys and ##openfpga on Freenode. The question is, can you build a 10-gigabit NIC with the ECP5 if you paired it up with PCI Express? So the ECP5 has four SerDes in total, and you need exactly four SerDes to do 10-gigabit Ethernet with it, running at, I don't remember the name of the protocol, the 4× 3.125-gigabit protocol, going into a PHY to do 10-gigabit Ethernet. So that's fine, the ECP5 can do that, but then you're left with no SerDes lanes for PCI Express, so you would need to use an external PCI Express PHY over regular IO. So it's doable, but it's very much at the limits of the ECP5. The question is: for a given input, can you guarantee the bitstream will be the same for a build?
Are you talking about against ourselves or against the vendor tools? It certainly aims to be reproducible, yes. There have been a few places where we've accidentally relied on unordered map ordering in C++, which tends to mean that it's reproducible on the same machine but not on other machines, but I think we've got rid of all of those, so it should be reproducible now. The question is: you mentioned RAM; is DDR3 possible? Yes, the ECP5 has IO that's definitely designed to support DDR3. I'm on the last stages of the open source tool support for that; my master's thesis is very much in the direction of getting the tools to a point where a DDR3 controller can work. How do the vendors react to these tools? Lattice are a really nice company. We've had the iCE40 tools out for a few years now and they certainly haven't done anything negative. Their European sales division invited us to give a workshop on them, and things are starting to look quite good. So yeah, Lattice are a very good company. As for the others: no comment. The question is, can you go the other way, from a bitstream back to RTL or Verilog? The answer is yes: we have a tool like that for the iCE40, icebox_vlog, that goes from an iCE40 bitstream back to behavioral Verilog. I haven't actually written something like that for the ECP5 yet, but it's entirely possible; it would just be a case of getting around to doing it. Yes, so the question is, can you explain how you get the timing information from the vendor tools? What you can get from the vendor tools is the list of internal connections on a net, so the list of connections between the fixed wires, the pips of a net, and the total delay of that net over all those connections. So effectively, you assume that each internal connection has a certain amount of delay, and you can split them up into classes; you say all connections of a given class have the same delay. Then you can build up a sum, equate that sum to