So, the next talk is about nextpnr, the FOSS FPGA place and route tool, and it will be given to you by Clifford Wolf, who is a very active member of Project IceStorm. Whoever listened to the previous talk already got a step-by-step introduction to his work, and he will now tell you something about an open source FPGA toolchain. So, give a warm hand of applause to Clifford Wolf.

Thanks. So I'm going to talk about nextpnr, which is our new open source place and route tool for FPGAs. To get us started, I'm going to talk about the ecosystem of open source FPGA place and route tools that we have right now; there are obviously a couple of ASIC place and route tools as well. There is VPR, which is about 20 years old. It has timing driven algorithms, and it's fairly portable in terms of porting it from one FPGA architecture to another, but the focus of VPR is more on architecture exploration: it allows you to easily try out different variations of your interconnect architecture, things like that. But it's not really aimed at creating bitstreams, creating placements and routes for real world FPGAs, because real world FPGAs are actually not as regular as the made-up FPGAs that VPR is targeting. So we needed something else: not only to do architecture exploration on theoretical devices and maybe do some experiments with new placement and routing algorithms, but to target real devices. So when we did Project IceStorm, we wrote our own place and route tool just for that, which was called arachne-pnr. By now it's about three years old. It was not timing driven, so it would just create a valid routing and a valid placement for your design, and afterwards you could run timing analysis and learn how fast it is, but you couldn't really tell the tool: actually, this net should meet this timing constraint, so please take this into consideration when you place stuff, when you route stuff.
And it really only supported the iCE40. So we thought about maybe taking arachne-pnr and extending it with timing driven algorithms and architectures beyond the iCE40, but it was really written and dedicated just for the iCE40 architecture; it was not meant to be a portable tool. So given this situation, we decided we wanted a new place and route tool that we can use on our way forward to creating open source tools that target actual devices that you can buy, development boards that you can use. So we started nextpnr. It's brand new; we started working on it this year. It's timing driven, it's very portable, and it's a team effort, which is completely new for me. If you follow my projects, usually my projects are: I lock myself in a dark room for a couple of months and then I emerge and say, look what I've done. But now we have a couple of people all working on this, and it's a really great experience for me. One of the things we get when it's not just me fiddling with something, when it's a team of people, is the kind of features that I'm not really good at implementing. For example, we have a GUI, and you can actually see the schematic of the device and zoom in, and I'm going to show you this. But of course it still works very well with Makefiles, and for most projects you never use the GUI; we just run it from a Makefile or a shell script and it does what it's supposed to do.
There is an API for supporting new architectures, which is a dissimilar approach to what many other tools do. The usual approach would be to have a common data structure, and you just have to create this data structure in memory to describe your target device. We don't have that. Instead, every architecture backend must implement the same API, and this API is then used by the algorithms to access the architecture database. The reason we do that is because we think, I think, that one in-memory representation might not be universally a good representation for all the different target devices that we want to support, and I'm going to talk about that a little bit on a later slide. Right now we have support for iCE40 and ECP5, so these are the things that we actually recommend you use. There are also some other things that are ongoing. We have a so-called generic target; I'm not quite sure if it actually builds right now, it might have bit-rotted a little bit, but the idea is that you can use it to generate an architecture database on the fly using a Python script, and with this we can also cover the architecture exploration aspect of things a little bit. We also have an experimental binding to Torc, which allows us to do place and route for Xilinx 7-series devices, for example. But those are things we don't recommend everyone out there use; this is more the kind of stuff we encourage you to have a look at if you're really interested, and maybe if you would like to spend some time improving those things yourself. As a user, right now, we are good for generating designs for iCE40 and ECP5. And some of the future plans here are already implemented, so this is a reduced slide: we have floorplanning now, we support multi-clock designs, and we support complex constraints like relative placement constraints and things like that.
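To illustrate the idea of an architecture API rather than a shared data structure, here is a minimal sketch in Python; the class and method names are invented for illustration and do not match nextpnr's actual C++ API.

```python
from abc import ABC, abstractmethod

class ArchAPI(ABC):
    """Every architecture backend implements the same query interface,
    so the placer and router never see the underlying storage format."""

    @abstractmethod
    def bels(self):
        """Iterate over all placement locations (bels) on the device."""

    @abstractmethod
    def bel_type(self, bel):
        """Return the type of a bel, e.g. 'LUT4' or 'DFF'."""

class TinyArch(ArchAPI):
    # A toy backend that happens to store its "database" as a flat dict;
    # another backend could use a compressed or tile-based store instead.
    def __init__(self):
        self._bels = {"X0Y0/lc0": "LUT4", "X0Y0/ff0": "DFF"}

    def bels(self):
        return iter(self._bels)

    def bel_type(self, bel):
        return self._bels[bel]

arch = TinyArch()
print(sorted(arch.bels()))        # ['X0Y0/ff0', 'X0Y0/lc0']
print(arch.bel_type("X0Y0/lc0"))  # LUT4
```

The point of the indirection is that an algorithm written against `ArchAPI` works unchanged whether the backend stores a flat list, a de-duplicated structure, or a per-tile database.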
Okay, so this is actually supposed to be a video, and I spent the last hour strace-ing GStreamer to figure out why it doesn't work, but I guess you can see what the video shows. This is just to demonstrate that we can actually do non-trivial designs using this tool; this is an example for the iCE40. On the iCE40 side of things we support all of these devices here, and we have a full Verilog-to-bitstream flow. We use Yosys for synthesizing the design, and Yosys in this setup will generate a JSON file. This JSON file is then read by nextpnr, and nextpnr will create one of those .asc files for IceStorm, and then IceStorm will read that file and generate the actual bitstream. The board that you see in this non-playing video is an iCEBreaker FPGA board, and it's the board that we are pushing right now, there's a Crowd Supply campaign running, as a platform for teaching people FPGA design. We've worked with a couple of different dev boards over the last few years, and then we took all the pet peeves we had with those development boards and created a list of requirements: what we think a good dev board should have if the target application is not necessarily prototyping commercial products or something like that, but teaching people FPGA design from scratch. Personally, I really like the iCE40 as an architecture for that, because the iCE40 is a fairly small FPGA. When you do some teaching projects and you have a ridiculously large FPGA, people can write ridiculously inefficient designs and it will still fit, and they will never learn that actually your seven-segment decoder should not take 150,000 gates. Whereas with the iCE40 you can set a goal that is a little bit challenging for the size of the device, and then people actually have to learn how to make efficient designs in order to achieve that goal. Okay, so here's another video that doesn't play.
This one demonstrates that we can target the ECP5. We have support for, I think, all of these devices, but David will know that better. Again we have a full Verilog-to-bitstream flow: we use Yosys for synthesis, this will generate a JSON file, and then we read that JSON file into nextpnr. nextpnr will do place and route, and then we're using the Project Trellis infrastructure for bitstream generation and actual device programming. In the video that doesn't play, you would actually see Linux running on an SoC in that FPGA, just to demonstrate that we can do place and route for fairly non-trivial designs like a Linux-capable SoC. And if the video would play, you would see that we can use shell commands to turn the LEDs on the dev board on and off. Okay, so this video apparently starts with a black frame. That video just shows a Xilinx dev board blinking, to convince you that we also have support for Xilinx devices. There is a reason why the video that you don't see shows a little blinking example and not a Linux-capable SoC: right now we still have a couple of timing issues to figure out when targeting the larger FPGAs. But we are working on that, and I'm very confident that by the end of next year we might have completely different things where we say, oh, this is the stuff that we can't do yet, and this, of course, will be no big deal. Okay, since this is based on Torc, I should tell you what Torc is. Torc is a library that can be used to do place and route experiments, I would say, for Xilinx devices. And this will generate a so-called XDL file, which is something that you can read into the old Xilinx tools, the ISE tools, to actually generate the bitstream. So we don't have a full end-to-end open source flow here right now. However, what we do have is something that takes over the difficult part in the context of nextpnr, which is the actual place and route.
Only for the bitstream generation do we need to defer to ISE in this case, but it shouldn't be too difficult to set up something similar using the Project X-Ray database, or using Xilinx RapidWright, which is a rather new thing from Xilinx that allows people to generate custom bitstreams. Okay, a screenshot. We will see more of the GUI at the end during the demo, but this is what it looks like when you run it in GUI mode. We have a device view; in this case it's a placed and routed design, so we can actually see the placed cells and individual nets. Just to make this screenshot more interesting, I decided to color some of the nets. The thing that we see on the left, the magenta nets, is a carry chain, for example. You see we have a console window showing the output that would just be visible on the normal terminal if you didn't start it in GUI mode. At the bottom you can see there is actually a Python shell, and that allows you to access all the internal data structures and kick off the algorithms and things like that. The right-hand side of the screen is dedicated to something that lets you investigate both the device database describing the architecture that we are targeting, and the netlist that we are trying to implement on that architecture. So bels, wires, and PIPs, these are the device-specific categories of objects, and cells and nets belong to the netlist. Okay, a few words about the overall architecture of nextpnr. We have a couple of what I would call frontend components. Some of them are common, which means shared between different architectures, and others are architecture specific. For example, the code that reads a netlist is common; we need the same code to work for all architectures. But the packer, for example, where we take individual lookup tables and pack them into larger logic cells, is architecture specific.
Every architecture needs its own packer. And of course we are trying to design nextpnr in a way that lets us move as much code as possible into the common bits and reuse it in the architecture-specific bits. So on the one hand, for each architecture you will need to write your own packer, but on the other hand we try to provide infrastructure that makes it fairly easy to do so. Then there is a router, of course, and the placer, but they might be interleaved with similar things that are architecture specific. In a couple of scenarios you might have different routers: architecture-specific routers that handle specific clock routing tasks, things like that, and then you might follow that with the generic router that routes all the remaining ordinary nets, depending on what is good for the architecture that you're targeting. Those algorithms access a database that contains the design that we are placing right now, the netlist, and they access the architecture database through this architecture API. That means the architecture database itself can be stored in very different ways for different architectures. This is indicated in this picture: we have a couple of different implementations of that architecture API, but then we have even more architecture databases, because we might have one architecture with one implementation of the API, but then five different databases in the back for the five different device types available in the family that you're targeting. Okay, so I have two slides about the nomenclature that we use within nextpnr. It's not so much because I'm going to use this nomenclature a lot in the slides that follow, but more to give you a better impression of what kinds of objects we are managing in nextpnr. That's the reason why I'm talking about this.
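The interleaving of architecture-specific and generic routing passes can be pictured as a simple pipeline. This is a hypothetical Python sketch of the idea, not how the C++ code is actually organized; the pass names and net naming convention are invented.

```python
# Hypothetical sketch: a flow is an ordered list of passes. An architecture
# contributes specialized passes (e.g. a dedicated clock router) and reuses
# the generic router for everything that remains.

def clock_router(nets):
    # Architecture-specific pass: pull out clock nets and route them
    # on dedicated clock resources first.
    clocks = [n for n in nets if n.startswith("clk")]
    rest = [n for n in nets if n not in clocks]
    return clocks, rest

def generic_router(nets):
    # Generic pass: route whatever ordinary nets remain.
    return list(nets)

nets = ["clk_main", "data0", "data1"]
routed_clocks, remaining = clock_router(nets)
routed_rest = generic_router(remaining)
print(routed_clocks)  # ['clk_main']
print(routed_rest)    # ['data0', 'data1']
```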
So there is the design database, the netlist we would like to implement on the target architecture. In this, we have cells, which are the things like lookup tables or logic cells or flip-flops that you can instantiate on your target device. Each of those cells has one or more ports. I guess most cells have more, but there are some that actually have only one. Those ports can be connected to nets, and a net, of course, can be connected to another port of another cell; that's how we express the connectivity of the entire design. When we look at a specific net, there is one port on one cell that is driving that net, and we call that the source. Then there are other ports, from possibly other cells, and we call those sinks. Each net has exactly one source, but it can have an arbitrary number of sinks. A source-sink pair we call an arc, and that's a very important entity in the routing algorithm, because the nextpnr routing algorithm actually prioritizes individual arcs, whereas VPR, for example, can only prioritize nets as a whole. Okay. So this is the design database side of things, the objects that we manage there. And then there is the architecture database. The big distinction here, of course, is that the architecture database is static: I say I would like to target this or that part, and then it's always going to be exactly the same database, whereas with each run of nextpnr I might give it a different netlist, a different design to implement. Okay, what kinds of things do we have in the architecture database? There we have bels, short for basic elements, and these are the actual things that physically exist in your FPGA. The placement part of place and route is essentially finding an assignment for each cell in your design to a bel on your chip. And similar to the ports on cells, we have pins on bels.
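As a rough illustration of that terminology, here is a toy netlist model in Python; the class and field names are mine, not nextpnr's. A net has exactly one source and any number of sinks, and each source-sink pair is an arc that a router could prioritize individually.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Port:
    cell: str   # name of the cell this port belongs to
    name: str   # port name on that cell, e.g. "O" or "I0"

@dataclass
class Net:
    name: str
    source: Port                               # exactly one driver
    sinks: list = field(default_factory=list)  # arbitrary number of sinks

    def arcs(self):
        """Each (source, sink) pair is one individually routable arc."""
        return [(self.source, sink) for sink in self.sinks]

net = Net("data0",
          source=Port("lut_a", "O"),
          sinks=[Port("ff_b", "D"), Port("lut_c", "I1")])

print(len(net.arcs()))  # 2
```

A timing driven router can then give a tight arc (say, into a critical flip-flop) a shorter route than the other arcs of the same net, which is exactly what per-net prioritization cannot express.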
Then we have wires that make up the interconnect of the chip, and we have ways of connecting those wires together in a very dynamic fashion. Those things are called PIPs, programmable interconnect points. Each PIP essentially has a source wire and a sink wire, and I can activate or deactivate the PIP. Similar to placement being the creation of a mapping between cells in your design and bels, routing means creating a mapping from all the nets in your design to wires and PIPs on the chip. Then there is another thing in our architecture database, which is called groups. Groups are not used by anything except the GUI; they are just a means to group things into larger units. Usually when you look at an FPGA, depending on the vendor, you might have organizational units like slices, and then a couple of slices make up a tile, and those tiles maybe make up a column, and so on. You can use those groups to preserve this kind of structure in the GUI. So say you have an FPGA, you built your own FPGA, you documented an FPGA, something like that, and you would like to add a new architecture to nextpnr. What would that look like? The first thing you need to do is create a new top-level directory in the nextpnr code base and add a little bit of CMake magic to make it appear as an architecture. Then, in the directory you created, you need to add at least two files: one is called archdefs.h and the other one is called arch.h. archdefs.h contains a couple of data types that are architecture specific. For example, one of those data types is called BelId, and different architectures might use different underlying data structures to represent the BelId. On one architecture it might just be an integer; on another architecture it might be a struct with an X and a Y coordinate identifying a tile, and then another integer that is the bel index within that tile.
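To make the wires-and-PIPs picture concrete, here is a toy routing sketch in Python over a made-up three-wire interconnect (all wire and PIP names are invented): routing one arc means picking the set of PIPs to activate so that the source wire is connected to the sink wire.

```python
from collections import deque

# Made-up interconnect: each PIP is a directed (source_wire, sink_wire)
# edge that can be switched on or off.
pips = [("lut_out", "span_h1"), ("span_h1", "span_v2"), ("span_v2", "ff_in")]

def route_arc(src_wire, dst_wire):
    """Breadth-first search over PIPs; returns the PIPs to activate."""
    prev = {src_wire: None}
    queue = deque([src_wire])
    while queue:
        wire = queue.popleft()
        if wire == dst_wire:
            break
        for pip in pips:
            if pip[0] == wire and pip[1] not in prev:
                prev[pip[1]] = pip
                queue.append(pip[1])
    # Walk back from the sink wire to collect the activated PIPs.
    path, wire = [], dst_wire
    while prev.get(wire) is not None:
        path.append(prev[wire])
        wire = prev[wire][0]
    return list(reversed(path))

print(route_arc("lut_out", "ff_in"))
# [('lut_out', 'span_h1'), ('span_h1', 'span_v2'), ('span_v2', 'ff_in')]
```

Real routers replace the plain BFS with cost-driven search and handle congestion, but the output is the same kind of object: a wire/PIP assignment per net.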
We want each architecture to be able to use the most compact representation possible for that architecture, without losing the flexibility that we need to support really large devices. The way we do that is by making all of those data types architecture specific, so each architecture can choose whatever it wants, and the C++ compiler will try to do its best to actually implement it. So archdefs.h is where the data types go, and arch.h is where one big class goes, which is called Arch. This class has to implement about 100 different methods that make up this architecture database API. Then you might need to add a couple of additional files, C++ files with additional functions for your architecture. The most important one here might be main, because we actually don't have one main function for nextpnr. When you run CMake, you tell CMake which nextpnr implementation you would like to build: there's one binary nextpnr-ice40 and another binary nextpnr-ecp5, and each of them has a different main function, because you might have different command line arguments and things like that specific to different architectures. Okay, so this architecture API is settling down right now; we haven't changed much lately. There are still a few things that I want to change. Long term we need to freeze it somehow, but it's still moving a little bit, so don't implement like 100 architectures at the same time right now; go ahead slowly, one by one. Also, with each architecture we add, we might identify a new piece of functionality that we actually need in that architecture API. Because of that, we are trying to implement one architecture after the other rather slowly, but we also want to cover a lot of different spots in the overall FPGA design space, so that we are sure we have all the functionality in place by the time we say: okay, this is now the frozen 1.0 version of this API.
Okay, implementing about 100 methods sounds like a lot, but the thing is that those methods are very, very basic. In many cases you have something like an in-memory representation of the architecture database, and then each of those functions is maybe two or three lines of C++ code, just doing a little bit of pointer arithmetic to look up the right thing in memory and then read whatever it is you're querying with that function. There is no getBelInfo function, for example, that would give you all the information about a bel; instead, for each attribute that you might be interested in, there is a separate getter method. Because of that, we have a lot of methods that you need to implement, but each of them is relatively simple and small. Okay, so say you would like to implement your own architecture backend for nextpnr. You would like to write these 100 methods, but first you need to figure out how to store the information that makes up your target architecture database in memory. We think there are three main approaches to doing that, but the nice thing is we have a very generic architecture, so if you identify a fourth approach that's not on the list, then hopefully you can hide this fourth approach behind the same API and everything is fine. The three approaches that we think are relevant are, first, a completely flat database. That means we have an explicit entry in our database somewhere for each and every wire, each and every PIP, each and every bel in our device. That means if the device is twice as large, the database is twice as large. But it also means that if my device is super irregular, we can still just implement it and have a flat database for a very irregular architecture. We use a flat database for the iCE40, simply because the iCE40 chips are small enough that it makes sense to just have a flat database. But as you can imagine, as your chip becomes larger and larger, your database becomes larger and larger.
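The fine-grained getter style over a flat database can be sketched like this in Python (nextpnr's real API is C++, and the names here are invented): instead of one call returning a whole record, each attribute gets its own tiny accessor over a flat array.

```python
# Hypothetical flat database: one entry per bel, stored in plain lists,
# so a lookup is just indexing (the Python analogue of the two or three
# lines of pointer arithmetic in a C++ backend).
BEL_TYPES     = ["LUT4", "DFF", "LUT4"]
BEL_LOCATIONS = [(0, 0), (0, 0), (1, 0)]

def get_bel_type(bel: int) -> str:
    # One attribute per method: there is no "get everything" call.
    return BEL_TYPES[bel]

def get_bel_location(bel: int) -> tuple:
    return BEL_LOCATIONS[bel]

print(get_bel_type(1))      # DFF
print(get_bel_location(2))  # (1, 0)
```

With this style, a backend that stores the data differently only has to reimplement the small accessors, not convert its whole database into a shared format.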
And at some point this is not really feasible anymore. There are two different approaches to solving this issue. The next step would be a de-duplicated database. The basic idea behind a de-duplicated database is that we first create a flat database and then look in that flat database for regularities. For example, a wire with a certain layout or structure, and maybe there is a separate wire with the same physical geometry just one tile to the right; then we can merge these two entries, and we only need the data once in our database, referenced from two different locations. The nice thing about a de-duplicated database is that you can still start with a flat database. So if your chip has a couple of irregularities, you don't need to do anything special: whenever there is an irregular structure on your chip, it will just fail to de-duplicate, and everything will be fine. But the downside is that your chip must be at least small enough that it's feasible to generate a flat database in the first place, which can then be de-duplicated. For really large chips that's not really an option. So for really large chips you have to use a different approach, which I call a tile-based database. With a tile-based database you say: this is the tile structure behind my chip, and I have a small database, in essence, for each and every tile type, and then I just have a large table that tells me which tile in my chip is of which type. The nice thing here, of course, is that if your device is twice as large, the database is only a tiny bit larger. And if you have two different devices in your chip family, the database that covers both of them might only be a tiny bit larger than the database that covers only one of the two. Right now, for the ECP5, we use a de-duplicated database, and the long-term goal for Xilinx 7-series support is a tile-based database. Okay, there is a Python API.
Who here has written things in Tcl at some point in their life? Who feels bad about it? Yeah, so we decided we won't go with the industry standard: we don't have Tcl, we just expose everything we have to Python. On the one hand, that makes it easy to do prototyping of algorithms and things like that; you can just try stuff out in Python quickly, and if it works, you can see whether you can make it perform better by rewriting it in C++. But we also use Python, essentially, for constraints. There are some simple constraints that you can set, like these clock constraints here that just tell the tool what the clock frequencies are on different clock nets. But you can also write actual code in Python, things like iterating over all the cells in the design and checking whether those cells have a certain attribute set, and if they do, adding them to a certain placement constraint. I think this is much more flexible than other approaches, especially if you have larger designs and constraints where it's really hard to explicitly write a list saying this should go here, this should go there. Instead, you would like to have a small program that just looks at the design and extracts the placement information from hints that are already there, using things like cell attributes or simply the hierarchical names of objects in your design. So what are the next steps for the project? First of all, of course, we would like to slowly replace arachne-pnr as the main place and route tool for the iCE40 and Project IceStorm. We're actually quite far along with this; many of the things that use Project IceStorm are already switching to nextpnr. We don't want to be too aggressive about that, because when everyone switches at once, we get all the bug reports at once. But I'm fairly happy with the rate of progress in moving people from arachne-pnr to nextpnr.
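As an illustration of that constraints-as-code idea, here is a hedged Python sketch. The `ctx` object and its methods imitate the general style of scripting a place and route tool but are invented for this example; do not treat them as nextpnr's real Python bindings.

```python
# Hypothetical stand-in for the context object a place-and-route tool
# might expose to a Python constraints script.
class Ctx:
    def __init__(self):
        self.cells = {
            "cpu.alu_lut": {"attrs": {"KEEP_TOGETHER": "alu"}},
            "cpu.add_lut": {"attrs": {"KEEP_TOGETHER": "alu"}},
            "uart.tx_ff":  {"attrs": {}},
        }
        self.groups = {}

    def add_to_placement_group(self, cell, group):
        self.groups.setdefault(group, []).append(cell)

ctx = Ctx()

# The "script" part: derive placement constraints from attributes that
# are already present in the design, instead of listing cells by hand.
for name, cell in ctx.cells.items():
    group = cell["attrs"].get("KEEP_TOGETHER")
    if group is not None:
        ctx.add_to_placement_group(name, group)

print(sorted(ctx.groups["alu"]))  # ['cpu.add_lut', 'cpu.alu_lut']
```

The same loop keeps working when the design grows from three cells to thousands, which is the flexibility argument over a static constraint list.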
Well, of course there are many, many ideas we have for further improving our placer and router. Some of them are driven by concrete needs; for example, we look at the results we get on Xilinx and then say, okay, we need to fix this and this and this to make it work as well as we want it to. Others of those improvements are just ideas, stuff that we would like to try out. And of course this is also something where we are looking for other people to use this, for example as a basis for academic research and many other things. So if you have any ideas about how a placer should work, how a router should work, and you would like to find out whether it's feasible to actually build a placer or a router using that kind of algorithm, then hopefully nextpnr is a good framework for doing these kinds of experiments yourself. And of course, support for more architectures: the whole point is that we want something that is portable to many, many architectures, so we will just slowly add one architecture after the other, and we hope that long term we can do that with actual support from vendors, so we don't have to start by staring at bitstreams and figuring out what individual bits do just to support things in our place and route tool. And once we've got that, hopefully we will have world domination, for the small world of FPGA place and route tools. Okay, some comparisons between arachne-pnr and nextpnr. This is nextpnr-bench; right now there are only a couple of designs in there, and these are 10 runs with each tool with random seed values, and then the average.
You can see in the center column of this table the maximum frequency: on average nextpnr has a 30 percent better timing result than arachne-pnr, so that's definitely an improvement. But on the right-hand side you see that on average it's like 50 to 100 percent slower than arachne-pnr; I guess that's the price you pay for actually doing everything timing driven, but I have something more about this in the demo as well. This here is nextpnr against arachne-pnr over time. We still do bugfixes in arachne-pnr, but for the most part arachne-pnr is a pretty stable artifact, so we compare the quality of results, in terms of maximum frequency, against arachne-pnr at more or less regular intervals, which means more or less whenever I have time to run the script. You can see it's improving, so I guess if we can keep up the overall rate of improvement for another year, we will produce designs that are faster than physically possible. So no, we can't keep up this rate of improvement for another year; we are already pretty close to what can actually be done. Okay, it's time for a little demo. For this I need to change my settings a little bit; hopefully that works. Okay, yes, so we can both see the same thing now. Hmm, this is not large enough; how is that? Good? I can only see the people in the front, and the people in the front tell me it's good. So the first thing I'd like to show you is this here. This runs Yosys to do synthesis for a really, really small blinky design, and then we run nextpnr to do the place and route for that, and this creates a .asc file, which is a file format used by IceStorm, and then we use icepack to actually generate the bitstream. So these three commands are the entire flow from Verilog to a bitstream, and it's a small design, so it's just a blinky on an iCEstick. So, who here has already had any experience with a commercial FPGA flow? Okay, so I'd like you to imagine, if you take the commercial tool of your
choice and run a blinky design from Verilog to a bitstream that you can program, the entire flow, how long that would take, and we will compare what you have in your head with what we have here. So, that was a third of a second. Some people will say, yeah, but this is a super small design, that's not what the industry is interested in; the industry is interested in whether it takes five hours or ten hours to do place and route for a really large design. But I think that's exactly missing the point. The point is that every application is different, and when you have an open source flow, you can optimize it for whatever aspect is important to you. If you have a closed tool, you just have to live with whatever optimization goals the vendor has set for you. Okay, so that was that. This is now a much larger design: it's a PicoRV32, a 32-bit RISC-V SoC, running in an HX8K. I already ran the three commands beforehand, so we can go directly into nextpnr when I run this thing here. Okay, so this is an HX8K, an 8000-LUT device from the iCE40 series. We can zoom in here, and I can select individual things, like nets and bels. What I did now was run the packer, this icon here, and you can see the design we have has almost 2000 cells. The next thing in the flow would be to assign a timing budget, and of course there are more fine-grained ways to do that, but for now we just say: make the whole thing run at 50 MHz, and all the clocks are now constrained to 50 MHz. Then we run the placer, and we can actually watch the placer moving cells around. The placer algorithm, as we use it here, is a two-stage process: in the first stage we leave the relative placement constraints, for stuff like carry chains, looser, and then we strengthen them iteratively, and then we reach... okay, I can't scroll up while it's running... then we reach this point here, where we legalize them and make sure that the carry chains are actually
intact, and from that point onward we always keep all those constraints intact. Then we end up with the placement and the post-placement timing report, and at least the post-placement timing report looks pretty good for the 50 MHz target here; I'm quite happy with that. Then we run the router, and the router is fairly fast, as you can see, and we are done. Now we can, if we want, write a bitstream file. Why is it not... so this is the .asc file we just wrote, and then we can use the IceStorm tools to convert that into a bitstream file. Okay, so I guess with that I'll open it up for questions.

Thank you very much, Clifford. We have an ample amount of time for questions. Please line up at the microphones, and remember to hold them close to your mouth so we can understand you. Microphone one, please.

[Question about using SAT solvers for placement.] So, one of the projects we're pursuing right now is looking into using SAT solvers for placement. Right now we don't really have working code for that. My personal preference for a solver there would probably be Glucose or CryptoMiniSat. Oh yeah, I recognize you from your Twitter.

Microphone one again: Well, since the tool generates this sort of blob of logic in the middle, won't the FPGA overheat in the middle while being basically cold on the outside? Wouldn't it create a huge temperature gradient?

No. So, the temperature gradient is not really... there are reasons why I would like to have things fanned out a bit more over a wider area, specifically to help with routability, but especially if you give it timing constraints it tends to put everything closer together. But I mean, this device, I would guess it's less than a square millimeter of die size, and the temperature should not really vary much. Also, this is an ultra-low-power FPGA, so it shouldn't get warm anyway; at least that's what the vendor promises.

A question from the internet? Yes, thank you: can you
modify the final placement in the GUI interactively?

Not in the GUI. So, can you unplace things in the GUI? That might be possible, I'm not sure. Where is the cell... bound cell... no, I mean, what is it: unbind the bel, and then the name of the bel, just like this here, X16 Y17, logic cell 6, I guess. And now the bel is available. So if typing this Python command counts as "in the GUI", then yes, you can do things in the GUI.

Okay, any more questions from the internet? Yes: have you thought about utilizing hardware acceleration, such as OpenCL or something?

I didn't quite understand the question. Hardware acceleration, like OpenCL? Yeah, I'm not quite sure how it relates to place and route, because usually when you do OpenCL acceleration, you would run it through something like a high-level synthesis tool, you end up with Verilog, and then you run that Verilog through your regular synthesis chain. Or do you mean using it for place and route? We did have discussions about that, but nothing concrete right now.

So, microphone two, please: Hi, you said place and route is completely timing driven. In Vivado there are also strategies like runtime-optimized or area-driven. Are these also possible?

So the area-driven stuff usually lives more in the synthesis tool; there's not that much the place and route tool can do there. And I would say most of what we do right now on the synthesis side with Yosys is actually more area driven, because right now we don't really have timing constraints in synthesis. So the more embarrassing question for me would be: do we have timing driven synthesis in Yosys yet? But luckily this is the nextpnr talk.

Okay, another question from the internet? No, thanks. Okay, I would say this concludes this talk and the Q&A session. Let me just close with this: there is an area over in hall two, next to the hardware hacking area, the open FPGA area. We have white umbrellas with "open FPGA" written on them, so if you have any questions,
just come find us there, and we also do beginner workshops with these iCEBreaker boards. Okay, a lot of stuff to be done, so have fun with that, and let's give a warm hand of applause to our speaker again.