Everybody, I'd like to introduce Yann Sionneau to you. He is going to explain to us how he ported NetBSD to the LatticeMico32 soft CPU, and how he had to modify the CPU before being able to run NetBSD.

Good morning everybody. Thank you for attending. So as Martin said, I'm going to talk about what I did as a hobby project recently. It involved porting NetBSD, and EdgeBSD which was presented yesterday, to an open source CPU called the LatticeMico32.

So okay, I've already been presented. I will be joining M-Labs in a few days, which was formerly known as the Milkymist community and which has been incorporated recently by its founder. It's a cool company doing interesting open source hardware and software stuff, like the Milkymist board, if you've heard of it, or the Milkymist system-on-chip. I will talk about it a bit later.

So okay, let's talk about this port, and more specifically the port on the Milkymist One board. In the first part I will talk a bit about the hardware, so about the MMU, and then a bit about software.

But first, what's the Milkymist One? It's this. It's an electronic device aimed at generating video effects in real time, synchronized to a lot of sources, like for instance audio input, or MIDI or DMX events, etc. So it's a kind of artistic device that can be used a bit like this: you film someone, a performer, and then you project against the wall at a party or at a concert, and you apply nice real-time video effects like rotation, zoom in, zoom out, blurring, etc. And you can interact with all kinds of party-like devices, like a MIDI keyboard. So that's the device, and it produces nice effects like these ones. Those are screenshots of the device output.

And the cool thing about this device is that it's not just any closed source commercial device, because it uses an FPGA as its main component, and most of the interesting parts of the device's functions are implemented inside this FPGA. So it's not fixed in time, you can play with it.
You can modify it, you can do your hack stuff on it, and it's pretty cool.

So what's an FPGA? It's a chip, basically like any other, but you can configure it to behave a bit the way you want. You've got logic blocks laid out as an array, and I/O blocks, and you can configure each logic block to do something, a logic operation like AND or XOR or stuff like that. Then you have a switching matrix that allows you to interconnect those logic blocks, and you decide which block you connect to which other, and then you can basically create a logic circuit. You can do this a lot of times, and you can basically implement, almost for free, an ASIC chip. An ASIC chip would usually cost thousands or hundreds of thousands of dollars.

So this is the PCB, the electronic circuit, which is also open source. It contains this FPGA, and on the FPGA runs the Milkymist system-on-chip, which is also open source. So it's a whole bunch of function blocks which are controlling, I don't know, for instance the USB, or the sound, or MIDI, DMX, the UART, etc., or the DRAM or the frame buffer. All those controllers are making the device work. And then on the bottom left you've got the LatticeMico32 CPU. That's the main component I will be talking about. It's the soft core CPU which is running on this FPGA and which will be running NetBSD.

So okay, LatticeMico32. It's a 32-bit architecture. It's RISC, big-endian, and it has six pipeline stages. It is fully bypassed.
So it's pretty okay in terms of performance. You can have caches or not, you can disable them, and they are up to two-way set associative. It's using the Wishbone on-chip bus, which is an open source bus specification used by the whole OpenCores community of open source hardware cores. So you can easily find a lot of open source cores which talk this on-chip bus protocol. There are tons of them.

The CPU actually works on pretty much every FPGA, because it's device agnostic. It doesn't use specific blocks from one specific FPGA vendor, for instance. So you can run this CPU on the DE0-Nano, which has an Altera chip, or on the Milkymist One, featuring a Spartan-6 FPGA from Xilinx, on the Papilio Pro, or on the Mixxeo, which is a new project from M-Labs, a video mixer, also open source, or even on the new Kintex-7 series from Xilinx. All of these can run, and actually do run these days, the LatticeMico32.

So why is it interesting? It's small. It really doesn't take a lot of resources inside the FPGA. As I said, it's portable, and it's pretty fast. A lot of the open source CPUs that you can find on the internet usually run at 20 or 25 megahertz, and this one can run at up to 200 megahertz on recent chips. It actually works. There are a lot of CPUs, hobby projects, that you can find on the web, but most of them are quite buggy, if they actually work at all in the end. So it's one of the few that works. And it has okay software support: we've got GCC, binutils, GDB, even QEMU, which is really handy. And it's open source.

Now to the bad points. It had no memory management unit, so I couldn't port NetBSD. But that was the first thing I fixed. I worked on this with Michael Walle, it was a two-man job. And the CPU is used in some closed source commercial ASICs, so it's pretty solid.
It's been implemented in real chips, and it can achieve a pretty decent frequency even in fairly old processes.

So it's a standard RISC pipeline. You've got the address computation, then the instruction fetch, then decode, execute, load/store and write back. I think all of you already understand all of this.

So, okay, this is a simplified drawing of the CPU, the original CPU, how it was working. We've got the pipeline on the left, then you've got the two caches, instruction and data, on the right, and then main memory is just off chip. Originally there was no MMU and no notion of virtual addressing, so physical addresses were used everywhere, talking almost directly to the RAM chip, and there were no translations. So it was okay to run uClinux for instance, which does run to some extent, but as I understood it, NetBSD couldn't work on such a system.

So we modified it. We added this memory management unit, so that the pipeline on the left only manages virtual addresses, and then they get translated and go through to main memory as physical addresses. In this way we were able to have something that can run NetBSD.

So, okay, a bit about the MMU's job. It basically translates a virtual address into a physical address, and it also gives you memory protection. For instance, it allows you to say: okay, the stack is not executable; and okay, this is data, so you can write to it; or maybe this is read-only data, so I don't want to allow you to write to it. It gives you this kind of security.

So let me explain how it was implemented in this case. This is the same diagram, but even more simplified. We've got the CPU pipeline giving some virtual address to the memory management unit, and it gets translated. But okay, how does the MMU know how to translate this virtual address into the physical address? How does it work?
Okay, it's using the page table, but the page table, as you can see, is located in the DRAM. So we couldn't access the DRAM each time we do a translation, that would be terribly slow.

So what was done... oh, no, okay, first, sorry: why are we talking about pages, when before I was talking about addresses? Why do I switch words? Because in reality you're not just translating addresses. If you were just doing a translation from one address to another, like this, so for instance the address 4 to the address 0x1004, 5 to 0x1005, you would need to remember all those translations. There would be a lot of data to remember; you would need to remember all those translation lines, and to translate four gigabytes you would need something like four billion lines. So it can't work like this. To be more efficient you use pages: you take big chunks of memory and you translate those big chunks, which we call pages. In this case it's implemented with four-kilobyte pages. So by just having one line you can translate four kilobytes, and it's more efficient like this. That's what everyone is doing, basically.

So, okay, we're accessing this page table in RAM, but we're not doing it all the time, because it's really slow. DRAM is really slow. So we are going through what's called the TLB, the translation lookaside buffer. It's just a cache, like the instruction or data cache, but it caches translations of virtual addresses to physical addresses. This way, if the TLB contains the information you're looking for, in just one cycle you get your physical address translation and then you can play with it.
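To put rough numbers on the argument for pages made above, here is a small piece of arithmetic in Python. It only assumes what the talk states (a 32-bit address space and 4 KB pages); the variable names are made up for illustration.

```python
# Illustrative arithmetic only: how many translation lines are needed
# for a 32-bit address space, per address versus per 4 KB page.
ADDRESS_BITS = 32
PAGE_SIZE = 4 * 1024  # 4 KB pages, as stated in the talk

address_space = 1 << ADDRESS_BITS               # 4 GB of addresses
per_address_entries = address_space             # one line per address
per_page_entries = address_space // PAGE_SIZE   # one line per page

print(per_address_entries)  # about four billion lines
print(per_page_entries)     # about one million lines
```

Translating per page divides the number of lines by the page size, which is what makes a page table (and a cache of it) practical.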
So it's pretty handy, and this is an on-chip cache, so it's really fast to access.

Okay, and then, in theory, when you are accessing the TLB and it doesn't contain the information you're looking for, because as a cache it's just a subset, it's pretty small, you should have a way to then access the real page table in the slow DRAM to get the information and refill the TLB. Usually hardware does this: the TLB directly goes to main memory and does the refill. But here, in the design which was chosen, only to be easier to implement and for simplicity, we're not doing this. The TLB won't fetch from the page table directly. There is no hardware page tree walker, so we need the help of the operating system here.

In fact, if the information is not in the TLB, it will trigger an exception, trapping to the operating system, and the operating system will have to fix things up: go read the page table, find a way to do it even if the MMU is off, etc., and then, once the information is there, refill the TLB, so update it to put the mapping inside, and then resume what was going on. It's a bit like what most MIPSes are doing, or even PowerPC Book E. It's not the most efficient way, but it's easier to implement like this. So the TLB is entirely managed by software.

Okay, so as I said, the features of this MMU: it only uses four-kilobyte pages. It's not configurable. You cannot say, okay, I want a one-megabyte page size.

So, quick question. I've got a 32-bit physical address. If the page size is four kilobytes, how many bits of the address indicate the offset within a given page?

Well, depends what you mean by this.

Yeah, if you do a byte access you can... yeah, to indicate an offset within a page. So it's the lowest bits.

There's alignment when you access DRAM.

Yeah, there is byte addressing inside LatticeMico32, and you can be unaligned, just for the byte addressing; it can be anywhere.
So here it's the usual case, there's no trick in the question. So yeah, as usual, it's 12 bits for the 4K. So I've got the page number on the left, which is 20 bits, and the offset within the page, which is 12 bits.

There are two TLBs, so two sorts of caches for the translation, one for instructions and one for data. They are pretty big, but that's to cope with the slowness of having to do it in software each time information is missing. We want the TLB to be really big so that most of the time, if we're lucky, the information is not missing and it's in the TLB. So there are 1024 entries, or ten bits to index them. That will be useful information for later. And as I said, there is no hardware page tree walker, so it's software assisted.

So okay, we have something a bit like this. We feed in a virtual address, we say whether it's a load or a store, an instruction or data access, and then the MMU will answer: okay, this is the physical address, and okay, I grant you the access, for instance, or I deny it. But here, in this case, since it's software assisted, we need a way to say "I don't know", because maybe the MMU doesn't know the answer. And that's where the software, the operating system part, kicks in.

So okay, now let's have a look inside, at how the TLB works. I'll just walk you through it by translating a virtual address. Let's take for instance this virtual address, 0xA0001004. I've only shown the first three lines of the TLB. The line index is just handy for the talk, it's not actually information stored inside the TLB. The TLB only contains a tag, a physical page number, a read-only bit and a valid bit.

So how does it work? First we split our virtual address into the page number and the offset in the page. The offset is not really useful for now, so we're going to put it aside. Okay, so we've got our virtual page number. That's the thing we are going to translate, so we need to process 0xA0001. How do we do it? First we
write it in binary. And before, I said the TLB is indexed by ten bits. So we take the ten lower bits of the virtual page number, and those will index the TLB. That will choose the line that is interesting to us. Here it's one, so we will choose line number one. That's where the information maybe is.

So then, okay, the valid bit is one, so it's at least valid information. Then we've also got the information that it's a read-only mapping, so if the access is a write it will be denied, for instance. Then we've got the physical page number. Okay, that could be our answer. And then we've got this weird information, the tag.

Why am I saying that maybe it's the information we're looking for? Because you saw that I only took the ten lowest bits of the virtual page number to address this TLB. So whatever the value of the upper bits here, I would have chosen the same line anyway, line number one. So we have a kind of fight for this line. It could be a translation for a lot of virtual addresses. So we need to check that it's really the one we're looking for, and to do this we're using the tag information here, those ten bits. So we take the value 0x280, I translate it to binary, and we compare: we are doing the tag check. And okay, we see that it's the same. That's how we know that it's really the information we were looking for, and then the physical page number column really contains our result. So the physical page number is [unclear] triple zero one, and then to get the entire physical address we append the page offset.

So, okay, we've done the translation. That's basically how it works at the moment in the LatticeMico32 CPU plus MMU, except that now I needed to add an address space ID to the TLB to make it work. But that's basically it. Okay, enough for the hardware part.
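The lookup just walked through can be modelled in a few lines of Python. This is only a sketch of a direct-mapped, 1024-entry TLB with 4 KB pages as described in the talk, not the actual hardware or kernel code. The class and exception names are invented for illustration, and the physical page number `0x12345` is a made-up value, since the one on the slide is not recoverable from the recording.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12               # 4 KB pages: 12 offset bits
INDEX_BITS = 10               # 1024 TLB lines: 10 index bits
TLB_ENTRIES = 1 << INDEX_BITS


@dataclass
class TlbEntry:               # hypothetical model of one TLB line
    tag: int = 0
    ppn: int = 0              # physical page number
    read_only: bool = False
    valid: bool = False


class TlbMiss(Exception):
    """The MMU 'does not know': software has to refill the TLB."""


class AccessDenied(Exception):
    """The mapping exists but does not allow this access."""


def split(vaddr):
    """Split a 32-bit virtual address into (index, tag, offset)."""
    vpn = vaddr >> PAGE_SHIFT                 # 20-bit virtual page number
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)  # 12-bit offset in the page
    index = vpn & (TLB_ENTRIES - 1)           # low 10 bits select the line
    tag = vpn >> INDEX_BITS                   # high 10 bits are the tag
    return index, tag, offset


def translate(tlb, vaddr, is_write=False):
    index, tag, offset = split(vaddr)
    entry = tlb[index]
    if not entry.valid or entry.tag != tag:   # tag check failed: miss
        raise TlbMiss(hex(vaddr))
    if is_write and entry.read_only:
        raise AccessDenied(hex(vaddr))
    return (entry.ppn << PAGE_SHIFT) | offset  # append the page offset


tlb = [TlbEntry() for _ in range(TLB_ENTRIES)]
# Line 1 holds a read-only mapping with tag 0x280 (ppn is hypothetical).
tlb[1] = TlbEntry(tag=0x280, ppn=0x12345, read_only=True, valid=True)

index, tag, offset = split(0xA0001004)
print(index, hex(tag), hex(offset))
print(hex(translate(tlb, 0xA0001004)))
```

With the address from the talk, `split` yields line 1 and tag 0x280, matching the walk-through. An address such as 0x00001004 selects the same line but carries a different tag, so it raises `TlbMiss`, which is the "fight for the line" the tag check resolves.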
Now, just a report on how I managed to progress toward actually running the NetBSD kernel.

First, a very cool thing that I enjoyed is that everything is cross-compilation. I could work, for instance, on my MacBook under macOS, and there was no issue with it. I could just run this command and it would generate for me a cross-compilation toolchain that runs on macOS but targets this LatticeMico32 architecture, and it was really handy. build.sh is doing it for me. It's pretty awesome, I was really surprised. I had to hack some makefiles here and there to get it to work, because the architecture was not supported in the tree yet, but it was really useful.

Okay, an issue I had was that the kernel is not linked against libgcc. I had a lot of missing symbols when, at the end, I was trying to link my kernel, and I was wondering why. In fact, okay, it's not linked with libgcc. Every time there was a multiplication, division or modulus operation, instead of doing it inline, the compiler was inserting calls to those utility functions, and they were not linked in. Instead, the kernel is linked with libkern, and I learned that I had to go to sys/lib/libkern, add the directory for my architecture and put the utility functions for the arithmetic operations there.

Okay, then my first goal was to at least get a binary image, to get it to link even if it didn't work, to first get some binary to try to run and then try to debug it. So I just tried to fill the include and configuration directories by copying a bit
what exists in the other architectures, trying to understand how it works. Sometimes, at first, I didn't understand exactly what I was supposed to put, so I kind of copied, and then for all the missing symbols I just put stubs, because I really wanted to be able to get this ELF image, try to run it, and then debug all the issues one by one and write the missing pieces. So first stub everything and run this really simple command. Then in the end, when I got the ELF, I could try to really implement something.

The first thing I needed was to debug, so I needed to print something on the console. So I did a very basic console driver, only for early prints, and it's really not difficult to do. You just declare a struct where you put your callback functions for reading a character from and writing a character to the UART, and then it will be used later on.

Then I needed to implement, of course, the exception handlers. The first one is the reset handler, which is executed first in this case, and then there is the code executed when you've got an IRQ or a TLB miss exception or this kind of stuff.

Then, at startup, it calls the Milkymist startup C code. It initializes the console driver by using the previously written structure, if you remember, this Milkymist console structure. And it's pretty easy: you just assign to cn_tab a pointer to this structure, and then the whole system knows that when you do prints, by dereferencing a whole bunch of pointers, it will end up in your structure and it will call your print function. So just by assigning this structure, okay, you can do prints and you can do easy debugging. That was convenient.

Then, pretty early in this Milkymist startup function, you need to initialize the virtual memory subsystem. You do this by calling the machine-dependent pmap_bootstrap function. That's basically where you register the physical RAM which is available in the system.
You say to UVM: okay, I've got that many pages, it starts here and it ends there, and okay, you can deal with it and you can do allocations from this pool.

Then, when those very simple initializations are done, you can call the main function, which is machine independent. It's basically a very long list of subsystem initialization calls. So then you're in the usual NetBSD kernel.

But then you need to implement a few things, like pmap. The pmap subsystem, as I said, is kind of the virtual memory system. It's not so straightforward to implement. But the really good news, and the really good surprise I had, is that I didn't have to do it, and that's pretty cool. The less code you write, the fewer bugs you have. Since LatticeMico32 has a software-managed TLB, and there is already code for that, thanks to Matt Thomas, I only had to use the files in sys/uvm/pmap, which are already used for PowerPC Book E, and okay, the pmap subsystem was done. I only had to do the first function, which is just pmap_bootstrap, but all the rest has been working fine so far. So that was a good surprise.

Then you've got things like copyin and copyout, which are basically taking data from a user space process and copying it to kernel memory, and the other way around.

Then you've got to implement atomic operations. There is no atomic instruction in LatticeMico32, and by reading the kernel code of NetBSD I learned about this technique, the restartable atomic sequence, for implementing the basic compare-and-swap, which is the core of the implementation of mutexes, spin locks and all this kind of stuff. It works a bit like this. That's the actual code I have implemented for the restartable atomic sequence. The interesting thing is that, since there is no atomic operation in LatticeMico32, it obviously takes five assembly instructions here, so it's really not atomic at all. So how does it work?
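As a rough picture of what "restartable" means here, the following Python sketch simulates a compare-and-swap split into separate steps, together with an interrupt return path that rewinds the program counter to the start of the sequence when the interrupt landed inside it. It is a toy model, not the speaker's assembly: the step numbering, the three-step sequence (the real one is five instructions) and all names are assumptions made for illustration.

```python
RAS_START = 0   # first step of the sequence (hypothetical numbering)
RAS_END = 3     # one past the last step


def fixup_return_pc(pc):
    """Interrupt return path: rewind if we were inside the sequence."""
    if RAS_START <= pc < RAS_END:
        return RAS_START
    return pc


def compare_and_swap(mem, addr, expected, new,
                     interrupt_at=None, on_interrupt=None):
    """Non-atomic CAS made safe by restarting; returns the value read."""
    pc = RAS_START
    old = None
    while pc < RAS_END:
        if pc == interrupt_at:
            interrupt_at = None           # the interrupt fires only once
            on_interrupt(mem)             # other code runs, may touch mem
            pc = fixup_return_pc(pc)      # resume at the start instead
            continue
        if pc == 0:                       # step 0: load the current value
            old = mem[addr]
        elif pc == 1:                     # step 1: compare, bail on mismatch
            if old != expected:
                return old
        elif pc == 2:                     # step 2: store the new value
            mem[addr] = new
        pc += 1
    return old


# Uninterrupted: the swap succeeds.
mem = {0x100: 0}
print(compare_and_swap(mem, 0x100, expected=0, new=1), mem[0x100])

# Interrupted just before the store by code that changes the word:
# the sequence restarts, re-reads the value and correctly fails.
mem = {0x100: 0}
result = compare_and_swap(mem, 0x100, expected=0, new=1, interrupt_at=2,
                          on_interrupt=lambda m: m.__setitem__(0x100, 7))
print(result, mem[0x100])
```

Without the rewind, the second call would store 1 over the 7 written during the interrupt, even though the word no longer held the expected value.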
If I'm ever interrupted, for instance by an exception or, most likely, an IRQ, in the middle of this supposedly atomic operation, then in the return path, when I return to what I was doing, so in this case this atomic operation, I check whether the return PC is between the CAS RAS start symbol and the CAS RAS end symbol. If it is, instead of just returning to the PC where I was executing, so for instance returning to the store word (r1+0), r3, I choose to return to the CAS RAS start. So I rewind, I go to the start and restart again. I found it a pretty cool trick and I didn't know about it. So that's what I use.

Then you need to add support for interrupts, to handle them, but also to let the drivers register their interrupt callbacks. So you need to write a bit of code for this.

And then you need a running system clock to let the system schedule LWPs, threads, stuff like that. On the Milkymist system there are two timers so far, and I'm using one of the two timers to do the system ticks. And there is a global symbol you need to implement. It's called cpu_initclocks and
It's called CPU in it clocks and Its main goal is just to set up this so to active to initialize the timer to activate it to register the clock IRQ hundred and and do this kind of stuff and Then another thing is that your clock IRQ hundred needs to call a machine independent Symbol which is heart clock and that's how it plugs all together to to go from the machine dependent code To then the rest of the system to do the time accounting and the scheduling stuff So you need to plug to the heart clock system Then there is a whole bunch of all the functions that are not detailed because it will be really too long But you need to implement CPU switch to Which is basically the machine dependent function to switch from one LWP to another so LWP is lightweight process if I'm missing and Then a k copy to copy data Setful to save current context and the fork operation to create process and Really cool stuff I enjoyed about the NetBSD environment and ecosystem is that the small nine in the parenthesis Is that for kernel code function you've got man pages and you don't have this on all operating systems So I found it really cool I wanted to know how to implement CPU switch to what it was supposed to do. I don't have any clue I could okay read the other architecture But I I just had to type man and CPU switch to and I had my answers and a bit of how it should work So that that was pretty Cool So then other function I had to implement so SPL is to block interrupts So that in some critical sections, you don't want to be bothered by any interrupted so you can block them and then re-enable them afterwards and Okay, CPU start over basically a lot of machine dependent function. 
Okay, then I wanted to try to boot user space once the kernel was fully booting. To do it I had to create a dummy RAM disk containing only one binary, which is init. Then I built the kernel with the MFS option, which is the memory file system, which allows you to embed a RAM disk inside your kernel image. So basically a file system which will be in RAM; you insert it in the kernel and then you try to boot it. That's the approach I had.

And that's basically where the progress of the project is right now. The kernel boots, and it's running init, which is a really small, handcrafted, statically linked, crappy init just printing a hello world, but I find it cool. It at least runs.

So, okay, time for the demo. As I'm saying it runs, let's see if it's true. Okay, so let's just compile it for the fun. I just touch one file, and then it's pretty easy to compile: you just use build.sh and it should just compile all of the bugs.

Okay, so then you've got the kernel here. Then you can use QEMU, which is really handy for debugging, to run this kernel. So you say you want to run the Milkymist system-on-chip and you select the... maybe it's a bit small. Okay, is it better like this? Yeah, okay. So you select the Milkymist system-on-chip, you say you want to use LM32 with full divider and multiplier and with MMU, you don't need graphics, and you select where the kernel is. So I select this kernel, and I'm going to add GDB. Okay, let's run it with GDB. That's really handy, because that's basically how I debugged all the problems I had: there is a GDB server inside QEMU. You can just attach to it and look at all the memory, the registers, etc. So let's attach.

Okay, what... okay, I forgot to do something.
I just built the kernel, but I didn't put the RAM file system inside. So if I just hit continue it should just... okay, so you see at least the kernel boots.

I've put in a lot of debugging prints, but basically at the start it does the pmap bootstrap, and using uvm_page_physload it registers the available RAM. So you see the RAM size, you see it registered a bunch of pages in the subsystem. Then it initializes the pmap module, etc., and then you call main. I've put a few prints in main to show what main is doing. Basically it initializes all the subsystems and then it tries to initialize all the drivers. For now there are very few drivers: only the timer, then the UART, and the clock, which is the other timer and which is doing the clock ticks. Then you turn on interrupts, and then it tries to find a root file system, which is... ah, it's not managing to find a root file system, so it's not booting.

So now let's try to put in a RAM disk. Here I'm creating a dummy file and I'm putting in the /dev/console character device, I'm creating an FFS, I'm compiling my init, a statically linked init, into the RAM file system, and then, using mdsetimage, I'm embedding the RAM file system inside the kernel. So now it should boot a little bit further. Let's run it again.

First let's see what's inside. Okay, so you've got the main of our init, in C. That's basically what init is doing. It's really almost nothing. It's just opening /dev/console for stdin, stdout and stderr, and it's writing a hello world. So for now it's really a simple init. And it's statically linked, so all the write and read and open functions are just inline assembly inside the init. So that's what we are going to try to run. It should print hello world.

There is a bug where it crashes after init, so to prevent this I'm going to put a breakpoint on the last instruction of main, just for the beauty of not having the crash printed.
So this is a virtual address. And now if I show QEMU and I do continue... okay, so what's more is that it found a root file system of type FFS, which was in the RAM file system, then there is a bunch of warnings, and then it's loading init and running it, and you've got the hello. That's it. And then, if you really want to see it crash, you delete the breakpoint and you do continue, and then you've got it. Okay.

So that's just the memory layout, which I think is the pretty usual one: three gigabytes of virtual memory for user space and one gigabyte for kernel space. On the Milkymist board there are only 128 megabytes of RAM. The interesting thing here is the RAM window. I will explain a bit what it is. Basically, all the physical RAM has been mapped into this RAM window, which is in the kernel address space. So, okay, physical RAM starts at physical address 0x40000000, and the beginning of the kernel virtual memory, so from 0xC0000000 to 0xC8000000, is in fact a direct mapping of the physical RAM. So by using these virtual addresses you can basically access all the physical RAM, and that's really handy for some things.

For instance, here is how I'm managing the page table. It's again the pretty standard, usual way. You've got a page directory and a page table. If you want to see where this virtual address goes, you decompose it: you split it into 10 bits, 10 bits and 12 bits. You take the top 10 bits, which are three, so you index and you use the third row in the page directory. It gives you the address of the page table. You see C4, it's the same idea. And then you take the following 10 bits, and okay, that's one.
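The two pieces of arithmetic just described, the direct-mapped RAM window and the 10/10/12 split of a virtual address, can be sketched in Python as follows. The base addresses and the 128 MB size come from the talk; the function names are invented for illustration, and 0xC0001004 is only an example address, not necessarily the one shown on the slide.

```python
PHYS_RAM_BASE = 0x40000000        # where physical RAM starts
RAM_WINDOW_BASE = 0xC0000000      # start of the kernel's RAM window
RAM_SIZE = 128 * 1024 * 1024      # 128 MB on the board


def window_to_phys(vaddr):
    """Kernel RAM-window virtual address -> physical address."""
    if not RAM_WINDOW_BASE <= vaddr < RAM_WINDOW_BASE + RAM_SIZE:
        raise ValueError("address is outside the RAM window")
    return vaddr - RAM_WINDOW_BASE + PHYS_RAM_BASE


def phys_to_window(paddr):
    """Physical RAM address -> kernel RAM-window virtual address."""
    if not PHYS_RAM_BASE <= paddr < PHYS_RAM_BASE + RAM_SIZE:
        raise ValueError("address is outside physical RAM")
    return paddr - PHYS_RAM_BASE + RAM_WINDOW_BASE


def split_vaddr(vaddr):
    """Split into (page directory index, page table index, offset)."""
    dir_index = (vaddr >> 22) & 0x3FF    # top 10 bits
    table_index = (vaddr >> 12) & 0x3FF  # following 10 bits
    offset = vaddr & 0xFFF               # low 12 bits
    return dir_index, table_index, offset


print(hex(RAM_WINDOW_BASE + RAM_SIZE))          # end of the window
print(hex(window_to_phys(0xC4000000)))
print([hex(x) for x in split_vaddr(0xC0001004)])
```

Because the window is a plain offset mapping, converting between the two address spaces is a single addition or subtraction, with no table lookup needed.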
So you take the second line, and this will give you the physical address, and again you append the offset.

Okay, this is a data structure in the kernel to manage and remember all the mappings, all the virtual-to-physical mappings. But the kernel is running with the MMU on, so it cannot access physical addresses, it can't work. So, okay, you think: I will put only virtual addresses in this structure. I put C4 to point to my page table and I'm okay. But I also need to be able to access this data structure from the TLB miss handler, and on LatticeMico32 that runs with the MMU off, so I cannot dereference this kind of pointer. And that's where the RAM window kicks in, because since it's a direct mapping, it's really easy to translate from one to the other, just by doing a subtraction or an addition. So I chose to put virtual addresses inside the page table and everything, so for the kernel it's okay, but when I'm in the exception handler I just do this computation. So I can walk this page table from the exception handler with no issue, because I know how to translate. That was the interesting part about this.

Okay, it's pretty much done. So if you want to follow the progress of this port: it's really just the beginning, since it's just loading a small init, there is no libc port yet, and it doesn't run a bunch of user space stuff. This is a static web page giving the progress, but I also try to update the wiki page on the EdgeBSD website, which shows you everything that's working, what's not working, and how you can bootstrap the whole thing: get the correct git repository, compile the cross toolchain, compile GDB, compile the special QEMU port, etc. So everything is explained so you can directly start developing on it.

Then, I'm using git, basically, so all my code is on GitHub. But I imported all the NetBSD source without history.
So it's not an awesome way of working. A better thing is the EdgeBSD git repository, which has all the history and which is based on your mirror, and there is a short URL for this.

So okay, thank you for listening, and also thanks to all those people, and more that I forgot, who really helped me a lot in this not so easy task, which was really cool and full of learning. Thank you. Any questions?

I think your TLB is very large compared to others. Did you experiment with a smaller one to see how the performance is affected?

No, I didn't try, actually. That's a good question. Yeah, it's pretty large, but what's cool with an FPGA is that I get it mostly for free. When you're implementing this kind of cache, you're basically implementing SRAM inside the FPGA, and you usually have special hardware blocks to implement that SRAM, and usually you're not using them anyway. When you do logic, you usually do multiplexers etc., and you don't use the SRAM blocks, so they sit there unused. So you can use them anyway and you can get bigger caches. And yeah, it's cool. Other questions?

Could you explain the process of synthesizing the CPU core and the whole build process from scratch, from the Verilog or VHDL source to building the kernel and a bitstream? Is it possible to build a whole soft CPU with your MMU extensions using only free tools,
for example without Xilinx ISE or other proprietary Verilog synthesizers?

Okay. Well, indeed, the FPGA world is a bit sad, because there is no full A-to-Z toolchain which is open source. So if you want to do real FPGA development, you are forced to use a closed source, vendor-dependent toolchain, like Xilinx ISE or the Altera one, etc. So for now you cannot get all the way to the bitstream file with only an open source toolchain. But there is a bit of work in this direction, and actually you can do the synthesis, so only the first part of the work, because you need to synthesize, then you need to map to the block technology, and then you need to do the place and route, etc. So only the first step of the pipeline can be done with open source. You can synthesize the LatticeMico32, or even the whole Milkymist system-on-chip, using Yosys, which is an open source synthesizer.

Thank you. And another question: is it possible to run that CPU in simulation, in for example Icarus Verilog?

Yes, it does work, and it helped me a lot to debug the MMU when I implemented it. I used several simulators, Icarus Verilog among them. And okay, it's also simulated by QEMU, as you could see, but that's not at the gate level. But yeah, you can run it with Icarus.

Okay. Thank you.

So how much work was it, doing the MMU implementation?

Well, in theory it's pretty straightforward to design an MMU, since it's just a cache with a bunch of logic around it. But in practice, since I was really a beginner, it was kind of my first FPGA project, it took a year to implement, test, debug, etc. In theory it's not that complex, but I was doing this project to learn, so I had my own learning curve.

Other questions?

Okay. How much time did it take you to do the NetBSD port?

Well, as I said, I'm only doing this as a hobby project.
So it's in my spare time, and it took really a hell of a long time. Maybe I'm not really good, but yeah, I started at the beginning of 2013, I think, so it's one year and a half. Yeah, it took a long time. It's just one hour here, one hour there, at night or in the subway, and it's really slow to work like this. It's not eight straight hours a day. It's easier to work when you've got two or three hours straight, because otherwise you've got to remember what you did before, etc. It took a long time.

More questions? Okay. Thank you.