Thank you everyone for coming. Today I'm going to tell you a story. Every year people want more and more powerful machines. Ten years ago, building a more powerful processor was quite easy: you just went up with the frequency of the core. That's it. But today that's no longer possible; we have hit limits at the level of the technology. We cannot do that anymore. So what people started to do is put two, then four, then eight cores in one processor. And it doesn't seem that we are going to find another solution, so it seems we are going to keep going up with the number of cores. So today I'm going to tell you about the port of Linux to an architecture with really, really many cores.

Before I start: my name is Marta Rybczynska. I have been working in the embedded area for something like 10 years. I come from a networking and security background, but currently I'm mostly doing kernel-level work. The work I will be presenting was done in part by me and in part by my friends: my friends at the operating system level, at the driver level, on the compiler and the debugger, and also my hardware friends doing the processor itself.

At the beginning, I'm going to introduce you to the architecture. Then I will move to the Linux port itself, first at the level of the core, so just the support for the core itself, and then the peripherals. Debugging is pretty important, especially when you are running multicore, so I will show you our debugging methods too. And of course all the future plans for the port.

So let's move to the architecture. The chip is called MPPA256 because it has 256 cores that are available to the user. It has some more, and all of those cores use the same instruction set, so it's completely homogeneous. The processor has many interesting features: it has low power consumption, it has nice performance, and it's very predictable. It also has a lot of external interfaces. And what is even nicer, it exists. This board is a pretty big one, because we wanted to bring out nearly all of the interfaces: we have the PCI, we have the Ethernet, you have all the other stuff out there.

Now, here is the floor plan of the processor. You can see that there is a big block in the middle; we call it the matrix. The processor has 16 so-called compute clusters, groups of processors, in the middle, and it has four smaller IO clusters at the periphery of the chip. Everything is connected by a network on chip. Going even deeper: one compute cluster, so one of those in the middle, has 16 processors for the user, plus one system core. The 16 processors share memory. You also have, of course, the interface to the network on chip, and you have some debug capabilities. In the IO cluster, you have a quad core, a group of four cores, that also has some on-chip memory, and it also has access to all the peripherals of the chip.

Now, this is a schematic of the processor pipeline. If you do not understand everything, it is not really that important. The first important thing you can see is that it's the same for all the cores on the chip. It is a core that was designed by us; it's our custom design. It's a five-issue very long instruction word (VLIW) core. That means it can execute, in this case, five instructions in one cycle. Maybe I will go through the blocks. It can branch, of course. It has two arithmetic units, a multiplication unit, and a load-store unit. In fact, some of the operations can be done by multiple blocks. It has many instructions you would expect from a DSP, the MAC-type operations, for example. It has a floating point unit. It has some specific instructions to operate at the bit level: shifts, permutations, things like that. It has hardware loops, that is, loops that do not have any branch penalty when you are running them. Every core has an MMU, and you also have control over the power levels of the core.
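To give an idea of what those DSP-style features are for, here is the kind of C loop they target. This is just an illustrative sketch I am showing, not code from our libraries; whether the compiler actually emits a MAC instruction and a hardware loop here depends on the toolchain and its options.

```c
/* A classic DSP kernel: a fixed-point dot product.  On a core with
 * MAC instructions and zero-overhead hardware loops, the body can
 * compile down to one multiply-accumulate per iteration, with no
 * branch penalty at the end of the loop. */
static long dot_product(const short *a, const short *b, int n)
{
    long acc = 0;

    for (int i = 0; i < n; i++)
        acc += (long)a[i] * b[i];   /* candidate for a MAC instruction */

    return acc;
}
```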
Of course, hardware is nice, but we want some software to run on the hardware. What we are currently delivering to people is the standard toolchain you are pretty well used to: we have the GCC port, we have binutils and newlib, and we have a GDB port as well. We have our own simulator that is pretty accurate; we'll see that later. We have a hardware system trace mechanism. As for operating systems, what people are currently using is a custom runtime running on the compute clusters; think of the OSv that was presented yesterday, it's similar in many areas. We are currently running an RTEMS port on the IO cluster; RTEMS is a real-time open source operating system. We also have drivers, especially the PCI driver.

We have many different programming levels, but I'm not really going to talk about all of them, just about one, because it will be the easiest to understand for the Linux people. At this programming level, each cluster is represented as, and is visible like, a process. Your program can start processes, up to 16 of them, and each process starts on a compute cluster. Within a cluster you can do threading, and each thread runs on a separate core. We offer a communication mechanism between the clusters in the form of an IPC mechanism. At the threading level, you can either do pthreads yourself, or you can be helped by the compiler if you use OpenMP.
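To make that concrete, here is a minimal sketch of the threading side, using only standard OpenMP. The MPPA-specific parts, spawning the per-cluster processes and the IPC between them, go through our own runtime libraries and are not shown here.

```c
#include <stdio.h>
#include <omp.h>

/* Inside one compute cluster, the process can run up to 16 threads,
 * and each thread runs on its own core.  OpenMP is one way to get
 * there without writing the pthread code by hand. */
int main(void)
{
    #pragma omp parallel num_threads(16)
    {
        /* Each of the 16 threads would land on a separate core. */
        printf("hello from thread %d\n", omp_get_thread_num());
    }
    return 0;
}
```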
I think I do not have to tell you at this conference why it's nice to have Linux running on a chip. In this particular case, Linux has very nice and performant SMP support. It has device drivers for all the peripherals we may connect to the chip. It has a very nice and well-performing network stack. And of course, people have all the legacy applications and user libraries they already use.

So now let's move to the Linux port. Here I've put the kernel versions we have used. As you can see, it's a pretty recent port. It started on 2.6.32 and .33. Then we moved quite rapidly through the following releases, and then to 3.2. At 3.2, we added SMP support and the first specific drivers. In fact, the moment when we had 3.2 with SMP was the moment when the chip came back from the fab, and it ran for the first time on the real chip, on the real processor. That's pretty good. Then we moved to 3.10; that's the version we are currently using. At 3.10, we added the tracing support and the MMU. So our kernel can run on a single core, or you can run it in SMP mode. We have been using device trees from the beginning, and generic headers where possible. That was pretty easy for us, because the port was started when all this infrastructure was already present, so we used it. We have user space with the PIC model, and we have some traces. Of course, we do have drivers: the general ones, the generic simple things, and some specific ones, the PCI for example. The user space we are currently building with Buildroot, with our own toolchain and uClibc.

The core is pretty easy to port Linux to, in fact. That's why I'm not really going to tell you the details of the port; it was pretty straightforward. There are no modifications needed in generic Linux to make it run; we just had to fill in the arch/k1 directory in this case. What I am going to tell you about instead are some issues we ran into, and some things we worked on that I find interesting.

At some point during the SMP port, the locking started acting up. The ticket spinlock was showing that some corruption was going on; the value was not coherent. What we do when something like that happens: the first thing is that, of course, there is a bug somewhere. So the first idea was: okay, we have a lock ordering problem in our platform code. We looked at the locks that are in our port, but nothing. Then, for the first time, we enabled the spinlock debugging; we didn't have that enabled before, so that was a very good moment to add all the things necessary. We enabled the spinlock debugging, did the work necessary, and nothing. Everything was right. So the situation looked not that good. Then we ran it on our simulator, and fortunately the thing was also corrupting on the simulator. So we took the simulation trace, which was several gigabytes in size, and started analyzing what was going on. And the bug, in fact, was very simple: the exchange function should have been returning the old value, and it was returning the new one. It had worked like that for a very long time. One line fixed, and it works now.

Another example: the atomic operations. Atomic operations are used in many places to do things you need to synchronize. If you are on a single core, it's pretty easy: you enable and disable the interrupts, and that's pretty cheap on our core. So there were absolutely no problems with that. Then we moved to SMP. On SMP, you have to use the processor's support; in our case, that is a compare-and-swap operation. When you start using it, you see that there is an impact on the code size and the speed of the kernel. The impact on the code size is pretty visible, because the spinlocks and all the other things that use atomic operations are all around the place. So it turned out that optimizing the atomic code is a very important thing, and we have done multiple iterations of this code. The current implementation was designed together with the hardware people to use the specific features of the cache and of the write buffer. And as we learned many things, we have also improved the design of the atomic operations for the next version of the chip.

As an idea, this is how an atomic operation looks on our architecture. As you can see, there is no inline assembly in there. There are two macros to do the visibility operations; in this case, there is an atomic store. And why do we not use inline assembly? We avoid inline assembly whenever we can, because our processor is a very long instruction word machine. You have an example of assembly code for our core here. You can see clearly that there are groups of instructions, and each of those groups executes in a single cycle. So it's pretty important to get the groups right in the code. Our GCC compiler handles that pretty well. You can also do it manually if you want; for short code, that is something you can do easily. But if the code gets more complicated, or if you are going to change your code later, you need to handle the dependencies between the things that can be done in one cycle, and it becomes more complicated. As a result, we prefer the builtins in our port, because inline assembly is emitted as a single bundle, so we are potentially losing three instructions we could execute in the same cycle. As a piece of trivia, our port probably has the least assembly of all the ports in the Linux tree.
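As an illustration of the builtin approach, here is a rough sketch of how an atomic operation can be written without inline assembly, using the compiler's compare-and-swap builtin. This is my simplified version, not the actual code from our port; the real implementation additionally wraps the memory-visibility details in the two macros I mentioned.

```c
/* Sketch: an atomic add built on GCC's compare-and-swap builtin.
 * Because there is no inline assembly, the VLIW compiler remains
 * free to bundle the surrounding instructions into the same cycles. */
static inline int atomic_add_return_sketch(int i, volatile int *v)
{
    int old, newval;

    do {
        old = *v;           /* the real port adds visibility macros here */
        newval = old + i;
    } while (__sync_val_compare_and_swap(v, old, newval) != old);

    return newval;
}
```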
So now let's move to the peripherals. The first peripheral I would like to tell you about is the interrupt controller. In the case of the IO cluster, you have two levels: you have one interrupt controller in the core itself, and you have another interrupt controller that is shared between the four cores. Now, how it works: you have four interrupt lines that come from the shared controller to each of the cores, and you have interrupt lines from some of the peripherals that go directly into the core; one such interrupt line is, for example, the timer. What we are trying to do in our port is to deal with as much as we can at the core level, that is, with the interrupt controller that is close to the core, where the operations are fast. But as we have many peripherals, we leave it to the users to decide how they are going to connect the interrupts of the peripherals they have; on the boards they will build in the future, they may want to connect them in many different ways.

Then the memory: how do we access memory? The processors in the IO clusters have access to two memory zones. There is the big DDR, the external memory, and there is a small memory inside the chip. There is a big latency difference between the two, so you can optimize your code very well if you put some of the data or some of the code in the fast memory. To make things a little more complicated, that shared on-chip memory is also visible to the host through the PCI Express interface, so you also have to handle the PCI side of things. The way we have handled all that is that by default we do all allocations in the DDR, and we have another allocator running for the shared memory.

And now the PCI. PCI is our main interface at the moment. We provide the driver for Linux on the host side, and that driver could be the subject of a presentation of its own, because it's a big driver. Why is it big? Because it does really many things. It does the boot. It does the DMA transfers. It implements the protocol between the software running on the host and the software running on the MPPA. And each MPPA in fact has two PCI interfaces, so the driver handles that as well; there may be two interfaces that come from the same physical chip. On the host side, we do not have any problems with the Linux frameworks. The Linux frameworks for PCI are stable and mature; they are high quality, and it's pretty easy to use them. It gets more complicated on the device side, because we do not really have a framework for when Linux is the PCI device. What we do is try to reuse the code between the two drivers, of course, especially in the part where we implement the protocol. We also have a part of the code in the bootloader, because when we boot by PCI, we need that code there too. The PCI interface allows you to map a part of the device memory into the host memory. We use that for the shared memory of the IO clusters, and we also use such a mapping for some peripheral registers.
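As an idea of what the host side of such a mapping looks like with the standard Linux PCI API, here is a hedged probe fragment. The function name, the BAR number, and the region name are assumptions for illustration; this is not our actual driver code.

```c
#include <linux/pci.h>

/* Hypothetical probe fragment: map the BAR that exposes the device's
 * on-chip shared memory into the host's address space. */
static int mppa_probe_sketch(struct pci_dev *pdev,
                             const struct pci_device_id *id)
{
    void __iomem *shmem;
    int ret;

    ret = pci_enable_device(pdev);
    if (ret)
        return ret;

    ret = pci_request_regions(pdev, "mppa");
    if (ret)
        goto err_disable;

    shmem = pci_ioremap_bar(pdev, 0);   /* assuming BAR 0 holds it */
    if (!shmem) {
        ret = -ENOMEM;
        goto err_release;
    }

    /* ... set up the protocol, DMA, interrupts, and so on ... */
    return 0;

err_release:
    pci_release_regions(pdev);
err_disable:
    pci_disable_device(pdev);
    return ret;
}
```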
And the big DDR is accessible from the host by DMA transfers. We do have to deal with the cache effects on both sides, and part of our communication protocol is the notification of who is using which area of memory, to be sure that we are flushing the caches as we should.

Currently, our main method of booting the chip is by PCI. In this case, we load a small piece of code into the shared memory and we start the processor. It runs, it initializes all the peripherals that are necessary for further work, and then it sends a notification over PCI to the host. The host then loads the final image into the DDR and starts the processor again, and, for example, Linux starts on the chip. It doesn't end here, because we also have to boot the other clusters, the compute clusters. The images for those are put into the DDR as well, and it is the IO cluster that does the work of sending the images.
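To summarize the sequence, here is a host-side outline in C. Every helper function here is an invented stand-in for the real driver and bootloader code; only the ordering of the steps reflects what I described.

```c
#include <stddef.h>

/* Hypothetical helpers: stand-ins for the real host driver code. */
void load_to_shared_memory(const void *image, size_t len);
void load_to_ddr(const void *image, size_t len);
void start_mppa(void);
void wait_for_pci_notification(void);

static void boot_mppa_sketch(const void *loader, size_t loader_len,
                             const void *kernel, size_t kernel_len)
{
    load_to_shared_memory(loader, loader_len); /* small first-stage code  */
    start_mppa();                              /* it initializes DDR etc. */
    wait_for_pci_notification();               /* "peripherals are ready" */
    load_to_ddr(kernel, kernel_len);           /* final image, e.g. Linux */
    start_mppa();                              /* restart into the image  */
}
```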
Up to this point, I've told you about the subsystems that are stable and working well; you will be able to see them working today, in something like one hour, I think, at the demo session. We are currently working on some other device drivers. The network-on-chip driver is a very important one, as the network on chip is not only high bandwidth but also the main method of communication between the processors. It is very important, and it is used everywhere. We use it in the kernel to do the boot and in certain drivers, and we also need to let user space do IPC over it, to be able to communicate between the parts of an application. Currently, there is no API in Linux that supports this type of network-on-chip interface. We will have to propose an API for it, so if you are working on something like that, or have similar experience, we will be very happy to hear from you, so that we can end up with an interface that works well for everyone. All the communication goes through the network on chip. The private memory of each cluster is completely configurable in software, and it can be reached from the other clusters, but the access has to pass through the NoC.

I've told you that the NoC is very important; it's also important because some drivers use it. We have up to eight Ethernet interfaces. They are standard: they have a MAC like any other Ethernet device, they have a PHY, and they also use the network on chip to push the data to the rest of the chip. What we are working on is some kind of distributed network stack, where certain protocols and certain applications can use their own connections in different areas of the chip. So we need to redirect the right packets to the right place, and the software has to deal with that.

Now let's move to debugging. We have a number of debugging methods. At the high level, we have hardware traces. Here you have an example of a Linux boot with quite a lot of trace points in it; I was just checking whether the scheduling was working correctly and scheduling what I wanted it to. There was a presentation yesterday about tracing on the MPPA; you can still download the PDF of the presentations if you are more interested in that. We also have support for printk; we cannot live without printk. On your right, on this side, you have a run where printk goes over JTAG; I will show a small sketch of such a console in a moment. On the other side, you have the same moment in GDB. We can run the code in GDB from the first instruction; here, as you can see from the trace, we are in the middle of the kernel. And then you have another run that is pretty similar, but done in our instruction set simulator. The simulator outputs a trace that has the values of all the registers when there are operations on them, and the cycles when they happen. So it is pretty helpful for the low-level debugging if we have to do it. When we are debugging normally, we use all of these methods, mixing them all together.
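As an idea of what sits behind a "printk over JTAG" setup, here is a minimal sketch of the kind of boot console a port can register. jtag_putc() is a hypothetical stand-in for whatever pushes a byte out over the debug link; our actual code is different.

```c
#include <linux/console.h>
#include <linux/init.h>

extern void jtag_putc(char c);   /* hypothetical low-level output */

/* Push every printk character out through the debug link. */
static void jtag_console_write(struct console *con, const char *s,
                               unsigned int n)
{
    while (n--)
        jtag_putc(*s++);
}

static struct console jtag_console = {
    .name  = "jtag",
    .write = jtag_console_write,
    .flags = CON_PRINTBUFFER | CON_BOOT,
    .index = -1,
};

/* Register early, so boot messages are not lost. */
static int __init jtag_console_init(void)
{
    register_console(&jtag_console);
    return 0;
}
console_initcall(jtag_console_init);
```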
So there were certain things that were difficult when doing this port. The first is that the Linux documentation is great, but it's less great when you're porting to a new architecture. There are big blocks of code that do not really have any documentation. There are some great examples: the atomic operations, for instance, really do have very nice documentation describing everything; that was pretty nice. But in many places, we just had to look into the other ports to understand what a specific function does and why, and then implement it in such a way that it would work well for us. As we are a startup doing our own processor, we are also reaching the limits of our own software. So when I'm debugging Linux, it does not always mean that the bug is in the Linux port itself; it may be somewhere else. From the beginning, we have been trying to be mainline compatible, to get included when it becomes possible. That means that, time after time, we are not doing dirty hacks; we are doing proper solutions. One example: we saw one of the other new architectures getting comments for not using device trees when it wanted to be merged, and so we started using device trees. It was just at the beginning, so that was pretty easy to do. It's also pretty hard to split some parts of the work of porting Linux between multiple people, because there are too many dependencies between the things you need to do. And the Linux port takes time to do. We needed some operating system running on our processor to test it; that's why we ported RTEMS first.

But it was not all difficult. I find it quite surprising that the debugging, in fact, wasn't the hardest part. In big part, that's because of the tools we have, at different levels, so I can choose the right tool for each level of a problem. Another very nice thing was the existence of the generic headers; we use them everywhere they are available. I think they are a very good help for people doing new ports, and I really like that. Also, there are some architectures that have been merged recently whose code is less complicated than the code of some other, more ancient architectures, so that was a pretty nice source of examples of how to do things.

What we are currently working on is finishing the support of the drivers for the Linux port. At a minimum, they exist in their RTEMS versions, so we are in fact porting drivers between the operating systems. We need to finish the work on the MMU optimization, and do some optimization of the functions that we can implement nicely with the special instructions of our core; currently we use the generic versions there, and we can gain performance by doing that. We'd also like to do a public release soon, and we are really seriously aiming at mainlining the architecture.

So if you have any questions, you can ask them now. You can send me an email if you are interested, and before we go to the other sessions, I'd like to invite you to the demo session at the demo stands, where you will be able to see it running on real hardware. And I can answer some questions now.

So the question was when you can buy one, and what software you can run on it. If you want to buy one, I think our sales department will be extremely happy; that's pretty serious, you can buy that board now. And if you want to buy 10,000, no problem, I think; it will be no problem for them either. And what kind of software can you run on it? The processors are general purpose; they have all the capabilities. So in fact, you can run nearly everything. What you will be able to see today, really running on the chip as demo applications, is mostly the visual ones, so video processing applications. That's something we can show you running currently, but we also have other examples. For example, we run some networking applications, because we have many network interfaces. So in fact, whatever you imagine. And it can be programmed in C, so there's no problem, really.

As for the core application you were thinking of when making the CPU: I know it's general purpose, but there must have been some particular targets, some application domains you had in mind. Or is it Bitcoin mining or whatever else? So the question was whether video processing is the main target, or whether we run other things too. The video applications have some nice properties: they need a lot of processing power, so they can use the processor pretty well, and we have some experts in the domain who are porting the algorithms to it. We also run some cryptographic operations, so if you are interested in Bitcoin, I know the numbers for the SHA algorithms myself pretty well. And the other things, yes, you can run them too.

So the question was how it compares to CUDA. The applications we run on this processor today run at between 5 and 10 watts for the whole processor. I think you can compare that with a GPU, which takes a little more than that. As for the processing power, you can send me a question and I will check the gigaflops number for you, if you want, because I don't remember it at the moment. 200? Okay, maybe. My colleague says 240 gigaflops. The question was what frequency it runs at: this version runs at 400 MHz.

Are there any areas of the Linux kernel that are being stretched by your architecture, that might need some improvements or changes to work better on your boards and your processors? So the question was: our architecture is pretty unique, so do we have problems with some kernel subsystems? The answer is that at the core level, it's not really that complicated; at the level of the core itself and of the processor groups, there are really not many problems. It gets a little more complicated in the device drivers, for example the API for the network-on-chip interface, which for us is absolutely critical to have at very high performance, because that is what drives the chip. So we cannot compromise on that, and we are well aware that we'll have to propose an API for it, because there is nothing there today. The question was also whether the network stack is adequate. We think that yes.
We are running multiple experiments currently, and if we find something that needs tweaking, we'll definitely be posting patches for that.

So the question was whether it is possible to build a board where this chip runs standalone. And the answer is yes. We are currently booting by PCI because that's how we designed our first testing board. There are other boards coming soon that will boot differently. In general, we can also boot from flash. This processor can run completely standalone, without any external things on the board.

So the question was how we scale it all, because there is a paper showing that the scalability of Linux has its limits at a certain level. The answer is that currently we run Linux on the peripheral side, on the four cores; four cores for Linux, no problem. And in the compute matrix, we run the custom runtime, which runs one thread per core all the time. So we do not have that problem; we are not running a scheduler on it. We have enough cores to give each thread its own processor.

So the question was whether, in this case, you have to do the programming a little bit differently. When you want to write a multicore application that works effectively, you may start from your sequential application, but you will have to find the parallelism in that application. There are some tools that do it automatically, but normally most people do it in a pretty manual way. There are tools that help you with that, but currently I do not know of any tool that is totally, absolutely automatic. So, in fact, you have to think about how your application works to make it work well on multicore. That's also true for x86 with 16 cores, for example. And the comment was that in this case, there are a few more than 16 cores. So I would say that, simply, when I write an application for the MPPA, everything that I think of as a little independent from the other things, I simply put in a thread. And yes, we do have a dataflow programming model that allows you to express a graph of the operations, and then the tool does the parallelism for you.

Okay, say I have 256 cores. If I wrote, under Linux, a program with 256 processes, would I need to do anything extra for it to run on these cores? Or is it the standard Linux programming model? So the question was: if we write a 256-process program, do we have to change the application? Currently, you have to change the IPC communication a little bit. Say they are independent: I make a hello world with 256 processes. So the question was about a hello world with 256 processes. For 256 hello worlds, no changes; you just have to write the program for the IO cluster that boots all of that, but that will be a loop. How many processors does Linux see? It depends: one or four, depending on whether you run the single-core or the SMP configuration. Well, that's exactly what I'm trying to get at: 256 hardware execution threads, not time-sliced by a scheduler. So the question is whether there are 256 hardware execution units for the kernel, and how they map into the normal Linux system.
So Linux is running on the peripherals, and the scheduling for all the rest is in the libraries that deal with the network on chip. The question is whether Linux sees all the other processors. Currently, it does not; we may be thinking about running Linux on the compute nodes as well. So if you do a ps, you won't see them, basically.

So the question was how we load-balance the network traffic. That depends on the programming model you use for your application. In some models it is done automatically: it solves some equations to calculate the balancing, and then you are sure it's right. You can also calculate the traffic yourself, to be sure you have the right levels in every place. Or you can use the default policy, where you simply send. Normally that works quite well, because the network on chip is pretty fast, so normally you won't find problems. And every cluster can communicate with every other cluster, too.

Do we have any other questions? The question was whether it shares something with the ST200. We are based in the Grenoble area; I think that answers your question. Okay, any other questions? I do not see any. So thank you, and I'll see you at the demo session.