Yeah, I just activated my connection. We'll see if the Wi-Fi goes down during the talk. Right, so: how many people have used more than eight cores in a system? Hands up. So everyone's used eight-core servers or workstations, or eight threads. Okay, so my talk is about how you break that limit and have more cores than fit in a single server. I work for a company that develops a custom interconnect for doing this.

So what if you had an application that consumed a lot of memory? What if you had more data than fits in one server's memory capacity, or more data than you can put on the drives in that server? Then you need something bigger. How does the world actually solve this? You develop some custom silicon, which I'll talk about in a minute.

First of all, what do you actually gain from having a really large server with perhaps thousands of cores and just one OS, one Linux instance? Well, it's easier to manage: you only have one system. You also share all the resources, so you can have one RAID array with all the data on your storage. This is done in storage area networks, but you can do it locally. And if you want to run large applications which crunch a lot of data, you simply don't need to modify them. They'll just work with all of the cores; Linux will give the applications more cores, more memory, more disk, and more GPUs as well.

So how do you do this from an architectural point of view? Simply, on each CPU you program the memory routing tables so that memory outside the local server, if you have a rack of servers, goes via one chip. By doing so you get a not infinite, but pretty large, address space. You can also interconnect the servers in such a way that you have one hop between each "island", as it's called, for each server, and a chip which basically acts as a router.
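As a rough sketch of that address-routing idea, here is a minimal C model. It assumes a made-up fixed 64 GiB window of the global address space per server; the real routing tables are programmed by firmware and need not be this regular.

```c
#include <stdint.h>

/* Hypothetical sketch: each server owns one contiguous window of the
 * global physical address space. Accesses that fall outside the local
 * window are claimed by the interconnect chip and routed to the owner.
 * The 64 GiB window size is invented for this example. */
#define NODE_MEM (1ULL << 36)   /* 64 GiB of DRAM per server (assumed) */

static unsigned owner_node(uint64_t phys)
{
    return (unsigned)(phys / NODE_MEM);
}

static int is_remote(uint64_t phys, unsigned local_node)
{
    /* Remote addresses would be forwarded via the interconnect chip. */
    return owner_node(phys) != local_node;
}
```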
Kind of like how IP routing works on the internet. So that's what's needed for these things. You can extend that beyond a simple 2D network into a 3D network like this. You can have a really large cache-coherent system, each of these squares being a server in a rack; in this case 3 by 4 by 4.

Now, to do that you need lots of ports in your cache-coherent interconnect. This is one of our older ones here, with six ports. In each port you put a pair of cables, and you get bidirectional communication in the X, Y and Z axes. This particular chip, which is custom silicon designed for this, does about 1,600 nanoseconds round-trip, which sounds fast but is actually much slower than the CPUs themselves. So there are also scaling issues with how the software operates, and with how you optimize the software to run fast.

Now, these systems, both the Silicon Graphics ones and these Numascale systems, which many of our customers use, all run Linux. Linux is single-handedly the OS of choice; Windows doesn't go beyond, I think, 16 sockets these days, maybe 32.

So how does one add support in Linux for thousands of cores? The x86 architecture has eight bits to address each core; each core has a unique APIC ID for handling interrupts. You can add a software mechanism to extend this and break through that eight-bit barrier: the kernel supports a 32-bit interrupt space. You can then structure that into X, Y and Z axes, with three or four bits per position on each axis, and eight bits for the local motherboard. This shows how interrupts are basically routed from one core to a different motherboard.
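That X/Y/Z encoding can be sketched like this; the field widths are hypothetical (the talk says three or four bits per axis, four is assumed here):

```c
#include <stdint.h>

/* Hypothetical layout of a 32-bit extended APIC ID: 8 bits for the
 * core on the local motherboard, then 4 bits each (assumed) for the
 * Z, Y and X position of the server in the torus. */
#define LOCAL_BITS 8
#define AXIS_BITS  4

static uint32_t pack_apic_id(uint32_t x, uint32_t y, uint32_t z,
                             uint32_t local)
{
    return (x << (LOCAL_BITS + 2 * AXIS_BITS)) |
           (y << (LOCAL_BITS + AXIS_BITS)) |
           (z << LOCAL_BITS) |
           (local & ((1u << LOCAL_BITS) - 1));
}

/* Field extractors, matching the layout above. */
static uint32_t apic_local(uint32_t id) { return id & 0xffu; }
static uint32_t apic_z(uint32_t id) { return (id >> LOCAL_BITS) & 0xfu; }
static uint32_t apic_y(uint32_t id) { return (id >> (LOCAL_BITS + AXIS_BITS)) & 0xfu; }
static uint32_t apic_x(uint32_t id) { return (id >> (LOCAL_BITS + 2 * AXIS_BITS)) & 0xfu; }
```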
So in order to use this new mechanism, which I'll show you later, introduced for using more than 255 cores, you have to have custom hardware that responds to interrupt writes in memory, which can emulate how interrupts work, forward them over this cache-coherent fabric to the remote server, and then generate what is actually an interrupt there. To achieve this you write a kernel driver that maps into the system address map. The local APIC is what's normally used for generating interrupts, and there's an additional area here, let's see, the interrupt controller, which we map underneath that and which decodes to our special silicon chip. So instead of writing to a local APIC, with our specific driver for our chip we generate a write to our interrupt controller, which supports Linux's 32-bit interrupt space, and that's routed and ends up as an interrupt on the remote server.

These areas here, the DRAM areas, are all cache coherent. They're mapped in with the other PCI BARs; I don't show the different PCI BARs, but these are all distributed and mapped across the different servers. So when the firmware is configuring this, booting your rack of servers into one system, it literally just maps all of the DRAM of the different servers into one global address space. Linux supports up to 32 terabytes of coherent memory here and 64 terabytes of non-coherent.

This is the Linux kernel driver which supports our custom silicon for booting more than 255 cores, and in essence it's very, very simple. You declare a struct at the top, which shows what it registers: a function to call when Linux wants to interrupt a different core. This function down here basically looks up the 32-bit Linux APIC ID, the ID of the core, and then looks at which node it's on.
You can ignore this area here; it simply does an APIC write through our chip, writing a 32-bit word into this memory-mapped area, which then generates an interrupt across the fabric to a remote core. So this illustrates just how easy it is to write a kernel driver and how simple it is. There are a few more lines, but this is the core of the code. Interestingly, this area here is an optimization: if you're interrupting a core on the same server, don't go via our chip, but generate a local interrupt instead. That's the normal Linux interrupt mechanism, which writes to this local APIC space here; that's decoded by the processor, and it generates a certain cycle type on the processor interconnect.

Now, if you have a rack of servers, the motherboards have different clock chips, clock crystals, so actual time runs slightly differently on each, maybe by less than 0.01%, but over time it adds up, and of course time isn't synchronized. So you need some mechanism to have a global clock, which the kernel scheduler and applications can use, that isn't going to drift. To do so we have a timer register, and it's mapped here: the global timer versus the local timers. The global timer is simply a counter that increments on all these cache-coherent chips on all the servers, all synchronized, whereas the local clock source on each motherboard is not. And this is the complete driver, apart from some init code and some synchronization: it simply reads from a 64-bit register which increments at 200 MHz, and other parts of the kernel use this read function here, which points to the actual driver function, which is here. Now, the latency of reading this clock is somewhat higher than reading a clock in the CPU itself, but clock accuracy is very, very important, so there's no way to avoid that.
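Both driver paths, the interrupt write and the global timer read, can be sketched in userspace C like this. The register names, the node encoding, and the plain variables standing in for memory-mapped registers are my assumptions for illustration, not the actual driver's.

```c
#include <stdint.h>

/* Userspace sketch of the two driver paths described above. The real
 * driver would map registers on the interconnect chip; here plain
 * variables stand in for them, and all names are hypothetical. */
static volatile uint32_t fabric_intr_reg;  /* stands in for the MMIO doorbell  */
static volatile uint64_t global_timer_reg; /* 64-bit, 200 MHz fabric counter   */
static unsigned local_ipi_count;           /* stands in for the local APIC path */

#define LOCAL_NODE 0u                      /* this server's node number */

static uint32_t node_of(uint32_t apic_id)
{
    return apic_id >> 8;                   /* everything above the local 8 bits */
}

/* Interrupt the core identified by a 32-bit Linux APIC ID. */
static void send_ipi(uint32_t apic_id)
{
    if (node_of(apic_id) == LOCAL_NODE) {
        /* Optimization from the talk: same-server cores are interrupted
         * through the normal local APIC, not through the chip. */
        local_ipi_count++;
    } else {
        /* One 32-bit write to the mapped interrupt controller; the chip
         * routes it over the fabric and raises a real interrupt remotely. */
        fabric_intr_reg = apic_id;
    }
}

/* Clocksource-style read hook: the same counter value is visible on
 * every server, so it never drifts between motherboards. */
static uint64_t global_clock_read(void)
{
    return global_timer_reg;
}
```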
Now, my demo system in Norway is down so I can't show you live, but this is htop running on our seven-server demo system. These green lines here, it's hard to see, but each of them is htop showing the CPU load. I'm running an application here, mg.D.x, which is a computational test, and it's using all the cores. It's written in OpenMP, so it simply uses threads. We get about 34 teraflops from 7 AMD Opteron servers. Down here is what ps looks like, but of course it's a bit small; let's see if we can zoom in. This is htop here, and this is what ps would look like, so you have lots of threads running here. Actually, this shows threads.

Okay, so the hardware supports arbitrary configurations. What you can do is get a load of servers and wire them, I wouldn't say randomly, but in a way which is not the X, Y and Z topology, what's called a torus topology; you can use an arbitrary topology. In this case, for smaller systems like the one on the previous slides, you can use a point-to-point configuration. I've written this nice JavaScript tool where you can vary the number of servers you have and the maximum cable length you actually have, and it'll constrain the actual wiring and then generate a nice configuration file which the firmware will load.

Let me just demonstrate. Okay, so here it is, and it's showing the wiring topology. The red circles here are the servers; the lines between them are the cables that you connect. There are two different representations, and you can highlight a server, so as you're wiring it with cables you can see what goes where. As you adjust the number of servers, it also adjusts the topology, since there are only six ports on the adapter, as you saw earlier. If I show you, yeah, so this has six ports here, so of course you can't go direct point-to-point on everything. Yeah, so this is the adapter. Now, if I go back to this, then, yeah, okay, and
then it also calculates how much it will cost, and the inventory list, and there's a small button here, "generate configuration": you click that and it gives you a config file.

Now you have a bunch of servers; how do you manage them from a central point? Well, you have something which I developed in Singapore, this NumaManager appliance, this black box here, running an image. That gives you a TFTP boot server, takes the configuration file you give it, gives you a list of your servers, and helps you power everything on together. Yeah, so here you can power on everything, or power off, reset, that kind of thing.

So, something else: I went to North Carolina a year and a half ago to deploy a quite large system with 5,000 cores and 21 terabytes of memory using this interconnect. One of the most popular benchmarks for measuring total system performance is called STREAM; it measures memory bandwidth. We ran it and got greater than 10 terabytes per second of DRAM bandwidth from the cores, which was, at that point, the second-highest world record. There were some kernel patches done by some guys at SUSE, Mel Gorman, which reduced the boot time from two hours to 15 minutes, which was a huge improvement. Previously the DRAM would be initialized from the first core: it would memset all of the DRAM and the page tables, and doing that for all of the DRAM across the interconnect was quite slow, so doing part of it in parallel saved a lot of time. I mean, actually, booting these things in a quarter of an hour is already quick compared to a lot of other systems.

The firmware is open source, it's on GitHub, so let's take a quick peek. Sorry, guys. So the firmware is here, and yeah, we have all the source code for actually programming the chip, the registers, all that kind of thing. Everything is here, all nicely written and structured with C++ classes, but then it's all
compiled as C, a better C, so it works nicely and it's easier to manage. And look out for me in the future, with my talk on FPGAs for deep learning.

[Audience question] So the idea of this system is to take a lot of machines and make them look like one machine to the software? Yes, exactly. It actually becomes one machine: Linux finds all the memory, the DRAM, the cores, all the PCI buses. You do lspci and, in this case, you get 216 trees of devices: your Ethernet cards, disks, everything. It's all one machine.

[Audience question] What are the practical uses? What kinds of industries use this? So one of the biggest customers right now is SAP: they have the SAP HANA system for enterprise resource planning and accounting. For Silicon Graphics, half of their sales is selling SAP HANA systems. You also get companies and universities doing research, working with large databases, and you get in-memory databases, so you need lots of local DRAM. And if you use SSDs, the latency of your SSDs versus a storage area network is so much lower, being local.

Okay, thank you. Thanks. Any more questions? Okay, thanks guys. Thank you, Daniel.