So, let's continue. I'm very happy to announce Stijn, who's working with Intel, and he's talking about the programmable unified memory architecture, Puma, and how you do... Okay, thank you. I'm afraid there's going to be another hardware talk, but I'll keep it simple and hope that you appreciate the fact that the hardware is also evolving and that we're looking at more efficient processors for graph processing. As you all know, graph processing is getting bigger and bigger, graphs are getting bigger and bigger, and we want to process more data. So we found that we need to increase the efficiency of graph processing versus existing architectures such as CPUs and GPUs. To that end, we propose the programmable unified memory architecture, abbreviated as Puma. In this talk, I want to explain to you why graph processing is actually challenging on existing architectures, what makes Puma fit for graph processing, some high-level details about Puma, and how it performs. And after this presentation, I'll probably get the question again about when Puma will be available: unfortunately not yet, because it's still under development. Okay. As you all know, Intel is the market leader in high-performance processors. In these processors, we've implemented a lot of things for regular applications that work very well, such as branch prediction: branches in regular applications are very predictable, so we use that to predict branches ahead, so that the instructions can flow in a more continuous way. We have caches that assume that if you access some data, you will access it again in the near future, or access its neighboring data. We have vector operations that perform the same operation on neighboring data. All of that works well for regular applications. But graph applications, as the previous speaker also explained, are not that nice to these architectures. For example, many graph applications have many branches that are data dependent.
So the actual outcome of the branch depends on the data in the graph, which is, of course, not predictable, so branch predictors don't work well. Data is also accessed in a scattered way, as the previous speaker also introduced: you access the neighbors, and the neighbors are not the next nodes in your list of nodes; they are scattered all over the place. So caches don't work well, because you don't use the neighboring data and you don't reuse the same data, and the same holds for vector operations. So we see very low performance on regular CPUs for graph applications. Many people have proposed to use GPUs. They are actually better in general, because they have higher bandwidth and more threads, but they suffer from the same problems. If branches diverge, so they go in different directions, the parallelism cannot be fully exploited. Scattered memory accesses prevent the use of memory coalescing, meaning that you cannot use all of the efficient mechanisms of GPUs either. And as we also saw in previous presentations, there is a problem of memory capacity and scaling out. So what are potential solutions? There have been proposals for graph accelerators, specific chips made for graph applications with some fixed functionality, for example implementing sparse linear algebra algorithms or vertex-centric algorithms. The problem there is that you're stuck with that functionality: if your algorithm is best expressed in sparse linear algebra, but your accelerator is a vertex-centric accelerator, then you need to translate your application. You also need a host, a CPU that controls this accelerator, and there is data transfer between both. Another solution is a general instruction set processor, like a normal CPU, but optimized for graph applications. That is more flexible, because with an instruction set you can implement other algorithms too.
It's self-contained, you don't need a host per se, and that's actually the approach we took in Puma. Okay, I've talked about the challenges that graph applications pose. So how did we solve that in Puma? Most graph applications are very memory bound, so most of the time, if you have a very fast core, such as a Xeon core, it's just stuck behind the slow memory. It's waiting forever, so you cannot use all of its speed. Instead, in Puma, we have much lighter, much slower cores, but we have many of them. They still wait for memory, but because there are many of them, the total throughput is higher. As I said before, caching was a problem. The problem is that if you load data, you only need the one element that you load, not the full cache line. In a conventional architecture, you load the full cache line from memory over the memory bus into the cache and then to the core, but as you see, only the blue elements are used. That's a very inefficient use of the cache capacity and of the memory bandwidth, so you waste a lot of potential. We have a lot of cores, but we also optimized the memory accesses such that you can access just a single element. We bypass caches, because they're not efficient here, to free up chip area for cores and other structures, and to get a more fluid flow of memory operations. Another thing is the size of the graph. If you have a very large graph, you need to partition it and put it onto multiple nodes, multiple compute nodes, multiple servers. But then, when the graph algorithm wants to access data on another node, it needs to go through the whole communication stack, which takes a lot of time, and unfortunately graph algorithms are not very predictable in their locality. So it often occurs that we need to go off-node, giving a large performance penalty. For Puma, we have hardware distributed shared memory: there is a shared memory across the whole system, so you don't need to think about communication.
The network is high-bandwidth and low-latency, to reduce the latency and the performance impact of remote accesses. Another thing we saw is that there are a lot of common patterns in graph applications, and we designed offload engines that efficiently execute these patterns. For example, atomics are used very often in graph algorithms, but for a core that's a very intensive operation: you have to lock the data, load it, update it, write it back, and unlock it. So the core is often stuck for a long time performing these atomics. In Puma, we have an offload engine that performs the atomics and relieves the core from doing so. The core just issues an atomic instruction, leaves the execution to the offload engine, and can continue executing other instructions. The operation is done in the background, and the offload engine looks up where the data is located and performs the update locally. Similarly for gather operations: for example, if you want to gather some characteristics of the neighbors of a vertex, the normal sequence is that you load the index of the first neighbor, then you load the data of that neighbor and store it somewhere, and so on. That's a very intensive process for the core. So we have a DMA gather offload engine: the core just issues a DMA gather instruction and then continues executing, and in the background the offload engine performs the necessary memory accesses, without actually needing to move all the data to the core itself. Other offloaded operations are memcopy, barriers, queues, and so forth. Now, going into a little more architectural detail of a single Puma core: this is a schematic overview of a single Puma core. It consists of multiple pipelines, and each pipeline supports multiple threads. Why? Like the previous presenter said, most of the time we are waiting for memory operations.
So if we are waiting for a memory operation, we just switch to another thread; if that one is also waiting, we switch to yet another thread, and so on, just to hide all of these memory latencies. We have limited caches: there is an instruction cache and a data cache, which is very small; it's for very local data such as the stack of your thread. The instructions executed by these cores are a novel instruction set that we designed. It's based on a reduced instruction set, so simple instructions, and we added specific instructions that are useful for graph operations, such as a single-instruction indirect load. Because we have limited caching, we don't want to go to main memory all the time for data that's consumed and produced locally, so we have a manually accessible scratchpad: the programmer can decide whether to put data in the scratchpad or go to main memory. It's part of the global address space, so other cores can also access the scratchpad of this core. I talked about offload engines; they perform these operations in the background. We have a memory controller that's been optimized for these small, eight-byte accesses, and we have a network interface to connect to other cores, also, of course, using eight-byte packets, because most of the data is used at these small granularities. Okay, the full Puma system, as we envision it, is hierarchically built. Multiple of these cores form a tile, multiples of these tiles form a node, and multiples of these nodes form a system. That means, since you already have that many threads on a core, that many cores on a tile, and so on, that the full system can easily consist of millions of hardware thread contexts. Again, the global address space and the memory are shared across the full system, so you can access any data from any point in the system. Of course, this requires a very high-bandwidth, low-latency network, so we use a HyperX network topology.
This is the textbook representation of it: you have a hierarchical design where, within each level, everything is fully connected, and then there are connections between the levels. Furthermore, we plan to have optical connections between the tiles and between the nodes, again to increase bandwidth and reduce latency. More interesting for you, maybe, is how we program this beast. That's still work in progress, so we're also working on a software infrastructure for this architecture. Our first experiments used simple SPMD-based parallelism in C, with special intrinsics for the Puma-specific instructions and LLVM for building a compiler. In the meantime, there has been increasing support for C++, pthreads, OpenMP, and some tasking, and we're also looking at implementing common graph libraries, other programming languages, and a Python front-end. Now, importantly, does Puma fulfill its promise, namely that graph analysis is more efficient on it? Of course, the chip is not available yet. We are also working on an FPGA model, but that's not ready yet either. So the results I will show are based on simulation: you take a Puma binary, a functional simulator decodes the instructions of the binary and simulates the operations, and a timing simulator models all of the hardware structures, such as the cores, the memory, the scratchpad, the network, and so on. That gives us a performance number and, also interesting for developers, a profile of what the execution looks like: where cores are idle, that's the pink part, or busy, or waiting for memory, that's the light blue part, so that developers can find the bottlenecks and optimize their application.
We also have an analytical model, which I won't go into detail on, because simulation is a very intensive process: we can simulate up to a few tiles, but after that it quickly becomes infeasible. Therefore we use this analytical model, which we validate using simulation at smaller core counts. Then we took a whole bunch of graph kernels and applications. We ran them on a high-end Intel Xeon server with four sockets, optimizing them where needed, and we also ported and optimized these applications for Puma. Initial estimates show that this high-end four-socket Xeon server will consume approximately the same power as a Puma node will in the future, so comparing their performance is also an energy-efficiency comparison. We also ran multi-node experiments on Xeon, but that didn't work well: because of the communication overhead, we saw no speedup for most applications, or even slowdowns. We projected to 16 Puma nodes, and I will show shortly that this scales much better. Here is an overview of some of the applications and performance results. It shows the speedup of a single Puma node versus the high-end Xeon server. You see that there is always a speedup, but there is a large range, going up to 200 times faster than the Xeon. That's because some applications are more compute-intensive, and for those, of course, the Xeon can use all of its resources efficiently. On the other hand, if an application is purely memory bound, such as random walks, then we see a very large performance increase. And the 16-node projections also show that Puma scales much better than Xeon does. Okay, that was basically it. I showed you that Puma is a programmable instruction set processor for graph applications. It contains the many features I discussed, and we showed through simulation and modeling that it's one to two orders of magnitude faster than an equal-power Xeon and that it scales well to multiple nodes.
And it's still in development, of course. Okay, that was it. Thank you very much. A very interesting perspective on graph processing. Questions? Do you think that this will be affordable as a workstation, or would you say, no, this is a cloud installation that sits somewhere in a server center, or where do you see it? So the question is whether this is a workstation, PC, or desktop-based thing, or whether it will be more in a data center or a big supercomputer setup. Is it possible, kind of going on from what he asked, to have these cores as co-processing cores? So you have a few big normal cores and a few Puma cores? So the question is whether we plan to have some, let's say, host nodes. We're still working on that. There is an option to have potentially x86 nodes that do the other work that Puma is not efficient for, but it's still under development. Sorry? It's not done in the same way as on Xeon, but I'm afraid I cannot go into details of how it will work. It's a shared memory architecture, and there are different processing units: does it matter where in memory the values needed by a certain core are located? If they're close by, will access be faster, or doesn't it matter since it's an all-to-all connection? What's the penalty for data being further away in memory from the core that needs it? Okay, so the question is: will data that's further away in memory be slower to access? Yes, of course, because it's a very large system, distributed across multiple nodes, multiple racks. So there will be a larger penalty, but because of all of these memory latency hiding techniques, and the offload engines that perform these accesses very efficiently, you won't see it as much as in a conventional multi-node setup. More questions? So thanks again.