Good afternoon everyone. I'm Aninda Manocha, and today I'm presenting, on behalf of the DECADES team, our work on hardware-software co-design for efficient graph application computation on emerging architectures.

First, a brief overview of what the DECADES project is. DECADES is part of the DARPA Software Defined Hardware program, which aims to design runtime-reconfigurable hardware that can accelerate a variety of data-intensive software applications in the broad domains of machine learning and graph analytics. The DECADES approach is to design a heterogeneous tile-based chip that combines core, accelerator, and intelligent storage tiles, as you can see in the image on the right. This is a collaborative effort between researchers at Princeton and Columbia University, and all of our tools are, or will be in the very near future, open source at the link below.

Many machine learning and data science applications need to process large amounts of dense data, for example images composed of many pixels, and fortunately huge strides have been made in processing these types of data, such as with neural network accelerators. Graphs, meanwhile, can efficiently represent big data, but their data layouts are often sparse, so they require different computing paradigms. Due to the ubiquity of graph databases and data structures, graph applications are at the heart of many big data analytics, as you all know, for example recommendation systems. Here's an example of Twitter's use of a recommendation system: if a user goes to FOSDEM's Twitter page, they will be recommended other free and open source software accounts.

To process big data, modern technology trends have employed specialized hardware, which has led to accelerator-oriented heterogeneity and parallelism. As you can see in the graph, the purple, black, and orange lines show these performance trends over time. These trends have significantly benefited compute-bound workloads, but as the green line shows, there is a gap between processor and memory performance, and in the context of Amdahl's law, as compute grows faster, memory performance grows more slowly, so the relative cost of memory accesses keeps increasing. Unfortunately, many graph applications are memory bound, and they need to process data sets that are massive and continuing to grow exponentially; the Twitter network, for example, contains millions of nodes, and the ability to process these networks hasn't kept up. So we need efficient graph processing techniques that can keep up with our modern data sets.

To design efficient graph processing techniques, we need to understand their bottlenecks, and because many graph applications are memory bound, we look at their data access patterns. As you saw in the last presentation, which introduced the idea of a frontier, we look at graph applications that are iterative and frontier-based; this includes the widespread breadth-first search, single-source shortest paths, and PageRank algorithms. So what does it mean to be iterative and frontier-based?
Well, as we saw, we have a frontier of nodes, and we take multiple iterations to traverse the graph. Within each iteration of the algorithm, the frontier contains the IDs of the nodes we want to process, and we also have a flat array, node vals, which stores the per-node properties. Depending on the objective of the algorithm, we store a different type of data for each node; in breadth-first search this is the number of hops away from our given source node. On the right we have the kernel template for these iterative, frontier-based graph applications, and I'm going to walk through this template in the context of the breadth-first search algorithm.

We start with our root node, node zero. For every node in our frontier, we do some processing of that node, and then we look at all of that node's neighbors and call an update neighbor function on each one. The exact details of this function depend on the objective of the algorithm: in the case of breadth-first search, we load from the node vals array to determine whether the neighbor has been visited, and if it has not, we store the number of hops that node is from our given source node. Because these updates depend on the locations of the neighbors, they require an indirect memory access, and as you can see in the flat array, this leads to irregular accesses within the array; this is the key thing to keep in mind going forward. Then, for the nodes that have not been visited, we add them to the frontier for the next iteration of the algorithm, and this process continues until we reach an iteration where our frontier is empty.

So why are irregular memory accesses problematic? Modern memory hierarchies are composed of multiple caches, and caches are designed to store frequently accessed data that is laid out in contiguous blocks, so when your memory accesses are irregular, caches are not amenable to them. This is highlighted by the sample memory hierarchy below: we have to traverse the memory hierarchy, and as we miss at each level of cache we eventually go off-chip to main memory. If you recall, in the kernel template the update neighbors function that performed the irregular memory accesses was inside a nested loop, so it occurred very frequently. We define irregular memory accesses that occur frequently as LLAMAs (long-latency memory accesses); this is our acronym for them.

To quantify why LLAMAs are problematic, we look at five different graph and sparse applications and break their runtimes down into compute versus memory. All of the compute is highlighted by the orange bars, and the memory accesses are broken into LLAMAs versus non-LLAMAs in yellow, and as you can see, the LLAMAs dominate the runtime for all of these applications. The graph below shows specifically the LLAMA last-level cache miss rate: across all five applications it is 0.5 or above, which means that at least 50 percent of the time these LLAMAs perform an expensive, long-latency memory access to main memory. Because LLAMAs have a disproportionately large impact on the performance of these graph applications, our work seeks to specifically address them, and thus we introduce our approach: FastLLAMAs.
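To make the kernel template just walked through concrete, here is a minimal sketch in C++ of the iterative, frontier-based pattern specialized to breadth-first search. The adjacency-list graph layout and all names here are illustrative placeholders, not the exact DECADES kernel template; the indirect load of node_vals inside the nested loop is the access that becomes the LLAMA.

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of the iterative, frontier-based kernel template,
// specialized to breadth-first search. The adjacency-list layout and
// names are illustrative placeholders, not the DECADES template itself.
struct Graph {
    std::vector<std::vector<uint32_t>> neighbors;  // adjacency list per node
};

void bfs(const Graph& g, uint32_t root) {
    const uint32_t UNVISITED = UINT32_MAX;
    // Flat per-node property array ("node vals"): hops from the root.
    std::vector<uint32_t> node_vals(g.neighbors.size(), UNVISITED);
    node_vals[root] = 0;

    std::vector<uint32_t> frontier = {root};   // node IDs to process this iteration
    while (!frontier.empty()) {                // iterate until the frontier is empty
        std::vector<uint32_t> next_frontier;
        for (uint32_t node : frontier) {
            uint32_t hops = node_vals[node];   // "process node" step
            for (uint32_t nbr : g.neighbors[node]) {
                // "update neighbor": the load of node_vals[nbr] is the indirect,
                // irregular access inside a nested loop, i.e. the LLAMA.
                if (node_vals[nbr] == UNVISITED) {
                    node_vals[nbr] = hops + 1;       // store hop count
                    next_frontier.push_back(nbr);    // add to the next frontier
                }
            }
        }
        frontier = std::move(next_frontier);
    }
}
```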
FastLLAMAs is another acronym, short for full stack approach and specialization techniques for hiding long latency memory accesses. At a high level, this is a data supply approach that efficiently maps graph applications onto pairs of producer and consumer cores. We have a programming model that allows a more explicit mapping of these applications, as well as specialized hardware support that can asynchronously issue irregular memory accesses, and we get impressive speedups from this, which I will show at the end.

First, I'm going to give a brief overview of the decoupling technique. Decoupling is a technique where a program is statically divided into two independent instruction streams. One of these streams is mapped onto a producer core, which is responsible for all memory accesses and the address computation needed for those accesses, and the other is mapped onto a consumer core, which is responsible for all the value computation. These cores run independently and in parallel, which creates a form of heterogeneous parallelism. To illustrate the contrast, the two execution timelines below show, on the left, homogeneous parallelism, where the two threads (the top and bottom rows) perform the same types of computation and memory accesses, and on the right, heterogeneous parallelism, where one thread, the producer core, is responsible for the memory accesses while the other thread does the computation. Homogeneous parallelism is great when you have a very compute-bound application, whereas heterogeneous parallelism comes into play when you have a memory-bound application.

The main idea of decoupling is to tolerate memory latency, and this is done by having the producer issue requests to the memory hierarchy and retrieve the data before the consumer needs it. The cores use a piece of specialized hardware, the communication queue, so that the producer can store the data before the consumer needs to consume it. The timeline on the bottom right, which I pointed to earlier, illustrates how this tolerates memory latency: there is a warm-up period where the two cores start running at the same time and the producer needs to gather its run-ahead, but it does this very quickly, and once it has, the producer can asynchronously issue memory accesses and the consumer never has to stall, whereas typically these memory accesses would be long latency and the consumer would sit waiting for the data.

Because decoupling creates two independent instruction streams, the original dependencies in the program are now remapped so that dependencies only exist within each individual slice. There might have been a dependency on a memory access performed by the producer, but now the dependent computation lives on the consumer, and when this happens we take advantage of it with asynchronous accesses. Asynchronous accesses are memory accesses whose data is not later used by the producer, so the producer can hand the data off to the consumer and move on to issue its later memory accesses; as a result, it doesn't have to occupy its hardware structures or pipeline resources. This is illustrated on the right with two execution timelines. The top one shows the scenario with no asynchronous memory accesses: each memory access the producer needs to issue depends on the previous one, and this leads to frequent stalling on both the producer and the consumer.
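As a rough software analogy of the basic decoupling organization just described (not the DECADES hardware, ISA, or compiler output), the sketch below splits a simple indirect gather into a producer thread that performs the address computation and loads and a consumer thread that performs the value computation, with a small bounded queue standing in for the hardware communication queue. In this version the producer waits for each load before handing it off, which corresponds to the non-asynchronous scenario in the top timeline; all names are my own.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A small bounded queue standing in for the hardware communication queue.
template <typename T>
class CommQueue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t capacity_;
public:
    explicit CommQueue(std::size_t capacity) : capacity_(capacity) {}
    void push(T v) {                       // producer side: block if the queue is full
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return q_.size() < capacity_; });
        q_.push(v);
        cv_.notify_all();
    }
    T pop() {                              // consumer side: block until data arrives
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        T v = q_.front();
        q_.pop();
        cv_.notify_all();
        return v;
    }
};

// Producer slice: address computation and memory accesses.
// Consumer slice: value computation on the supplied data.
int64_t decoupled_gather_sum(const std::vector<uint32_t>& indices,
                             const std::vector<int64_t>& values) {
    CommQueue<int64_t> queue(64);
    int64_t sum = 0;

    std::thread producer([&] {
        for (uint32_t idx : indices)
            queue.push(values[idx]);       // irregular, index-dependent load, then hand off
    });
    std::thread consumer([&] {
        for (std::size_t i = 0; i < indices.size(); ++i)
            sum += queue.pop();            // compute on whatever the producer supplies
    });

    producer.join();
    consumer.join();
    return sum;
}
```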
When we have asynchronous memory accesses, in contrast, the producer can issue a request and move on to its next one without having to wait for the previous one to finish, and following the warm-up period the consumer never has to stall as a result.

Now I'm going to talk about how FastLLAMAs leverages this decoupling technique to tolerate latency. To provide a contrast, this is the original kernel for the iterative, frontier-based graph applications, broken down into three high-level functions: the process node function, highlighted by the orange boxes; the update neighbors function, which is our LLAMA, highlighted by the red boxes; and the conditional addition of nodes to the frontier, highlighted by the blue boxes. When we execute this template on an in-order core, we can see that the LLAMAs dominate the runtime. FastLLAMAs decouples this program so that the process node function is mapped onto the producer. In the execution timeline on the lower left, the producer is the top row and the consumer is the bottom row; the wide middle row shows what is happening asynchronously in the memory hierarchy, so it isn't mapped onto a core; the producer and the consumer are the two cores running in parallel. The producer does the node processing and can then issue LLAMAs, shown by the small boxes labeled "init": it issues an irregular memory access and then continues on to its next one, and these are not time-consuming operations. The LLAMAs run asynchronously in the memory hierarchy, and when their data comes back the consumer can consume it and continue on with its respective functions. There is a warm-up period again where the producer needs to gain its initial run-ahead, but from there the LLAMAs are issued asynchronously and, as a result, the consumer never stalls waiting for them, so FastLLAMAs is able to tolerate memory latency.

This is a relatively detailed hardware diagram; I'm not going to talk about all of its individual parts, but I'll go over the main additions FastLLAMAs uses to support this in hardware. We have a specialized buffer called the asynchronous access buffer, which is used when the producer issues a memory access: it stores the addresses of the in-flight memory requests, and when the data comes back from the memory hierarchy it is matched with its corresponding address and passed to the communication queue between the producer and consumer cores, where the consumer can use it. So an asynchronous memory access is issued by the producer and sent to the memory hierarchy, its address is tracked as I mentioned, and when the data comes back (sometimes the data might be modified) it can be sent directly to the memory hierarchy or onto the communication queue.
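The asynchronous access buffer itself is a hardware structure, but the spirit of asynchronous issue can be loosely modeled in software with futures: the producer fires off each irregular load and immediately moves on, a list of in-flight futures plays the role of the buffer tracking outstanding requests, and the consumer blocks only when it actually needs each value. This is purely an analogy under my own naming, not the DECADES implementation.

```cpp
#include <cstdint>
#include <future>
#include <vector>

// Loose software analogy of asynchronous LLAMA issue: the producer launches
// each irregular load and immediately moves on, the vector of in-flight
// futures plays the role of the asynchronous access buffer, and the consumer
// blocks only when it actually needs each value.
std::vector<int64_t> async_gather(const std::vector<uint32_t>& indices,
                                  const std::vector<int64_t>& node_vals) {
    std::vector<std::future<int64_t>> in_flight;
    in_flight.reserve(indices.size());

    // Producer side: issue every access without waiting for earlier ones.
    for (uint32_t idx : indices) {
        in_flight.push_back(std::async(std::launch::async,
                                       [&node_vals, idx] { return node_vals[idx]; }));
    }

    // Consumer side: consume each value as it becomes available.
    std::vector<int64_t> results;
    results.reserve(in_flight.size());
    for (auto& f : in_flight)
        results.push_back(f.get());
    return results;
}
```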
Now I'm going to show some results of this approach. We looked at five different graph and sparse applications, the same five I mentioned before with the LLAMA graphs. The two applications on top: element-wise sparse-dense is a multiplication between a sparse matrix and a dense matrix, and bipartite graph projection is an algorithm that operates on a bipartite graph and relates the nodes on one side based on their common neighbors on the other side. Then we have the vertex-programmable graph processing algorithms; we use three of the most widespread, breadth-first search, single-source shortest paths, and PageRank. The difference between these algorithms and the two above is that these require an explicit annotation by the programmer: our programming model supports an annotation that allows the programmer to explicitly guide the mapping, telling the compiler that performs our decoupling to map a memory access onto the consumer. The top two applications do not need this; they can be sliced automatically by our compiler.

Going back to the DECADES hardware, we have the notion of core tiles, and these core tiles can be reconfigured: we can have two parallel core tiles that run simultaneously, or we can have a FastLLAMAs pair, which is a producer core tile and a consumer core tile, and we evaluate both of these configurations. Comparing the two, highlighted by the blue and yellow bars in this graph, which shows the geomean for each of the five applications (we ran the applications on multiple different types of data sets, a combination of real and synthetic networks, but we're just showing the geomeans here), we can see that FastLLAMAs outperforms traditional parallelism by up to 2.7 times. Then, because graph applications are memory bound, we also compare against an in-order core with a perfect cache, which provides a performance idealization as if every memory access had a latency of only one cycle; looking across the orange and yellow bars, FastLLAMAs is able to achieve up to 96.2 percent of perfect-cache performance. Finally, comparing FastLLAMAs to our baseline, which is a single in-order core tile, we see up to a 5.32x performance improvement here, and when we looked at individual application-input combinations we saw up to an 8.66x speedup.

This work was supported by DARPA, as I mentioned before. In conclusion, FastLLAMAs is a hardware-software co-design approach that tolerates latency in graph applications with its programming model, its compiler that performs the automatic slicing into producer-consumer pairs, and its specialized hardware support for asynchronous memory accesses. DECADES is a large effort between Princeton and Columbia, and our team members are listed here. You can access our applications, our compiler, and the simulator we used to get these performance results at the links below. This is also being implemented in our chip design, so the RTL is in progress, but that will be available soon as well. That's it, thank you very much. Any questions?

So the question was: can this architecture mitigate latency in depth-first search and strongly connected components? We'll see the most impressive speedups when these LLAMAs, these long-latency memory accesses, dominate the performance. I guess it depends on the implementation of the algorithm, but we study the most work-efficient implementations, where the long-latency memory accesses are exposed, so I think in depth-first search the long-latency memory accesses are not as much of a problem there, but it could work.

[Audience question, partially inaudible: how completely can the memory accesses be decoupled from the computation, and is there a limited window in which the decoupling applies?]
So, are you asking whether there is a limited window in which the decoupling can apply? OK. When we do the decoupling, the program is sliced by the compiler, so the producer sees its own list of memory accesses to issue. This is where the programming model actually comes in: we have an annotation in our programming model that can tell the compiler to put certain memory accesses on the consumer, and in that case we would leverage that annotation so that the consumer can just do those memory accesses itself and not have to wait for the producer.
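For a sense of what such a programmer annotation might look like, here is a purely hypothetical sketch. The marker consumer_load is a placeholder of my own, not the real DECADES programming-model syntax (which lives in the open-source compiler linked above); it simply flags an irregular access that the programmer wants kept on the consumer slice.

```cpp
#include <cstdint>
#include <vector>

// Purely hypothetical illustration: consumer_load is a placeholder marker,
// not the real DECADES annotation. It flags a load the programmer wants
// the decoupling compiler to keep on the consumer slice; semantically it is
// an ordinary load.
template <typename T>
inline T consumer_load(const T& location) {
    return location;
}

// In a vertex program such as BFS, the programmer would wrap the irregular
// per-neighbor access so it stays on the consumer rather than the producer.
void update_neighbor(std::vector<uint32_t>& node_vals, uint32_t nbr,
                     uint32_t hops, std::vector<uint32_t>& next_frontier) {
    const uint32_t UNVISITED = UINT32_MAX;
    uint32_t val = consumer_load(node_vals[nbr]);   // annotated irregular access
    if (val == UNVISITED) {
        node_vals[nbr] = hops + 1;
        next_frontier.push_back(nbr);
    }
}
```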