Thank you. Thank you guys for having me. I hope you're having a good FOSDEM. My name is Mohammad Osama and I'll be presenting Gunrock, which is our graph processing library built on GPUs using CUDA. I'm a PhD student at the University of California, working with Professor John Owens, and I flew more than, I think, 14 hours to give this talk, so let's have a good go at it. This is a link to the talk if you're interested in downloading it and following along; I'll give you a second to take a picture or something.

So why use GPUs for graph processing? I didn't feel the need to motivate why graphs are important, because obviously you're in this dev room, so presumably you care about graphs. But how can you use GPUs to do graph processing better or faster, and how did we manage to do it? Well, graphs are everywhere, and GPUs are nowadays everywhere as well: your phones, your laptops, everything now has a GPU in it. Graphs require fast processing, and graph analytics has real demands on both memory and compute; GPUs provide powerful memory bandwidth and computation together, so the two go hand in hand.

Graphs are also becoming very, very large. There are billions and billions of edges in graphs nowadays within Facebook and Twitter, so there's a need to process really large data sets. GPUs, however, are not good at that: we have a very limited amount of memory, 32 gigabytes, I think, is the most on an NVIDIA V100, and I'll talk about how we handle some of those issues. There are also irregular data access patterns within graphs, which makes load balancing quite hard, a really difficult problem to solve. Writing high-performance graph analytics and graph algorithms really requires you to get load balancing right, and that's hard to do on GPUs using CUDA, so we show a way to solve that as well.

So what is Gunrock?
Gunrock is a GPU-based graph processing library, like I already said, which aims for performance: you can write state-of-the-art graph algorithms using our library. It also aims for generality, so it covers a broad range of graph algorithms; I'll give you a list of algorithms that we have already implemented, and ways that you can implement your own graph algorithms for problems that you're interested in. Programmability is really cool, because you can implement your own graph algorithms and then extend them to go from a single GPU to a multi-GPU environment pretty easily using Gunrock. And scalability: once again, one of the problems is that the memory available within a GPU is super limited, but graphs are really large, so there has to be a way to scale that and still get the performance that you'd need from a GPU.

To start off, where can you find Gunrock? Gunrock has been hosted on GitHub for, I think, over three years now. It's an open source project with a lot of commits, a lot of contributors, really active, and a lot of graduate students work on it too. This is the link right here to Gunrock's website, where we have our docs and a link to the GitHub page and all that stuff. As for the open source project workflow: we maintain two branches on the main repository, the master branch and the dev branch. The master branch is where all the releases happen; the dev branch is where we push all our development work. A lot of it is done within forks of individual graduate students, and it eventually gets merged into the dev branch, which eventually percolates up to a release in the master branch. It's licensed under Apache 2.0.
We also provide code coverage using Codecov.io, continuous integration using Jenkins, and lots of documentation: the API docs use Slate, and some of the performance results are hosted there as well, so you can check that out. We have unit testing using Google Test, which is really cool. If you have any questions, please post a GitHub issue, or if you want to contribute, create a pull request; there will be someone who will review it or give you feedback, and we're able to respond really quickly using GitHub. We prefer GitHub over emails. Everything related to Gunrock is open source and online; there's no hidden development at all. Even our roadmap is online, so you can go check that out if you're interested in seeing what kind of research problems we're trying to solve in the future or working on right now, the kind of graph algorithms that we aim to write, and some of the interesting HIVE stuff as well; it's all up there. Yeah, so I talked about a lot of contributors and a bunch of commits; Gunrock is also part of the NVIDIA GPU-accelerated libraries, and it's being integrated into the NVIDIA RAPIDS framework, which is a Python framework that lets you do data science, so underneath the hood RAPIDS will use Gunrock and some other graph frameworks provided by NVIDIA.

So let's actually see how everything works. I heard some people hint at the different programming models that exist in graph abstractions. Gunrock relies on the data-centric abstraction (vertex-centric abstraction is another name for it), and it uses the bulk synchronous programming model. So what do those two terms actually mean?
When I say data-centric, we have the notion of a frontier, where a frontier is just a group of vertices or edges. If you have an entire graph, you create a frontier using a subset of those nodes, or maybe the entire graph gets put into a frontier, and then you run whatever algorithm you're trying to write on this frontier; you don't have the notion of the entire graph anymore, so everything works on a frontier moving forward.

You have parallel operators; this is what you use to actually do work on that frontier. We provide some parallel operators, but you are free to write your own as well. These are really high-performance CUDA implementations; this is where all the engineering goes. Some of the examples I have on the left side are advance, filter, intersection, neighborhood reduce, and a lot more. An example would be advance: advance takes an input frontier and generates a new frontier by visiting all the neighbors. In the act of visiting the neighbors, you have access to all of the neighbors of a given node, and you can perform all sorts of different operations on them; I'll give a concrete example of where advance is used later. Filter, as the name suggests, just filters stuff out: you have an input frontier, you use some condition to filter some vertices out, and you get a resulting frontier that's smaller than the input frontier you put in. A lot of documentation is available for all the other operators, including some that I don't mention here.
So this is one operator; you might have another one after that, and another one after that, and so on, and then you run this in a serial loop. After every operator runs there's a synchronization call, which synchronizes the entire data flow and makes that knowledge available to every processing unit (or CUDA core, if you want to call it that), and all of this runs in a serial loop until your algorithm converges.

These are the kinds of algorithms that we currently support, with some still being worked on. We have some of the traditional algorithms, like connected components, breadth-first search, PageRank is somewhere in there too, and SSSP, single-source shortest path. We also have some more complicated applications which require other algorithms to work. Graph trend filtering is one of them, which requires max-flow as one of its components to perform the whole application. Another one is shared nearest neighbor, which is a clustering algorithm that requires k-nearest neighbors (which is somewhere in here) for it to work. So there are some really concrete real-world examples and some textbook algorithms that you can couple together, or you can couple different operators together to write your own graph algorithm, all working in parallel on the GPU.

I also have an example application. I tried to simplify this code as much as I could, but I'll quickly explain what each of those lines means. The first step to implement any algorithm in Gunrock is to actually implement the lambda functions of the operators that you're going to use. These are C++ lambdas; the entire library is written in C++ and CUDA. What lambdas allow you to do is have the generality of doing anything within a function; it's basically a user-defined function.
Say you wanted to do a traditional advance operator in Gunrock, but you also wanted to attach some sort of computation to it. In this example, single-source shortest path, the algorithm is quite simple: you have a single source and you want to find the shortest path to all the other nodes in the graph. You seed a frontier with that single node, and then you basically do an advance (that's why you need the advance operator): you visit its neighbors, then the neighbors' neighbors, and you build this frontier of neighbors of neighbors over multiple iterations, calculating the shortest distance to each node given the source node. So you take the distance of the vertex you're working on, that's the first line, auto distance equals the distance of the vertex ID plus the weight of the edge ID, so the weight on the edge going to a neighbor, and then you find the minimum between that and the neighbor's existing distance. If the new distance is the minimum, you update it, set the neighbor's predecessor to the current vertex, and move on.

If a shortest distance is found, you'll need some way of removing that vertex from the frontier; otherwise the algorithm won't terminate, because the frontier stays full and you keep going, so it would be an infinite loop. You need some way of removing nodes that have already found their shortest paths, and that's what the filter lambda does: it basically says, if it's still valid, keep it in the frontier; otherwise, remove it from the frontier. Lambdas are a great abstraction to let users write their own computation, and you can imagine using the same operators with different lambdas to implement a different algorithm entirely.
One of the recent ones that I did was graph coloring, where I used a ForAll lambda, or you can even use a filter for that. Graph coloring is basically a problem where you want to give nodes different colors if they share an edge, and you can write that all within a filter operator but with a different lambda; all of those examples are online for you to see.

Once you have written the lambdas down, once you have actually expressed your algorithm, you launch it, and this is just a while loop: while the frontier is not empty, keep running the advance operator with the advance lambda and the filter operator with the filter lambda. You provide the lambdas and you keep running until the algorithm converges, until the frontier is empty.

And that's basically all I have. These are some of the grants that allow us to do this amazing research, and I thank you guys for listening. I left a lot of time for questions, I think. Okay. Thanks.

There's some time for questions. Anyone? Yes.

[Audience question about benchmarks.]

Yeah, so the question was: do we have any benchmarks where we compare our results to other work? Yes, we have a web page of evaluation online where we compare against different graph frameworks on some of the famous data sets. One of the recent performance evaluations that I performed runs over 74 different data sets, which includes anything from the Twitter data sets, social networks, and road networks to some of the structured data sets, the RGGs, the randomly generated graphs. So anything you can imagine in the graph analytics world, we have a data set for that and a test for that.
So that's a good question: how do we load the graph? We take Matrix Market format as an input, and then we have underlying graph representations, which are the common ones: CSR (compressed sparse row), CSC (compressed sparse column), and the coordinate format, COO. We also have a way to plug your own graph representation in: you define it under our graph data struct, and it will be good to use going forward. So currently we support three, and there's some work being done by the RAPIDS folks at NVIDIA to support, I think, the Apache Arrow format within Gunrock and within their framework as well.

[Audience question about frontier deduplication.]

So the frontier itself, when it's built, is exact; you know what you put into the frontier to start off with. But when you do operations like advance, there are policies that let you choose between approximate deduplication and complete deduplication, where you remove any duplicate neighbors found in parallel. You might want to keep duplicates for certain algorithms where they don't matter, which will perform much faster because you don't have to do a synchronization and a deduplication. So there's support for both: you can do approximate frontier handling, or you can go aggressive and do exact.

[Audience question:] Have you found any algorithms or functions for which the GPU is slower than the CPU?

One of those is actually max-flow. Graph trend filtering uses max-flow, and for max-flow we found that there's a CPU algorithm that just maps so much better and isn't transferable to a GPU environment; it's not parallelizable. The one we use, I think it's Edmonds-Karp or push-relabel max-flow, just doesn't do as well as we'd want it to, but there's a student working on some research and developing a max-flow algorithm for that. That's one example that comes to mind.

Yes? Right, so the question is about low memory availability within GPUs, and you will obviously need like
a high-end V100 to get the maximum 32 gigs. What we also have is support for unified virtual memory, the CUDA UVM driver, which allows you to manage memory through the CPU by page faulting: you load as much of your graph as you can on the GPU, and the rest of it is page-faulted in as needed. But you can imagine that has a huge performance penalty, depending on how many page faults there are and how you're doing them. There's also the multi-GPU environment: you could buy maybe cheaper cards and use the multi-GPU aspect of Gunrock, which is mostly hidden away from the user and allows you to run the same algorithm across multiple GPUs with more memory available, but obviously you'll have to spend money on buying more GPUs.

[Audience question:] I have a question. You showed the SSSP example, and you used the input values for the lambda to store the resulting distances and the weights. What are the constraints on those data structures? Can I basically pick a generic value, or is it only a long, or what?

That can be literally anything. A C++ lambda allows you to capture any of the variables that you'd like; those square brackets are the actual capture, where you capture the data you're working with, so if it's an array it can be float or double or int or whatever. The entire Gunrock library is templated, so it supports any of the data types that you'd want. I forgot to mention the parentheses there with the dot-dot-dot: I simplified that for the slide, but it's the signature of the lambda, which stays the same no matter what; it's basically copy-pasted from the API, and it provides you access to the vertex ID and the edge ID, because that's how you get those.

Yes, any more questions? So I'm not the person that's working on graph embeddings, but you can have any sort of data struct attached to node labels or edge labels
and then you pass it in in the same way, capture it within the lambda, and you should be able to work on it.

That's my 20 minutes. Okay, let's thank the speaker.