Thank you, Cherry. A very pleasant afternoon to everyone present here. We all know that with the increasing amount of data we collect, we want to process it as quickly as possible, and this work is one step in that direction. Today my talk is about HPXCL, the asynchronous integration of GPU computing with HPX.

First, a quick introduction: I am Madhavan Seshadri. I work at Visa in Singapore, I graduated from NTU last year, and I have been working with the Ste||ar Group, under which HPX and HPXCL are developed, since my Google Summer of Code participation in 2017. I also have contributions in the vehicle routing domain and have published a few papers in journals and at conferences.

What is our main goal? Our main goal is massive parallelism: we want to execute as many tasks as possible, in as parallel a manner as possible. But we are inhibited by the complexity of the new languages and programming techniques that come with different kinds of devices, and even with different versions of those devices. Most of the compute power today is split across GPU devices, CPU devices and other accelerators, and each type of GPU device has its own programming language, for example CUDA or OpenCL. Another major challenge is that the architectural integration of heterogeneous devices under one single framework, one single umbrella, is difficult, and the result is often less efficient than the native implementation. So we need to integrate diverse programs under one umbrella while maximizing the utilization of the available processors as much as possible.

Before I go into the design changes that came with integrating GPU devices into HPX, I would like to give a short introduction to what HPX is and what it can do. HPX is a runtime library that offers fine-grained parallelism for CPUs and Xeon Phi. The integration of GPUs into HPX is still somewhat experimental, and while designing a solution we want to minimize the application-level latency that a new API layer would incur, as well as the CPU-to-GPU and GPU-to-CPU inter-processor communication time. We also want to minimize I/O stalls and data transfers, and overlap computation with communication as much as possible wherever it does not hinder the application. Finally, we would like to provide a common synchronization mechanism: if you have some tasks running on the CPU, some on the GPUs and some on other processors, we want you to be able to synchronize all of them through one mechanism.

So what is HPXCL? HPXCL is simply the integration of GPU processing into the HPX framework. Right now the focus is on CUDA; there is also a version that brings an OpenCL implementation into HPX. It provides a uniform synchronization mechanism through the use of futures (more about this in the later slides), and we have reduced the latency incurred at the API or common-layer level. We have benchmarked it against algorithms such as dense matrix multiplication and the sparse matrix-vector product, which are highly data parallel.

Here is how the presentation is structured: I will first give a brief overview of the HPX components and how the new components we are bringing in integrate with HPX; after that I will discuss the synchronization of tasks and the measurements, and the last part will be the conclusion.
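Before going through the components, here is a minimal sketch, not from the slides, of what this futures-based, fine-grained parallelism looks like in plain HPX. hpx::async and hpx::future are standard HPX facilities; the square function is made up for illustration, and the exact header layout can differ between HPX versions.

```cpp
// Minimal illustrative sketch of HPX's futures-based task model.
#include <hpx/hpx_main.hpp>        // provides main() bootstrapping for HPX
#include <hpx/include/async.hpp>   // hpx::async
#include <hpx/include/lcos.hpp>    // hpx::future
#include <iostream>

int square(int x) { return x * x; }   // illustrative task

int main()
{
    // Launch a task asynchronously; hpx::async returns a future immediately.
    hpx::future<int> f = hpx::async(square, 21);

    // ... other work can run here while the task executes ...

    // Block only at the point where the value is actually needed.
    std::cout << "result: " << f.get() << std::endl;
    return 0;
}
```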
If you look at this architecture diagram and focus on the right-hand side for a moment, you see that there are multiple localities connected through AGAS and the parcel layer. I will discuss each of these components in more detail over the next few slides; this is just an overview of how things work in HPX.

First things first: what is AGAS? AGAS is the Active Global Address Space, which maintains the list of active global objects in memory. It uses a 128-bit virtual address space that spans all the localities. If you look at the diagram, the shaded region represents objects that are visible globally, which means objects from one locality are visible across the other localities and can be used with the same references. The unshaded region represents local objects that belong to one locality and are not shared across localities. The addresses of the global objects are unique system-wide, so you can use the same reference from different localities. For this integration of GPUs with HPX, we use AGAS to store a system-wide list of the devices available across the localities.

The next thing I would like to discuss is the parcel layer. We have discussed an address space that maintains the list of objects in memory, but what about communication between these localities? In HPX that happens through a component called a parcel. The underlying transport of the parcel layer can be swapped out, for example for plain MPI. So inter-node communication in HPX happens through parcels. This layer is one-sided, and no polling is required from any locality, so as to minimize the wastage of CPU cycles.

The next component is the thread manager. We use it for scheduling jobs: making sure the jobs get executed on the localities we want them to run on, and so on. The scheduling and management of jobs is handled by the thread manager.

Actions provide the remote-invocation mechanism in HPX. Basically, an action lets you transport a function from one locality to another along with a set of parameters as arguments, execute it on a completely different locality, and then bring the result back. There are also some built-in settings that HPX provides to optimize the utilization of the hardware.

You may have guessed by now that a locality is just a node on which threads and jobs run and are executed. From this diagram you can see that different GPU devices can be attached to any one locality. We want to be able to think of all of this as a single unified computing resource, schedule jobs onto it, and synchronize whenever we want.

So what are the new classes we are bringing in through the integration of GPUs? We are bringing in three main classes: device, buffer and program. More on these over the next few slides. The device class, and the objects created from it, is a one-to-one mapping of an actual physical device.
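To make the action mechanism a little more concrete, here is a small sketch, not taken from the slides, of a plain action being invoked on a remote locality. HPX_PLAIN_ACTION, hpx::find_all_localities and hpx::async are standard HPX facilities; the add function is made up for illustration, and the include paths may vary with the HPX version.

```cpp
// Illustrative sketch of HPX's remote-invocation mechanism (actions).
#include <hpx/hpx_main.hpp>
#include <hpx/include/actions.hpp>   // HPX_PLAIN_ACTION
#include <hpx/include/async.hpp>     // hpx::async
#include <hpx/include/runtime.hpp>   // hpx::find_all_localities
#include <iostream>
#include <vector>

// An ordinary function...
int add(int a, int b) { return a + b; }
// ...wrapped as an action so it can be shipped to another locality.
HPX_PLAIN_ACTION(add, add_action);

int main()
{
    // Pick some locality in the system (here simply the last one found).
    std::vector<hpx::id_type> localities = hpx::find_all_localities();
    hpx::id_type target = localities.back();

    // The function and its arguments travel to 'target' as a parcel,
    // run there, and the result comes back as a future.
    hpx::future<int> f = hpx::async(add_action(), target, 40, 2);
    std::cout << "remote result: " << f.get() << std::endl;
    return 0;
}
```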
Your GPUs could be spread across different localities, that is, across different nodes, and through AGAS you can hold references to all of those devices in memory. The device class is used for creating buffer and program objects, so device objects across localities can be accessed from one single place. The device class is also an abstraction over the native implementation: it comes with facilities to build kernels at runtime as well as to execute them whenever we need to.

The next class to discuss is the buffer. You can think of a buffer as a data pipe: buffers move data to the localities that need it. Say I am a thread and I come across a reference to an object that I do not currently possess. What I do is go to AGAS and ask it to send me a copy of that data, and that transfer happens through parcels. In the case of buffers, we use this mechanism to transport data to a locality and then on to the GPU device attached to that locality. There is one buffer per data argument that needs to be passed to the kernel.

The next class I would like to discuss is the program. The program class gives you functions for runtime compilation of any piece of kernel code you want. Currently it takes two kinds of input: you can load kernels from files and you can load kernels from strings. Basically, you take a device and create a program with that device, and whenever you want to run that program you specify the grid and block parameters, which you would normally provide to a CUDA implementation, as well as the list of kernel parameters as an argument vector.

We have discussed the different components the framework provides. The next thing I would like to discuss is how you synchronize jobs running on CPUs and GPUs with just one synchronization mechanism. As I mentioned earlier, we use futures for this. Every time you call an asynchronous function, it returns you a future. HPX provides two built-in functions, wait and wait_all. Wait takes a single future as an argument and is a blocking call; wait_all takes a vector of futures and waits for all of the futures to return their values. Using these mechanisms, if you have an asynchronous execution tree of tasks, you can start the tasks and collect each result only when you require it, just before the next task in the pipeline needs it.

So how does it all fit into place? As I mentioned earlier, there is a get_all_devices function in HPXCL that gives you the list of devices connected across the nodes. To obtain a particular device you can just index into that vector. Using a device you can create a program object; the program object lets you create a program from a file or a string, as mentioned before, and returns you a future. You can then initialize buffers for transferring data to your kernel: you create the buffer and use the enqueue-write function to write something into it. Then, as in this small snippet of code, just before you want to execute the program, you wait for all the buffers to finish transferring their data to that particular GPU.
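Putting these pieces together, here is a rough sketch of the workflow just described: get the devices, build a program, write the buffers, wait on the futures, then launch. The names get_all_devices, create-program-from-source, enqueue-write and run come from the talk, but the exact signatures and types shown below are my approximations, so treat this as pseudocode and consult the HPXCL repository for the real API.

```cpp
// Rough sketch of the HPXCL (CUDA back end) workflow described above.
// Function names follow the talk; signatures are approximations only.
#include <hpxcl/cuda.hpp>   // assumed HPXCL header
#include <hpx/hpx_main.hpp>
#include <string>
#include <vector>

int main()
{
    // 1. Ask for every CUDA device visible across all localities (via AGAS).
    auto devices = hpx::cuda::get_all_devices(2, 0).get();  // assumed signature
    auto dev = devices[0];

    // 2. Build a kernel at runtime from a source string (or from a file).
    std::string kernel_source = "/* CUDA kernel source goes here */";
    auto prog  = dev.create_program_with_source(kernel_source);
    auto built = prog.build("my_kernel");                   // returns a future

    // 3. Create one buffer per kernel argument and copy the data over.
    std::vector<float> x(1024, 1.0f);
    std::size_t bytes = x.size() * sizeof(float);
    auto buf     = dev.create_buffer(bytes);
    auto written = buf.enqueue_write(0, bytes, x.data());   // returns a future

    // 4. Wait for compilation and data transfer, then launch with the usual
    //    CUDA grid/block configuration and an argument vector.
    hpx::wait_all(built, written);
    // prog.run(args, "my_kernel", grid, block);  // launch; also returns a future

    return 0;
}
```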
This particular link provides a complete example that uses the multiple devices attached to the nodes to split the computation into smaller pieces, execute them across different GPUs, and then bring the results back and stitch them into one single image. It is not very clear on the slide, but it is a black-and-white Mandelbrot image. Now, if you look carefully at the execution times in this plot, there is not much scalability when going from one GPU device to multiple GPU devices. This is because the measured time includes the data transfer time; it is basically the end-to-end time, so data transfer, execution and bringing the result back are all included in this plot. That is why there is not much advantage in going from one to two to three GPUs, but if we were to measure only the GPU execution time, it would scale well in this case.

Previously I showed an example of how multiple devices can take jobs and execute them in parallel. In this section I would like to discuss the overhead that this layer brings in. As I said before, we tested against several different algorithms, for example the sparse matrix-vector product and dense matrix multiplication; these benchmarks were all run on a single device. We see that there is not much difference between the native implementation and our layer, that is, the overhead incurred by our layer is small, while our layer adds the capability to execute jobs on multiple different GPUs. The same holds for the partition benchmark: there is not much added latency.

To summarize what we discussed today: I introduced HPX as well as the integration of GPU processing alongside the CPUs. This framework provides the opportunity to execute jobs on multiple GPUs in parallel along with the CPUs, to synchronize them as and when you require, and it adds only minimal overhead for the new common layer. We then showed an example implemented using three GPUs. Thank you. Any questions?

Yes, we do have a version for OpenCL, and we are currently testing it. It is a common layer to which you can supply OpenCL kernels or CUDA kernels and work with them in the same manner; you do not need to deal with the language-level differences.

Yes. Basically, as a developer the onus is still on you to identify the parts of the program that can be data parallel and then hand those data-parallel parts to the GPUs for execution; the parts that still need to run in sequence you can run with HPX on the CPU.

Question: Is it multi-host? Can you run the same kernel on multiple hosts, with separate GPUs on each host? Yes, that was the whole point of the different localities. You can think of each locality as a separate host, and this is a layer on top of your hosts, so it abstracts away the host-level differences.