Hi, I am Prasun Anand, and I am here to talk about high-performance GPU computing with Ruby. I am really glad to be here, and I thank the RubyConf organizers for having me. Very few people realize that even modest computers today have very powerful GPUs that can be used in parallel with the CPU, or in series with it, to deliver really impressive performance. So in this talk I would like to cover two RubyGems that I created in the last year: the ArrayFire gem and the RbCUDA gem. These libraries help you accelerate your number-crunching or scientific code and gain big performance improvements by adding just a few lines of code, maybe four or five per operation.

Before we delve into the topic, let me introduce myself. I am a SciRuby contributor; SciRuby stands for the Ruby Science Foundation, and what we do is create RubyGems for scientific computing. I worked as a Google Summer of Code student for the Ruby Science Foundation in 2016 and 2017. Currently I am associated with the GeneNetwork project, where we create tools for high-performance genome scans on clusters and GPUs, including Teslas and even Intel Xeon Phis. Recently I was awarded the Ruby Association Grant 2017 to work on the RbCUDA gem. These are the projects I have done: first, a JRuby port of NMatrix, the linear algebra library, which I ported to JRuby; then the ArrayFire gem for GSoC 2017; and currently I am working on RbCUDA.

On scientific computing: Ruby has been around for 25 years, but people still don't reach for it as a go-to tool for scientific computing or number crunching. So in the last few years the Ruby Science Foundation and others have created gems for scientific computing, where we handle very large data sets for data analysis, machine learning, and so on. Currently SciRuby has gems like NMatrix, Daru, and Nyaplot: NMatrix is for linear algebra, Daru is for data analysis, just like Pandas in Python, and Nyaplot is a plotting library. Also, since Python has a head start, we sometimes use Python to solve problems we can't do directly in Ruby, so we have gems like PyCall that help you call Python from your Ruby code.

Now, arrays and matrices. For any scientific problem, the data you have is, at its core, an array or a matrix, and these arrays and matrices are huge. With biological data, for example, you can easily have a matrix of 5,000 rows and 5,000 columns, and that's at the low end. To handle such large arrays and matrices you need specialized libraries; NMatrix, for example, helps you handle matrices on the CPU. Linear algebra libraries that handle matrices must be memory-efficient and fast: you need fast loops to iterate through every element of a matrix or array, and you need to conserve memory, because RAM is limited and you have to manage it efficiently so you don't run out. For the heavy lifting we have BLAS and LAPACK, Fortran libraries that do matrix computation by harnessing the multi-core capabilities of CPUs. Whenever you need to do scientific computing or number crunching in C, you reach for BLAS and LAPACK, or the Eigen or Intel MKL libraries.
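To make that concrete, here is a minimal sketch of CPU-side matrix work with NMatrix; the shape and dtype are just illustrative:

    require 'nmatrix'

    # A 2x2 double-precision matrix; NMatrix stores it densely on the CPU
    # and sends heavy operations to its C backend.
    a = NMatrix.new([2, 2], [1, 2, 3, 4], dtype: :float64)

    b = a + a      # element-wise addition
    c = a.dot(a)   # matrix multiplication; with the nmatrix-atlas or
                   # nmatrix-lapacke extension this goes through BLAS
    puts c.to_a.inspect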
Since BLAS and LAPACK are Fortran libraries, we have C bindings for them, and NMatrix calls these C bindings for its linear algebra. Similarly, Numo is another package that provides an N-dimensional array and does much the same thing; NMatrix and Numo::NArray provide almost the same functionality.

Let's move on to GPU computing, because GPU computing is not easy. For a beginner doing GPU computing in C, you need to handle pointers. GPU computing today is done by writing kernel code: you write C-style kernels in a .cu or .cl file, compile them, and load the compiled code onto the GPU hardware. Then you have to manage the pointers you created and perform operations through them.

CUDA and OpenCL are the two platforms we currently use for GPU computing. CUDA is limited to NVIDIA hardware; it's a proprietary solution and can't run on GPUs from other vendors such as AMD or Intel. OpenCL, which stands for Open Computing Language, runs across all GPU hardware regardless of vendor. In my experience, though, on NVIDIA GPUs CUDA gives better performance than OpenCL.

Here comes ArrayFire. ArrayFire is a C/C++ library for general-purpose GPU computing. It's an abstraction over an array: you create an Af_Array that lives on the GPU device, and you don't have to worry about what kind of hardware you're using, whether the GPU is from NVIDIA, AMD, or Intel, or whether CUDA or OpenCL is better suited to your needs; it just tries to give you the best performance. Recently ArrayFire also gained a CPU backend, so if you don't have access to a decent GPU on your machine, the same code will automatically run on the CPU. ArrayFire has wrappers in Python, Go, Julia, and Rust, and what I did was create the Ruby wrapper for it, which really makes our work easy.

This is how you create an Af_Array, the structure that stores your array; it can have up to four dimensions. In this slide I show how to create a two-dimensional Af_Array. The highlighted syntax shows A as an Af_Array with two dimensions; the next argument, [2, 2], is the size of the array, meaning two rows and two columns; and then come the elements, [1, 2, 3, 4]. When you create a matrix, an Af_Array, with this code, you get the elements shown below. The format is column-major, so you see 1, then 2, then 3, then 4.

Next we add the array A to itself and store the result in B; this is the code that shows it. It's pretty easy: 1, 2, 3, 4 added to itself gives you 2, 4, 6, 8. Then comes matrix multiplication. If anyone here is familiar with data science, you know we use matrix multiplication all the time in number-crunching code. Here I've created two arrays called left and right; one has dimensions 3x3 while the other is 3x2. Then we do the matrix multiplication, as simple as that.

As for how we implemented it: I create an ArrayFire struct of type af_array. In the next highlighted line of code, I cast the values coming from the Ruby VM from Ruby numerics to the double data type using NUM2DBL. Then I create an af_array using the af_create_array API provided by ArrayFire, which copies the host array data over to the GPU.
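Putting those slides together, here is roughly what the Ruby side looks like; this is a sketch, and the matmul flag names may differ slightly in the released gem:

    require 'arrayfire'

    # A 2x2 Af_Array: number of dimensions, then the shape, then the
    # elements in column-major order, so this matrix is [[1, 3], [2, 4]].
    a = ArrayFire::Af_Array.new(2, [2, 2], [1.0, 2.0, 3.0, 4.0])

    # Element-wise addition runs on whichever backend ArrayFire picked.
    b = a + a   # elements become [2, 4, 6, 8]

    # Matrix multiplication: a 3x3 times a 3x2 gives a 3x2 result.
    left   = ArrayFire::Af_Array.new(2, [3, 3], [1, 4, 6, 4, 11, 2, -1, 3, 13].map(&:to_f))
    right  = ArrayFire::Af_Array.new(2, [3, 2], [9, 11, 1, 11, 2, 0].map(&:to_f))
    result = ArrayFire::BLAS.matmul(left, right, :AF_MAT_NONE, :AF_MAT_NONE)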
In GPU computing, you can't access your data directly. You first create an array on the host device, that is, the CPU; then you copy that array from the CPU to the GPU; then on the GPU you run the kernel code that operates on that array; and when the final result is ready, you copy the data back from the GPU to the CPU. In the case of ArrayFire, you don't have to worry about any of that, because it abstracts it all away and makes it as simple as creating an Af_Array.

In the next example, I take that pointer and do a matrix multiplication with it. In the first highlighted line we have created an ArrayFire struct called left; we also create an ArrayFire struct called result and allocate device memory for it. Then we call the af_matmul API, which takes the device pointers of left and right, multiplies them, and stores the product in result.

These are the BLAS and LAPACK functionalities. The BLAS functionality covers matrix multiplication and transpose, while the LAPACK functionality covers determinant calculation, matrix inversion, the Frobenius norm, and QR, Cholesky, SVD, and LU factorizations. ArrayFire also provides APIs for calculating the mean, median, or variance along different dimensions of your matrix, via the af_mean, af_median, and af_var functions.

Next, let's come to the benchmarks: how much does this actually accelerate your code? I ran the benchmarks on an AMD FX-8350 processor and an NVIDIA GTX 750 Ti GPU, which is of the Maxwell architecture. The latest architecture is Pascal, but this is still a decent GPU. We used the double dtype and the CUDA backend.

In these graphs, the x-axis shows the number of elements in the matrix and the y-axis shows the computation time for the operation, so the lower the computation time, the better the performance. We're comparing NMatrix-LAPACK on Ruby, NMatrix on JRuby, and ArrayFire; NMatrix-JRuby is the JRuby port of NMatrix that I created, and NMatrix-LAPACK uses LAPACK for the matrix calculations. For matrix determinant calculation, NMatrix-LAPACK takes around 12 seconds, whereas ArrayFire takes around 2 seconds, so ArrayFire is faster than NMatrix-LAPACK by close to an order of magnitude. So yeah, we did a nice job.

The same goes for matrix LU factorization: when you do an LU factorization, the next step is to calculate the determinant from the diagonal elements, so this benchmark looks exactly the same as the determinant one. For matrix addition, NMatrix takes around 6 seconds, whereas ArrayFire takes around 0.0004 seconds, that is, 400 microseconds, a performance improvement on the order of 10,000x. Matrix subtraction looks the same as matrix addition, because both are element-wise operations; instead of adding two elements I'm just subtracting them, so the figures are identical.
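If you want to reproduce the element-wise numbers yourself, a minimal harness along these lines will do; the size and reporting here are illustrative, and a careful benchmark would also synchronize the GPU before stopping the clock:

    require 'benchmark'
    require 'nmatrix'
    require 'arrayfire'

    n = 5_000
    elems = Array.new(n * n) { rand }

    cpu = NMatrix.new([n, n], elems, dtype: :float64)
    gpu = ArrayFire::Af_Array.new(2, [n, n], elems)

    Benchmark.bm(10) do |bm|
      bm.report('NMatrix')   { cpu + cpu }   # element-wise add on the CPU
      bm.report('ArrayFire') { gpu + gpu }   # the same operation on the GPU
    end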
Now comes matrix multiplication, which is at the crux of any scientific computing code; we call it a lot. In this case NMatrix gives you two ways to call the BLAS routine for matrix multiplication: NMatrix-BLAS or plain NMatrix. NMatrix-BLAS is faster because it uses the Fortran BLAS, whereas plain NMatrix runs C code. Here NMatrix-BLAS takes around 31 seconds, whereas ArrayFire takes 0.00062 seconds, that is, 620 microseconds, so the performance improvement is about 100,000x.

The point of all this is that with ArrayFire you don't have to worry about what kind of GPU hardware you're using. You just write your code without worrying about whether it will run on the CUDA platform or the OpenCL platform, on an NVIDIA GPU or an AMD GPU; it just tries to give you the best performance, and you can also tune it yourself.

Next: NVIDIA devices give the best GPU performance, and since ArrayFire is an abstraction, I tried to create something even closer to the GPU hardware. For that I started another project called RbCUDA, which runs only on NVIDIA devices. ArrayFire was easy because we didn't have to worry about transferring data from the CPU to the GPU or vice versa; here we have to handle everything ourselves: how the GPU array pointer is created, how the data is copied from the CPU to the GPU, and making sure the pointer isn't garbage-collected out from under us. So what we did was create a generic pointer, a void*, that stores the device array's location for the Ruby VM, and then you copy memory from the CPU to the GPU. RbCUDA has been interfaced with NMatrix and NArray: you add one line of code, a = nmatrix.to_gpu, and you get a GPU pointer. We can do the same with NArray, but that is still under development.

This is an example of kernel code. Once you've built your program and you think you can squeeze more optimization out of it, you might be interested in running your own custom kernel code on the GPU, and RbCUDA helps you do that. You couldn't run custom kernels on the GPU through ArrayFire, but with RbCUDA we've created a bridge that lets you do it. This is what a kernel looks like: blockIdx.x identifies the block, which here maps to one array element; we have two input arrays, *a and *b, and we add them element by element, storing the result in c.

What makes RbCUDA different from running these CUDA kernels directly from C is that you can run the kernel code online: you're sitting in a pry session, and you can just inject this kernel code. What I do is take the kernel source, store it in a temp file, and compile it with the NVIDIA CUDA compiler; the result is a .ptx file that can be run on an NVIDIA GPU. This is the code; it's admittedly tough to read.

Running custom kernel code had already been done by another Ruby gem called sgc-ruby-cuda, but what it lacked was support for the surrounding libraries: it had no support for cuBLAS, cuSOLVER, or cuRAND. In RbCUDA we will have support for all of these, so we'll have ready-made routines for BLAS and LAPACK: you'll be able to do matrix multiplication and even matrix decompositions, and you'll also be able to generate random numbers using the Mersenne Twister and other random engines.
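To make that workflow concrete, here is a sketch of the compile step I just described; the kernel is the vector add from the slide, and apart from nvcc's documented flags the helper code is illustrative rather than RbCUDA's exact API:

    require 'tempfile'

    # The custom kernel is plain CUDA C kept in a Ruby string; one thread
    # block per element, so blockIdx.x picks the element to work on.
    VADD_KERNEL = <<~CUDA
      extern "C" __global__ void vadd(const double *a, const double *b, double *c) {
        int i = blockIdx.x;
        c[i] = a[i] + b[i];
      }
    CUDA

    # Write the kernel to a temp file and compile it with NVIDIA's nvcc;
    # the resulting .ptx module is what gets loaded onto the GPU.
    src = Tempfile.new(['vadd', '.cu'])
    src.write(VADD_KERNEL)
    src.close
    system("nvcc --ptx #{src.path} -o vadd.ptx") or abort('nvcc failed')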
So these are the benchmarks. Again, they were run on the AMD FX-8350 octa-core processor and the GTX 750 Ti GPU, with the double dtype. For matrix multiplication you can see that the lowest line, RbCUDA, is faster still: NMatrix-BLAS takes around 31 seconds, ArrayFire around 0.0006 seconds, whereas RbCUDA takes 0.0004 seconds, a performance improvement of about a million times.

Here comes the future work. ArrayFire, being a GPGPU library, that is, a general-purpose GPU computing library, provides ready-made routines for image processing, and it also helps you write classifiers and the like for machine learning, so I'll be working on exposing those APIs, and on indexers as well. Currently only the double data type is supported, so in the future we'll add support for complex floats and so on. RbCUDA is under active development; it's being kindly funded by the Ruby Association, and contributions are welcome. You can check out these repos. The benchmark code can be found on my GitHub, github.com/prasunanand, in the arrayfire-rb benchmarks repository, so you can try it on your machine. Since I ran these benchmarks on the Maxwell architecture, the 750 Ti, you can expect performance up to ten times better when you run them on Pascal GPUs, the NVIDIA 1050 series.

Now, acknowledgments. I would like to thank my Google Summer of Code mentor, Pjotr Prins; he's involved with the BioRuby project and other projects in D and Scala. Next is Pradeep Garigipati, a core contributor of ArrayFire. I'd also like to thank SciRuby, Google Summer of Code, and the Ruby Association for helping me continue my work in open source. Thank you.