So we're nearly at the last talk. It will be given by Prasun, who comes from India. He's still a student, and he did a Google Summer of Code project for which he had the opportunity to dig into scientific computing, for JRuby for instance. He also did a lot of work on GPU computing, and he developed a binding for Ruby and JRuby to ArrayFire, which is a C++ framework for GPU computing. So you see that he has quite a good understanding of scientific computing. Please welcome him with me.

Thank you. Hi, I am Prasun, and I will be talking about scientific computing on JRuby. The main objective of this talk is to tell you how to create a great gem or tool that uses JRuby and is highly efficient, with very good speed. This matters because the scientific library I worked on is memory-intensive and speed counts, so I'd like to share my experiences with that. Next I will introduce you to the general-purpose GPU library that I worked on after my GSoC. This library can be used in industry, in production, and in academia for research. And yes, it can be integrated with Rails.

SciRuby, also called the Ruby Science Foundation, has been trying to push Ruby for scientific computing. Some of the popular SciRuby gems we currently have are NMatrix, Daru, mixed_models, Nyaplot, the IRuby notebook, and many more. These are the most commonly used ones. So what is NMatrix? NMatrix is SciRuby's numerical matrix library. It helps you perform linear algebra calculations on your CPU, and it supports both dense and sparse matrices. Basically, it helps you work with matrices of the kind you might have studied in high school, where you solved linear equations using matrices. NMatrix for MRI relies on the ATLAS, CBLAS, CLAPACK, and standard LAPACK libraries for its linear algebra operations.
These are state-of-the-art libraries: they are written in Fortran, they are highly efficient, and they are used for number crunching. So this is how you use the NMatrix gem. You just require it. In the second line of code, I create an NMatrix. You can see that it's a two-dimensional matrix of three rows and three columns, followed by the elements it contains, and the dtype of the matrix is :float64, which is actually a double. Doubles are used for scientific computing because of their high precision; you can't simply use a 32-bit float. So I created a three-by-three matrix, and you can see it's in row-major format: 1, 2, 3, 4, 5, 6, 7, 8, 9. In the next line of code I add these two matrices, and in the third line I calculate the determinant of the matrix.

Next, Daru. Daru is to SciRuby what Pandas is to SciPy. It helps you load data into a data frame from, say, Excel, CSV, or TSV files, and then you can analyze this data. In this code, I simply require Daru; in the second line I load the data from the file "alien spaces.csv"; and next I create a vector and display the data frame. Now, next is mixed_models. After you have loaded the data, maybe you want to perform some computations, like modeling the data. This can be done with the mixed_models gem. In this code we calculate the fixed effects and random effects of a model, and we can then use the model to predict on new data. Next is Nyaplot. No scientific library is successful unless it has some great visualization tools, and Nyaplot fills this gap for SciRuby. It's used for 2D and 3D plotting, and it's built on top of D3. Next, why should you use SciRuby instead of SciPy? Simply because we love Ruby, we love Rails, and we love the expressiveness of Ruby.
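The slide code itself isn't in the transcript, but the semantics just described — a 3×3 matrix in row-major storage, element-wise addition, and a determinant — can be sketched in plain Ruby. This is an illustrative sketch of what the library computes, not the NMatrix API or implementation:

```ruby
# A 3x3 matrix kept as a flat, row-major array of doubles,
# mirroring the [1, 2, 3, 4, 5, 6, 7, 8, 9] matrix on the slide.
storage = [1, 2, 3, 4, 5, 6, 7, 8, 9].map(&:to_f)

# Element-wise addition of the matrix with itself:
# walk the flat storage once, adding entries in lockstep.
sum = storage.each_index.map { |i| storage[i] + storage[i] }

# Determinant of a 3x3 directly from its flat storage
# (cofactor expansion along the first row).
def det3(m)
  m[0] * (m[4] * m[8] - m[5] * m[7]) -
    m[1] * (m[3] * m[8] - m[5] * m[6]) +
    m[2] * (m[3] * m[7] - m[4] * m[6])
end

p sum            # => [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0]
p det3(storage)  # => 0.0 (the rows are linearly dependent)
```

The determinant of this particular matrix is 0 because the rows are linearly dependent, which is why libraries also report it as singular.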
When you collect a large amount of data from a Rails app, you can just fit this data into a Daru data frame, perform computations on it, and use Daru as an analytics tool. Next is JRuby. My project for GSoC was to port NMatrix to JRuby. SciRuby has been trying to push its gems towards the JRuby runtime because of the speed: it's actually 10 times faster than CRuby in certain respects, and with TruffleRuby it's around 40 times faster. And we just heard that it's going to be around 26 times faster for JRuby classic. So this bird — I like to call it a roadrunner, because I'm dealing with speed, and I want to make this roadrunner go beep-beep.

So, NMatrix for JRuby. Why build NMatrix for JRuby? Because JRuby has no global interpreter lock, as MRI does, so it's a very good team player: when you have a large processor with multiple cores, you can actually utilize them. A program you develop can be easily deployed with the help of the Warbler gem, so you can develop any gem or program on JRuby and just deploy it to your supercomputers or clusters or whatever you have. JRuby also has automatic garbage collection. Whenever you develop a C extension for CRuby, you have to take care of garbage collection yourself because you are dealing with large data, and CRuby's mark-and-sweep model — as we heard in the previous talk — is not easy to handle in that situation. When you have a very large matrix, say 2.5 million elements, the speed is just not good in such cases.

Next is NMatrix for JRuby, which I worked on. It relies on Apache Commons Math, a Java library with very good developer activity around it. Now, MDArray. Before I built NMatrix for JRuby, we had the MDArray gem: MDArray was to JRuby what NMatrix was to CRuby.
So we tried to create a unified interface for NMatrix on both CRuby and JRuby, just like Nokogiri does it: you have a single gem and you can run it on both runtimes. We could have used MDArray and just built a wrapper around it for NMatrix. But this was not going to be a good idea, because MDArray used Parallel Colt as a dependency — just as I used Apache Commons Math — and Parallel Colt was deprecated at the time. Also, if I didn't build a wrapper around MDArray, every gem created against NMatrix would have to be reimplemented using MDArray. And you would have to put in more effort for optimization — for example, checking that the data is not getting copied around.

So how does NMatrix work? NMatrix can be categorized into two parts: n-dimensional and two-dimensional. This is how an NMatrix object looks. You have its shape — for example, when I created a three-by-three matrix, its shape is [3, 3]. Its dtype is :float64, or it can be anything like int or complex. The stype says whether you want to store it as a dense matrix or a sparse matrix. Then `s` is the storage itself, the elements that go into the matrix. And then you have `dim`, which says how many dimensions you have — for example, [3, 3] is a two-dimensional matrix.

Next, the NMatrix architecture. Basically, we have a Ruby front end. For MRI, we have a shared-object extension (.so) built using C and C++. This integrates with a native library written in C and C++, and that library is connected to the Fortran libraries, BLAS and LAPACK. So you have three layers, and the Fortran layer is what makes your computation go fast. For JRuby, we have just an extension .jar built on top of Apache Commons Math.
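The fields just listed — shape, dtype, stype, storage, and dim — can be pictured as a small Ruby struct. This is a hypothetical sketch of the object layout described on the slide, not NMatrix's real internals:

```ruby
# Hypothetical sketch of the NMatrix object described above,
# not the gem's actual implementation.
NMatrixSketch = Struct.new(
  :shape,  # e.g. [3, 3] for a 3x3 matrix
  :dtype,  # element type: :float64, :int32, :complex64, ...
  :stype,  # storage type: :dense or a sparse format
  :s,      # the storage itself: a flat, row-major array of elements
  :dim     # number of dimensions, e.g. 2 for shape [3, 3]
)

m = NMatrixSketch.new([3, 3], :float64, :dense, Array.new(9, 0.0), 2)

# The flat storage always holds exactly rows * cols elements:
p m.s.length == m.shape.reduce(:*)  # => true
```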
So, for the n-dimensional matrix, the major thing we had to build was element-wise operations — for example, add, subtract, sin, or gamma. In the first slide, where we added two matrices, that was just an element-wise operation: we iterate through the elements, access each element, do the operation, and return the result. The major thing that counts here is that your loops should be efficient — you iterate through the whole array, whether on CRuby or JRuby. In this slide I show that when I create a matrix n1 and add it to itself, n1 + n1, you get this matrix: 1 plus 1 is 2, 2 plus 2 is 4, and so on.

Next, what challenges did we face? First, auto-boxing and handling multiple data types; second, minimizing the copying of data. In JRuby, when we tried no strict typing — basically creating the data directly using Ruby code — we got an error: in this case we have two arrays, we add them, and if the values are smaller than one, you start getting zeros back. I think Charles can answer this question for me after the talk. And if I simply add some number greater than 1 to both of them and then subtract it again, I get the correct result.

Next is auto-boxing. As this project was in its initial stage, I dealt with :float64 only — that is, doubles — because otherwise we had to deal with reflection. For strictness, we created the data using Java types, and we couldn't rely on reflection for large data. For example, if you create the storage with `Array.new` versus `Java::double[rows * cols].new`, the second one will definitely be faster, because you know the size of the array up front and you can loop through it faster. So now we come to auto-boxing and enumerators.
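An element-wise unary operation such as sin works exactly as described above: iterate the flat storage once, apply the function to each element, and return a new storage of the same shape. A plain-Ruby sketch of that loop (illustrative, not the NMatrix code):

```ruby
# Sketch of an element-wise unary operation over flat, row-major
# storage: one tight loop that applies the math function to each
# element and returns a new storage of the same shape.
def elementwise(storage, &op)
  storage.map(&op)  # loop efficiency is what counts here
end

storage = [0.0, Math::PI / 2, Math::PI]
result = elementwise(storage) { |x| Math.sin(x) }

p result[0]  # => 0.0
p result[1]  # => 1.0
```

The same shape of loop serves add, subtract, sin, gamma, and every other element-wise operation; only the block changes.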
This is a classic case, because when we tried using NMatrix for JRuby on real data — for example, 5.5 GB of data that we get from a log — we hit an error. As you can see in the highlighted code, I created the array with `Array.new`, and what happens is that you start losing precision. If you implement the same thing using Java code, you don't lose precision.

Next is minimizing the copying of data. As we heard in the previous talk from Charles, JRuby takes more memory. Basically, we take the storage and build an enumerator, `each_with_index`: we have to get the elements selected by the indices, and if a block is passed, do some computation with them. So I convert the storage to an array, take the sliced indices, and push them into this array — and here we start losing values. For example, if we were factorizing a matrix, then every time you do it, a value of 0.02 or 0.03 becomes 0. And when you are trying to optimize something and have to run the same computation over this matrix again and again, it will never converge — you can't reach the point where the value settles, so you never reach a minimum.

So: minimize the copying of data. Whenever you build a JRuby application, make sure you don't make copies of the data, because JRuby already consumes twice the memory of CRuby. If you start making more copies of a large matrix, it can totally destroy your GC. For example, there was a computation where CRuby takes around 50 seconds, and when I was first building NMatrix for JRuby it took me around 1.5 hours. After I made sure I wasn't making copies of the data, I brought that time down to 40 seconds, which was better than CRuby. So: pass by reference, and create static methods as helpers. We will see an example of this later.
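The copy-minimization advice can be seen in miniature with plain Ruby arrays: `map` allocates a brand-new array on every call, while `map!` rewrites the existing storage in place, so a large matrix is not duplicated on each pass. A small sketch of the difference (illustrative only):

```ruby
# map vs map! on a large flat storage: the first allocates a new
# million-element array, the second mutates the existing one.
storage = Array.new(1_000_000) { |i| i * 0.5 }

copied = storage.map { |x| x * 2 }   # new array allocated: extra GC pressure

same_object = storage.object_id
storage.map! { |x| x * 2 }           # in place: same storage, no copy
p storage.object_id == same_object   # => true
```

In a real extension the same idea means operating on the backing array directly through static helper methods instead of round-tripping the storage through `to_a` on every operation.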
Next is the two-dimensional matrix. Now you have an n-dimensional matrix and you want to perform certain computations, like multiplying two matrices or factorizing one. Then we need to cast this n-dimensional matrix into a two-dimensional matrix. The basic operations here are `dot`, which is matrix multiplication; `det`, for calculating the determinant; and `factorize_lu`, for lower-upper factorization. In NMatrix for MRI we have BLAS Level 3 and LAPACK routines built in Fortran, but NMatrix for JRuby depends on Java functions from Apache Commons Math.

The challenges were as follows. You convert an n-dimensional matrix into a two-dimensional matrix. An n-dimensional matrix is actually stored as a 1-D array, because you can't simply store a multi-dimensional array — you can't lay it out in RAM properly otherwise. Now the array size matters: you have to access the elements with speed, and you have to take memory into consideration too. So we have an NMatrix and I need to get a 2-D matrix from it. For that I use helper functions: static methods like `getMatrixDouble` from a matrix helper class and `getArrayDouble` from an array helper class.

In this code I'm iterating over a matrix — basically a two-dimensional array — and I benchmark this Ruby code. You can see that in the first line of code we create a two-dimensional Java matrix of size 15,000 by 15,000, and I benchmark the code that initializes each value of this matrix to its index. This takes me around 39 seconds. Then I iterate over two arrays, copying the elements of one array into the other; this takes around 65.12 seconds and 5.4 GB of RAM. With such a large matrix your RAM is already consumed by that point, and you can't use it for any number-crunching program. Then I do the same thing in Java, and there the first case takes 0.031 seconds.
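The "stored as a 1-D array" point boils down to one indexing rule: in row-major layout, element (i, j) of a rows × cols matrix lives at flat index i * cols + j. A plain-Ruby sketch (illustrative, not the library's accessor):

```ruby
# Row-major layout: a 3x4 matrix flattened into a 12-element array,
# with element (i, j) at flat index i * cols + j.
rows, cols = 3, 4
storage = Array.new(rows * cols) { |k| k }  # 0..11, laid out row by row

def at(storage, cols, i, j)
  storage[i * cols + j]
end

p at(storage, cols, 0, 0)  # => 0   (top-left)
p at(storage, cols, 1, 2)  # => 6   (second row, third column)
p at(storage, cols, 2, 3)  # => 11  (bottom-right)
```

Getting this arithmetic right — and doing it without boxing every element — is exactly where the Ruby-loop versus Java-loop benchmark numbers above come from.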
With two arrays it takes 0.033 seconds, and the RAM consumed is 300 MB. Hence speed improved 1,000 times and memory improved 10 times. This is actually better than CRuby speed-wise, but memory-wise not as good: CRuby takes less memory, but its speed is slower.

[Audience] Also, 15,000 is not a very large matrix, right?

Well, 15,000 by 15,000, which is 225 million elements.

There's another gem, mixed_models, that relied on NMatrix, and simply by porting NMatrix I ported that library to JRuby too — so you can model your data on JRuby now.

Now we benchmark the NMatrix functionality. These are the system specifications: I had an octa-core CPU and 16 GB of RAM. In this graph, on the x-axis you can see the number of elements in the matrix — for example, at 5,000 by 5,000 we have 25 million elements — and on the y-axis the computation time, so the lower the computation time, the better the speed. This is a logarithmic scale. In this case, addition, NMatrix-JRuby is faster than NMatrix-MRI — around 40 times faster. For subtraction it's the same: 40 times faster. For gamma we compare three: NMatrix-MRI, NMatrix-JRuby, and NMatrix-MRI with LAPACK. You can see that NMatrix-JRuby is even faster than the Fortran code, and by a lot: 10 times faster than NMatrix-MRI with LAPACK, and 400 times faster than plain NMatrix-MRI. For matrix multiplication, NMatrix-JRuby loses to NMatrix-MRI, because we simply don't have Fortran libraries for this, and Java code can never beat Fortran code. Even though we have JITting, and the code improves after running many loops, it still doesn't get better than Fortran. Similarly, for determinants it's again 20 times slower, and for factorization the same, 20 times slower. So these are the benchmark conclusions.
NMatrix-JRuby is definitely faster for n-dimensional and two-dimensional matrices where element-wise operations are concerned, but NMatrix-MRI is faster for two-dimensional matrices when you need to calculate the dot product or the determinant, or to factorize. So how can we improve this? The solution is to implement the backend of NMatrix-JRuby with Fortran as well, this time relying on the Java Native Interface. There was an option to use another library called JBLAS, which is a JNI wrapper for the BLAS and LAPACK libraries, but when I tried to use it, it had a lot of bugs and it wasn't efficient enough. So this would be the final architecture of NMatrix-JRuby: we go from the left side of this diagram to the right side. When I tried a sample benchmark of this, it was actually faster than NMatrix-MRI, so maybe in two months, once I implement this part of the code, NMatrix-JRuby will be faster than NMatrix-MRI in all respects. Future work also includes implementing NMatrix for the complex dtype, adding sparse support, and converting the NMatrix-JRuby enumerators to Java code for better speed. So overall, whenever you do a computation with NMatrix-JRuby today, it is currently about as fast as NMatrix-CRuby, and sometimes faster, because calculations that rely on a single dimension are faster on JRuby — and obviously the JIT helps.

Am I done? No? Enter the GPU. After Google Summer of Code I wanted to go even faster for number crunching, and that's why I thought of implementing a general-purpose library for GPU computation. The aim of this project was to combine the beauty of Ruby with transparent GPU processing. And yes, this is to be tested both on client computers and on servers that use NVIDIA Teslas and Intel Xeons.
Before this, there was no Ruby project that was good for GPU computation: the most you could do was add two matrices on the GPU, and that only for a 1,000-by-1,000 matrix. But the ArrayFire binding, arrayfire-rb, that I developed can handle matrices of around 15,000 by 15,000 elements easily, and we got to test it on supercomputers. So what is ArrayFire, actually? ArrayFire is an open-source GPGPU library written in C++ that uses a JIT, and using a JIT makes it even faster than raw CUDA and OpenCL. So yes, the competitors PyCUDA and PyOpenCL would be slower than arrayfire-rb. ArrayFire also has bindings to Python, built using Cython, but that code is really not well tested: when I tried to do some computations with ArrayFire-Python on my system, most of the time the program hung. When I did the same thing with the Ruby binding I built for MRI, it worked properly and the speed was awesome.

ArrayFire-Ruby also lets you do GPU computing without writing kernels. Whenever you try GPU computing, most of the time you end up writing kernels, and this is where most people just give up on GPU computing. But in ArrayFire you have ready-made routines, and it can handle matrices of any size — it scales automatically. So how do you use arrayfire-rb? In the first line of code I create an ArrayFire matrix, two-dimensional with two rows and two columns, and I add two of these. Next I show how we wrapped a BLAS routine to multiply two matrices, and the third part of the code shows how to get the determinant of a matrix.

Let's see how arrayfire-rb is built for MRI. It's a C extension, and the architecture is inspired by NMatrix and NArray. ArrayFire is written in C++, so when we build this library we need to handle C++ code; these instructions show how to do it, and the most important point is to get rid of the name-mangling errors in C++.
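The BLAS-style matrix multiply that the slide wraps is, semantically, just a triple loop over flat, row-major storage; the GPU version performs the same arithmetic massively in parallel. A plain-Ruby sketch of that semantics (illustrative only — not the arrayfire-rb API or a real GEMM):

```ruby
# C = A * B for n x n matrices held as flat, row-major arrays.
# Loop order i-k-j lets us hoist A[i][k] out of the inner loop,
# the classic cache-friendly arrangement BLAS implementations use.
def matmul(a, b, n)
  c = Array.new(n * n, 0.0)
  n.times do |i|
    n.times do |k|
      aik = a[i * n + k]
      n.times do |j|
        c[i * n + j] += aik * b[k * n + j]
      end
    end
  end
  c
end

identity = [1.0, 0.0, 0.0, 1.0]
m = [1.0, 2.0, 3.0, 4.0]
p matmul(m, identity, 2)  # => [1.0, 2.0, 3.0, 4.0]
p matmul(m, m, 2)         # => [7.0, 10.0, 15.0, 22.0]
```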
For example, this is how I implemented matrix multiplication. I include ruby.h and create a data structure for an ArrayFire object. Then I bind it to the Ruby front end in the init function and create a function called `arf_matmul`. It's cast here because this code is actually in a .cpp file, while the Ruby front end expects C linkage. `arf_matmul` finally reaches the line of code where the multiplication happens. Here you can see the line that matters: a GPU computation runs in GPU RAM, so you need to get that memory back into CPU RAM, and this line does exactly that. When we integrate ArrayFire with Rails, this line will be very helpful. You may be wondering why you would want GPU computing on Rails — because you have data: you collect data through Active Record, and now you can just analyze it.

For JRuby, the approach is the same as NMatrix-JRuby. There is a Java Native Interface for ArrayFire already built, and I implemented some of the BLAS and LAPACK routines on top of it; this works on ArrayFire-Java. For example, the last example we saw, matrix multiplication in C code, can be implemented on JRuby with just this bit of code — and here you can see how awesome JRuby is.

Now we benchmark ArrayFire. The system specifications are as follows: the same octa-core processor with 16 GB of RAM, and the GPU is an NVIDIA GTX 750 Ti with 4 GB of RAM. In this chart — I think you can see it here — it is around 10^5 times faster than NMatrix-CRuby and 10^4 times faster than NMatrix-JRuby. And again ArrayFire is faster even here: around 10^6 times faster. For the matrix determinant it's around 100 times faster, and even for factorization ArrayFire is 100 times faster. So, transparency. Currently, when you look at ArrayFire on GitHub, you can see the code written only for MRI; for JRuby I have already done the groundwork, it just needs to land in the repo now.
[Audience] Can you go back to those figures? I think you went a bit too fast. It looks like more than 100 times faster to me.

Okay — maybe 1,000 times faster. So, transparency. We need to integrate it with NArray and NMatrix, and with Rails. Where I showed you that you just copy the memory from GPU to CPU, you can create some helper methods for that — and then, yes, Rails is on the GPU, and similarly NArray and NMatrix. What are the applications? ArrayFire has endless applications. You can use it for bioinformatics. You can integrate it with TensorFlow when that's ready — you could create a GPU cluster using ArrayFire and integrate it with TensorFlow. You can use it for image processing, or for computational fluid dynamics. So, hence: beep-beep.

These are the useful links: you can find NMatrix there, and arrayfire-rb — the latest code can be found on the arrayfire-rb temp branch. I would like to acknowledge my mentors, Pjotr Prins, Charles Nutter, and John Woods, who have been very helpful to me in creating both of these libraries, ArrayFire and NMatrix; Alexej Gossmann, who developed the mixed_models gem, and the developer of Daru; and Pradeep Garigipati from the ArrayFire team, who is also mentoring me on this project. I would also like to thank the Emerging Technology Trust, the organizer of RubyConf India, which has sponsored my travel to FOSDEM. Thank you.

[Host] Thank you very much. We have time for questions.

[Audience] Did you consider other libraries like Boost.SIMD or Vc for the parallelism using the hardware?

Hardware for what — NMatrix or ArrayFire?

[Audience] For sending more than one value to the processor, for the computation to be done faster.

I see — your question is basically about sending values to more than one core for processing, right? For NMatrix this is done by Fortran, and Fortran is already optimized for this: it lays out memory in a certain way so that it exploits parallelism. So yes, it's done that way.
And no Java code can beat that code. Okay, next.

[Audience] Actually, Fortran doesn't use MMX or other processor instructions created after the Fortran language was defined — the Fortran libraries are really old in that respect. There are now libraries in C++ that are able to do that, because they know how the processor is built. The names of those libraries are Vc and Boost.SIMD, if you want to take a look at them later. And there is also one library that I helped write for matrix processing, named Eigen3. That's really fast and I think it's...

Eigen3 is for R, right?

[Audience] No, Eigen3 is C++. But R uses it a lot — Rcpp and RcppEigen.

Yeah. I think that's still slower than LAPACK and BLAS. Okay, next question.

[Charles Nutter] Great work on this stuff. Just a request: anything that looks like a bug, or where you've got problems compared to MRI — even if you're not sure, just go ahead and file it. We'll be able to tell you right away if there's a problem. Like that matrix thing.

Yeah, we are poking around at it. Basically the problem is that even I don't understand whether the problem is with JRuby or with the JVM.

[Charles] It's probably us. I was just assuming it's us. But you added two arrays — normally that plus would just concatenate them. Do you override it?

No, I didn't concatenate them, I added the elements. For example, one array is [2, 2] and the other is [2, 3], so we get [4, 5] as the final array.

[Charles] And it's just not adding right. It looks like it's coercing to a Fixnum.

Yeah. And actually it can't be reproduced reliably: maybe 4 out of 10 times you get a 0, and otherwise it works fine.

[Audience] What do you see as the integration of this with Rails? What kind of integration do you think you can do there?

So basically, if you have some data — maybe you can try it with Active Record.
I consulted some Rails developers at RubyConf India, and their input was that we can do it with Rails, and that it would be widely used.

[Host] Other questions? Thank you, Prasun. ... Thank you.