Good morning, everyone. My name is Rodrigo Urubatan, and I will try to be fast so everyone can go soon. If anyone wants to download these slides to follow along while I'm talking, just scan this QR code. We'll talk about data science in Ruby: whether we can do it, whether it's fast, whether we should use it, and when to use it. Is anyone here who works with data science? Data scientists, data engineers, anyone writing an application that uses data? I thought this time I would see some hands up.

Let's start by defining what data science is. This is a tricky question, and every company has a different definition. Some people say that data science is the process of extracting meaning from and interpreting data. Others define it as the use of statistics and machine learning to clean and manipulate data. There are also people who say that any use of computer software to collect, clean, and manipulate data is data science. It's also a cool name for the combination of data mining and business intelligence, other buzzwords that have been around for at least 20 years, probably more, but back then they were used to sell more expensive tools with a cheaper workforce. I prefer now, when the tools are cheaper and the professionals are paid more.

So, can Ruby do data science? The quick answer is yes. We have libraries for everything that's needed for data science, but we'll need to dive a little deeper into that, because as with any software-related question, the best answer is "it depends". In Ruby we have libraries to integrate with other tools like R and Python. We have data manipulation libraries, gems for distributed computing, and gems for data structures like Daru. We'll see samples for each of these groups later. We have some ready-made data sets, like the Iris data set that has been used everywhere. We have statistics libraries, data visualization libraries, and libraries for interactive computing. Some of these things are really good, and some are not that good.
For example, some gems work together and some don't. There are gems that do the same thing, where one is really fast and the other is not. We need to define exactly what we want to do and how to integrate things, and sometimes that's a tricky problem.

Interactive computing means, for example, using a Jupyter Notebook, where you can write a code sample, execute it, update it, use the result below, and show it to anyone who might be interested, or use it to evaluate data, clean it up, test again, and update your code. It's really useful for any data-science-related code, and also for teaching; I started using Jupyter Notebook for some small Ruby classes, too.

We have libraries to integrate Ruby with Python, for example. I found out about PyCall, which was created by Kenta Murata, if I'm not wrong. It allows you to write code in Ruby using Python modules as if they were ordinary Ruby libraries. It's really good and really fast, and it solved a lot of problems for me. There are similar libraries for R, but I haven't tested those yet.

We have libraries for data manipulation. The Kiba library helped me a lot in a project where I had to collect data from five different sources, including databases, files, and the web. It helped me do that in a clean and declarative way, which also made it easier to find bugs in that integration. And Jungler is a similar tool for coordinating many ETL jobs at once.

We have tools for distributed computing, but here things start to get ugly: Apache Spark is a really great project, but the integration libraries for both Ruby and JRuby haven't seen a commit in the last three years. If you are already using them, great; if you are starting something new, they are probably not the way to go, because there are open bugs and nobody maintaining the projects.

We have libraries for many different data structures. For example, Daru, the implementation of data frames in Ruby.
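To make the "clean and declarative" ETL style concrete, here is a minimal sketch of the source, transform, destination pipeline shape that gems like Kiba popularized. This is plain Ruby, not Kiba's actual API; all class and method names here are illustrative.

```ruby
# A source yields rows one at a time (hypothetical class, not from Kiba).
class ArraySource
  def initialize(rows)
    @rows = rows
  end

  def each(&block)
    @rows.each(&block)
  end
end

# A destination receives finished rows (also hypothetical).
class MemoryDestination
  attr_reader :rows

  def initialize
    @rows = []
  end

  def write(row)
    @rows << row
  end
end

# Run each row through every transform before writing it out.
def run_pipeline(source, transforms, destination)
  source.each do |row|
    row = transforms.reduce(row) { |r, t| t.call(r) }
    destination.write(row) if row
  end
  destination
end

source = ArraySource.new([{ name: "Ada ", age: "36" }, { name: "Grace", age: "45" }])
transforms = [
  ->(row) { row.merge(name: row[:name].strip) },  # clean up whitespace
  ->(row) { row.merge(age: Integer(row[:age])) }  # cast strings to integers
]
dest = run_pipeline(source, transforms, MemoryDestination.new)
p dest.rows
```

The value of the declarative style is that each transform is a small, independently testable step, which is what made the five-source integration above easier to debug.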
Data frames are, I think, the most important data format for data science. They allow you to clean and manipulate data very easily. We have Numo::NArray and NMatrix, which are very similar libraries. NArray is really fast but doesn't work very well with Daru. NMatrix has some performance issues; there has been a bug open in its backlog for the last two years about a performance problem, but it works really well with Daru.

We have some ready-made data sets in the Rdatasets project, with many data sets collected from the R language, and we have red-datasets, with a growing collection of data sets available to use in samples and in any application.

We have many libraries for statistics. None of these lists is complete; the ones shown are the more important libraries, the ones I have tested. We have Rb-GSL, the interface for the GNU Scientific Library, which I think is one of the most used scientific libraries in the world. And I really like enumerable-statistics, which was also written by Kenta Murata. It's a really fast way to do simple calculations on any enumerable, like an array or an Active Record result.

We have lots of libraries for data visualization. I need to add Charty to this list now. We have an implementation of the Matplotlib interface. Mathematical is really cool for rendering mathematical notation. We have daru-view, which was my go-to visualization library because it's integrated with Daru, and you can use it to render results in a Jupyter Notebook or in any web application; it's really easy to use. And daru-plotly is really well integrated with Daru too, but I have only used it in Jupyter Notebooks.

So what's the current state of data science in Ruby? We have a big advantage and a big problem at the same time. In Python, all the data science effort is centered around the SciPy project. In Ruby, we have three different projects with three different approaches.
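To show what enumerable-statistics gives you, here is a pure-Ruby sketch of the kind of interface it adds to Enumerable. The gem implements these methods in C, so it is much faster than this; the module and the sample-variance choice below are my own illustration, not the gem's code.

```ruby
# A pure-Ruby sketch of statistics methods on any Enumerable, in the spirit
# of the enumerable-statistics gem (which does this in fast C code).
module EnumerableStats
  def mean
    sum / count.to_f
  end

  def variance
    m = mean
    # Sample variance (divide by n - 1); chosen here for illustration.
    sum { |x| (x - m)**2 } / (count - 1).to_f
  end
end

Array.include(EnumerableStats)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
puts data.mean     # => 5.0
puts data.variance
```

With the real gem you get the same one-liner ergonomics on arrays or Active Record results, without the pure-Ruby overhead.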
Most of the data science tools are under the SciRuby project, which has many NMatrix-centric gems; it includes the Daru project for data frames, GnuplotRB, Statsample, and many other libraries. We have the Ruby Numo project, with NArray-centric gems and lots of parallel implementations: Numo::Gnuplot in Numo versus GnuplotRB in SciRuby, and NArray versus NMatrix. We have a small number of developers working on data science in Ruby, and the effort is divided.

Because of that, the Red Data Tools project was created. Its idea is to be interoperable with the other two. Its gems are built with Apache Arrow as the back end, which I think is a good idea because the next version of pandas will also use it as a back end. It has red-arrow-nmatrix to read NMatrix data and red-arrow-numo-narray to read Numo::NArray data, so today you can use it to interface the data model of one project with the other.

The big problems I see here are that, for example, NMatrix has had this performance problem for at least two years, and NArray is way faster, but you can't use it with Daru. So if you are doing data science in Ruby, you should probably go with the SciRuby project, but if you are doing just scientific computing, the Ruby Numo project is probably best for you. The Red Data Tools project is a fairly new project, and it has a small problem right now: it doesn't have data manipulation libraries, just input and output. That will probably be solved with time, but that's the current state.

So doing data science in Ruby is hard. We have the tools; some work really well, others not so well, and we lack documentation. There are not many developers because there are not many users, and there are not many users because there are not many developers.

I think everyone agrees that the current crown jewel of data science is Python, and we have mostly the same functionality in Ruby, too.
In Python, when you start learning data science, the first two libraries you learn are pandas and NumPy. In Ruby, we have Daru and NMatrix to do mostly the same things. They have almost all the same features, though not as much documentation as the Python versions; still, the SciRuby project has a decent amount of documentation.

I decided to do some testing, and I created a Jupyter notebook to simply sum a big list of random numbers. I started with 50,000, and that was enough for the results I was looking for. The results will not be exactly the same with different sizes or different operations, and I think this is probably close to the worst-case scenario. I did mostly the same operation with NMatrix, and the Ruby version took 0.3 seconds to run, while the Python version took 0.003 seconds. So just summing 50,000 numbers with NMatrix is 100 times slower than with NumPy.

Then I did some tests with pandas: I created a simple data frame of the same size, 50,000 rows of random data, took the median of the first column, and it took 708 microseconds. I did mostly the same operation with Daru, which took me more code, probably because I'm not really good with Daru and I didn't find enough documentation to clean the code up. But the operation was simple enough to measure, and the result is not as bad as the first sample: now it's just 57 times slower to do the same operation in Ruby.

So it's possible to do the same operations, but I think that code that runs 57 times slower can be a problem in production. Sometimes it's not: depending on the amount of data you are processing and how often you run it, whether something takes a fraction of a second or 30 seconds once a day is not a big problem. But if you are running it many times, this performance difference can be a big problem for you. So I think that Ruby and Ruby on Rails are way better for writing business web applications.
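If you want to reproduce the shape of this measurement yourself, a rough version can be done with nothing but the standard library. Plain Ruby Arrays stand in for NMatrix here, so the absolute numbers will differ from the slides; the point is the methodology, not the exact figures.

```ruby
require 'benchmark'

# Time a sum over 50,000 random numbers, averaged over many iterations,
# roughly mirroring the Jupyter-notebook timing from the talk.
N = 50_000
ITERATIONS = 100
data = Array.new(N) { rand }

total = Benchmark.realtime do
  ITERATIONS.times { data.sum }
end

puts format('%.6f seconds per sum', total / ITERATIONS)
```

Averaging over many iterations matters because a single run of a sub-millisecond operation is dominated by timer noise, which is also why notebook tools like `%timeit` repeat the measurement.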
At least, I like to write web applications in Ruby; I do it all day. And we can even do some really good machine learning with Ruby, but that's a subject for another presentation. Last year I defined my objective as helping Ruby developers use the best tool for each job, so they can solve hard problems with fewer bugs and have more free time, because I like having lots of free time. That was when I found PyCall.

PyCall is a really cool library that wraps libpython. I use it mainly to call Python modules from my web applications, so that most of my business code stays in Ruby while most of the data science code is in Python, which I found to be faster and better documented, and that helps me a lot. My slides have a code sample of Ruby code using NumPy, for example, but you can use any Python module: you just use PyCall's import and keep writing code as if it were Ruby. It works really well.

I'm not telling you to write all your data science code in Ruby on top of the Python libraries; that is probably not a good idea. I haven't measured it, but any interface code needs to transform data from one runtime to the other, and that has a cost. My opinion today is that Python is way better than Ruby for data science, and Ruby is better for web and business applications. I have lots of things in Ruby that are not web-related, too.

I tried three different integration patterns. The first thing I tried was pointing both applications, the Ruby one and the Python one, at the same database. It works, but writing the database layer twice is not ideal. Then I tried exchanging data through JSON, but that is slow sometimes. My preferred option today is to write the data science code in Python, wrap it in a Python module, and call it from the Ruby web application, where I just pass the data from the same Active Record models using PyCall, get the result, and render it in the reports for the business end.
I think this is the most important slide in the presentation. It has pointers to really good sources. The first one is Kenta Murata's RubyConf 2017 presentation, where I found out about PyCall. The second is about big data analysis in Ruby. The third one, from Dan Carpenter, is really good too. Of the last three, the Ruby machine learning resources link is a really good and complete list of machine learning sources, and the Ruby data science resources link is a compiled list of many different documents, libraries, and presentations on how to do data science in Ruby. The last one is the link to the PyCall project.

So that was it. If anyone has any questions, I can try to answer them. If you want to talk to me later, you can scan this QR code with my contact information. Thank you for your attention.