Hi everyone, so the next topic is "aRmed and Loaded": using R on OpenShift. Thank you, and over to you, Ricardo.

Alright, thank you. Good afternoon, everyone. I'm glad to be here at DevConf. I had the opportunity to attend the last two DevConfs in Brno, and I think it's a very good conference to share experiences with developers from other areas, so I'm very glad to be here. Speaking of experience, what I'm going to show is a bit of my experience with data visualization and data manipulation, and a bit of my personal initiative to bring R to OpenShift. So let's get started.

Well, I think most of you know what the R language is, but for those who don't: believe it or not, R is a 25-year-old language, so it's pretty old, and its main uses are data manipulation, calculation, and graphics display. So it's very useful. Now that everyone is talking about big data and machine learning, we're talking about using R for statistics, data exploration, data visualization, and machine learning.

I personally love the language, but why would someone use R for data manipulation? Well, R is a very useful language not only for software developers; people who are not software developers can use it too. It's a language with a very easy syntax, and on top of that, there are many libraries covering many, many applications, which you can download through CRAN. CRAN stands for the Comprehensive R Archive Network. Personally, I'm still trying to find the "comprehensive" part of CRAN, but that's just my opinion. However, be careful when you try to load a huge amount of data, because R does its calculations in memory, so it's not recommended if you have a huge data set to manipulate.

Alright, just to show a little bit of how easy R is for data manipulation, I'll start an R session here. This is the R prompt. It's like a bash shell; you can run commands in here. I'll start by reading a CSV file, and as you can see, I have tab completion not only for commands but for paths too. I'll use the US elections CSV, and since I want to work with this data, I'll assign it to a variable. Now I can check the first six lines of the data, and the last six too. With the summary command, I can take a quick look at my data. For example, total votes 2008 is a numeric column, so I get some measures of the distribution: minimum, maximum, quartiles, and median. For a text column, like county, I get the frequency of each value in the data set. As I said, R is built for statistics, so I can also calculate the standard deviation of, for example, total votes 2008: it's 52,420.69. And what else? I can also plot the data. Let me plot the distribution of total votes for 2008, and I get a very simple graph showing the spread of my data. A sketch of that kind of session is below.
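For reference, a minimal sketch of that kind of exploratory session, with a hypothetical file name and column name standing in for the ones used in the demo:

```r
# Load the election data (file and column names are illustrative)
elections <- read.csv("us-elections.csv")

head(elections)     # first six rows
tail(elections)     # last six rows
summary(elections)  # min/max, quartiles, median for numeric columns,
                    # value frequencies for factor columns

sd(elections$total_votes_2008)    # standard deviation of one column
plot(elections$total_votes_2008)  # quick plot of the distribution
```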
Alright, so what about companies? How many companies are using R today? Well, there's Facebook, for behavior analysis; they use it a lot for things related to status updates and profile pictures. Google uses it for advertising, Twitter for data visualization. Microsoft and IBM are part of the R Consortium, a group that supports the R Foundation. Uber uses it for statistical analysis, Airbnb, ANZ, and so on. The source for this list of companies is shown below. And what else do we have for R?

Well, as I said, R has an extensive repository of libraries, and one of the things R is very good at is creating quick visualizations around your data. In particular, there's a special library called Shiny, which is very useful for building dashboards in a web application style. Let me show the Shiny web page. You can create dashboards from your data, build visualizations, add some interactions using Shiny's standard components, and you can also change the style using CSS, themes, and so on. So R for dashboards is very good; in my opinion, it's a very good option for making quick visualizations and publishing them in a web application style.

All right, so now we know what R is. Let's talk a little bit about cloud and containers. Much of the motivation for using cloud computing and containers today comes from what NIST defined as the five essential characteristics of cloud computing: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. Everything we didn't have in a traditional data center architecture we have in the cloud, and that is the biggest motivation behind the technology itself.

And why did I choose OpenShift to run R? Well, I've been using OpenShift since the alpha version, maybe four and a half years ago. One of the things that makes OpenShift a very good platform, in my opinion, is that it's a developer-friendly environment. With just a few clicks, you can publish your application and get a DNS name to access it externally. You can also set up pipelines, so you can split your workflow into stages where you test your application and then deploy it to production. And now with OpenShift 3, you have all the capabilities to build your applications in a microservices style.

All right, so mixing cloud computing and R gives us the R image, which is published and available through the radanalytics.io project. It's a project aimed at bringing machine learning and big data capabilities to OpenShift, with Apache Spark as its central component. Spark is one of the biggest projects in the big data area, a very useful tool for large-scale computing over your data, and R is among the languages supported by Spark, along with Java, Scala, and Python.

OpenShift also offers the S2I-style workflow to build your application. For those who don't know what S2I is, it's the standard workflow in OpenShift for creating your own application image by just providing the base image and the source code in a Git repository. And the R image also has its own dependency management. Although we have the CRAN repository, which hosts all the libraries for R, we have the problem that there's no standard way to provide metadata that simply declares your application's dependencies. This was one of the challenges with the R image, so I needed to create my own dependency management mechanism using a kind of metadata file, a very simple text file. A sketch of what such a file might look like is below.
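As a sketch, such a dependency file is just one CRAN package name per line; the exact file format and these package names are illustrative, not necessarily what the image expects:

```
shiny
dplyr
leaflet
```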
So, let me show a quick demo with the R image. This is the repository where I keep my Shiny application, and the thing I'd like to bring some attention to is the dot-R-libraries file. What does that file mean? Well, this is the file I mentioned on the last slide, the one that makes the dependency management mechanism work in the R image. As you can see, it's just a very simple file, with each line naming one of the libraries required to build your R application. And this is honestly the only capability I had to create to make S2I work for R on OpenShift.

We also have the main entry point, app.R, which calls the server and the UI, the back end and front end of a Shiny application; a minimal sketch of such a file appears at the end of this walkthrough. I'm using a custom CSS and a JavaScript file. And in this example, I keep my data files inside the repository, but I could also use external storage to load the data. Well, I think that's basically what I have in my repository.

So, going back to OpenShift, this is my project with the US elections data, which contains the number of voters in each county for the last three elections. All right, so this is the visualization. One second. Okay. I created a GeoJSON file with all the county boundaries and then merged the US elections data into the GeoJSON file. So what you see here is a map of the US divided by counties, where the color indicates the number of voters in each county. Many areas have the same color; the problem with this data set is that most of the data is below 400,000 voters, so most counties end up with the same color because of that. But let me check another one, like here. Nope. There are some counties here with more than 10,000. Maybe here. Yeah, these have just a very small number of voters. As you can see, if you point to a county, you can see the name of the county and the number of voters in that county specifically.

I also created a very simple page to explore the data. There's a table here with the state name, county name, the party that received the votes, and the number of votes in each election, so for 2008, 2012, and 2016. I used Shiny components to create a very simple filter, so I can look at the state of Massachusetts and Essex County, and I get only the data related to that state and county. You can also use the other filter here, for example "Essex", and it shows all the occurrences in the data.

Okay, lastly, I have this other visualization, which is a bar graph with the votes by state in the 2008 elections, but I can also choose 2016, so the data changes as I change the parameters. And I have a checkbox to color the votes by party; that's what the colors do, and there's the legend of the graph here.

All right. So this is a very simple dashboard I created with the US elections data, just a demonstration of the capabilities of the R image inside OpenShift. And I know there are lots of improvements I still need to make. For example, I'm going to prepare more base S2I images to support machine learning features. Maybe I need to do some more research into things like GPU scheduling, and I want to add full Apache Spark support: there's support for sparklyr, but I'd like to add SparkR as well. I also need to improve build times, because the CRAN repository only stores the source code for each library, and the build process basically downloads the sources and compiles everything inside your container image. That makes builds slow; the US elections application, for example, takes about 16 minutes. So I'll try to come up with a better way to improve the build times. Also, right now I only have the image streams for R, so I'm going to create some templates to make it faster to create applications using R, and I'll try to find a better way to handle the dependency management.
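For reference, the app.R entry point mentioned in the walkthrough, a UI definition plus a server function in one file, might look roughly like this; the widgets, the elections data frame, and the column names are illustrative, not taken from the actual demo:

```r
library(shiny)

elections <- read.csv("us-elections.csv")  # hypothetical data file

# Front end: a year selector and a plot area
ui <- fluidPage(
  selectInput("year", "Election year", choices = c(2008, 2012, 2016)),
  plotOutput("votes")
)

# Back end: re-render the plot whenever the selected year changes
server <- function(input, output) {
  output$votes <- renderPlot({
    column <- paste0("total_votes_", input$year)
    hist(elections[[column]],
         main = paste("Total votes,", input$year),
         xlab = "Votes per county")
  })
}

shinyApp(ui = ui, server = server)
```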
So, that's all I have so far. Before I finish the talk, I'd like to thank everyone who attended. It's my first time talking about R, so I was a bit nervous. If anyone has any questions, feel free to ask.

Thanks, Ricardo, that was really nice. When you were showing the REPL, the R interface that you had... The R prompt, you mean? Yeah, the R prompt. One of the commands you ran was summary, and it showed statistics about your data source. In that summary, can you actually index each of those pieces of information to pull it out programmatically, if you wanted some of that information? I don't think I follow your question. Well, could I run the summary command and then say I want the top entry for county name, or would I have to run a different query to do that? Would I ever use summary in my application? Let me see if I understood, because you're a little far from the microphone and I'm not hearing you well. So, would you use the summary command inside an application to get the summary data? Yes, would you use it in an application?

Oh, okay, right. Well, to be honest, summary and some of the other commands I ran here are just commands for data exploration. You have other commands for building applications, but the idea behind these base commands is to get a very quick insight into your data. The summary command is a base command to find out what your data is and how it looks; since R is mostly used for statistical calculations, summary brings up all the basic statistics around your data. Here I'm running it on the whole data set, but I can also choose a specific column, whichever is better. For that, I'll use a special library called dplyr; it's a kind of advanced data manipulation tool. You know how in a shell you have the pipe, where you take the output of one command and pass it as the input of the next one? We have the same thing in dplyr, but it's not the pipe character; it's this strange symbol, %>%, which they also call the pipe. So with the elections data, I can filter by total votes in 2008 below 4,000, and we have the data. Let me do one more thing. Sorry. With the output of that command, I select only the total votes in 2008, and with the output of that, I can draw a box plot. And there it is. If I used the same command on the raw data, the box plot would be very difficult to read because of the outliers. So what I did was filter the data to get the rows below 4,000, then select only the column I'd like to visualize, and then call the box plot, all in a single pipeline, roughly the one sketched below. All of these commands are mostly for data exploration, and Shiny is what's useful for turning the visualizations into something you can publish as a web page. Does that answer your question? Yeah, it does. Thank you. Cool.
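For reference, that filter, select, and box plot pipeline looks roughly like the following; it assumes the elections data frame and column name from the earlier sketches:

```r
library(dplyr)

elections %>%
  filter(total_votes_2008 < 4000) %>%  # drop the extreme outliers
  select(total_votes_2008) %>%         # keep only the column to plot
  boxplot()                            # one box for the remaining values
```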
Hi. Can you hear me? A little bit. How about now? So, when you started by saying this tool is easy, that you don't have to be a developer to learn it, you were talking to me. So, a few questions. You used a comma-separated CSV file. Does it have to be formatted in a specific way for R, or is it just any CSV file? Number one.

No, it can be any CSV file. Actually, CSV is just one of the formats I could use. I was handling another data set, atmospheric data, which comes in a special format called NetCDF. It's binary data, so it's very hard to handle in standard text editors, but there's a library in R called RNetCDF that can read NetCDF data, and I created a function to read the NetCDF data and export it to CSV. But it can be any CSV: you can set the separators, and you don't need to have the first line as the column names; you just specify whether or not to read the first line as the column names. You can also have row names, but you need to specify whether or not to read the first column as the row names. So there are lots of options for reading CSV data, and not only CSV data, but every other format that can contain data.

I'm intrigued by this, because I just had a project with CSV files all over the place, and I had to use grep, sed, and awk to pull out and print what I needed. The R language, as you presented it, would have been great if I had known it two weeks ago. So for a newbie, how much effort would you say it takes to get to a point where you can create reports or something like what you showed today? Thank you.

Well, let me close that R session and start it again. In that last paragraph of the startup message, you can see two very helpful commands. There's the demo command, which lists a set of very simple example use cases for R, and there's help, for the online help. When you use the help command, R opens a very simple HTTP server and opens your web browser with all the online documentation for R. There are also some very good beginner R tutorials. So if I were beginning to learn R, I would start with one of those two commands.

All right. So thank you, everyone. Thanks, Ricardo. Thank you, everyone, for attending today, and we hope to see you tomorrow.