So thanks to Yiling for giving a very high-level introduction to the different types of charts. I'm continuing with some hands-on visualization in R. It will be quite fundamental, and it's good if you have the environment set up. So how many people have pulled the Docker image? Okay, how many people have RStudio and the packages installed? Okay, cool. All right.

Today I'm going to focus on ggplot2. ggplot2 is a popular visualization package in R, and the concept behind it is that you define your base plot and then add layers on top of it. The common layers are: first you have your data, and you map it from the data to your plot. After that you can add a geometric object, which is the type of chart you want to plot, then a scale transformation, and then some statistical summary. For today's workshop I'm going to use a collection of packages called tidyverse. It has ggplot2 in it and a bunch of other very convenient tools like dplyr, tibble, readr and tidyr. R has its base packages, and tidyverse is one level on top of them. It gives a lot of convenience, but today we are not going to focus on that. We just use it for plotting with ggplot2. The data set I'm going to use is a TMDB data set from Kaggle, which has information on about 5,000 movies. Cool.

So let's jump to the R visualization hands-on. I have a script here. It's in a Git repository, and the URL is this. If you are using RStudio natively on your machine, you can just pull this repo. It's an R project, so you can just open the project file. And if you are using the Docker image, how many people managed to get this web app running, RStudio on the web? Can you share the password? Yes, sorry. The password and username are both rstudio, all lowercase. So you pull the Docker image, run the app, then just go to port 8787. Sure. And if you are using the Docker image, you can pull this repo from inside the app.
There's a terminal. I'll show you later. Yeah, so there's a Terminal tab here, and you just do a git clone in it. If you go to the Terminal tab: this is the console, the R console, and this is the terminal. You just do a git clone, blah, blah, and then it will be there. So all good. All right.

Let's move to the code. If you check the files, if you have downloaded the source code, and go to this folder, R visualization, WW code, there is an R project file. Just click on that. Okay, I already opened it, but if you haven't, click on that, and there should be a window asking whether you want to open the project. Just say yes, and that's all set. This is the username and password for the Docker image. Not that Docker image, the RStudio running in the Docker image. Let's see the other one. Okay. Okay, cool. Okay, let's move on.

I'm just going to give a simple explanation of what's in this project and how we're going to continue with the workshop. There is a data folder, and it has two data sets. One is the TMDB data set, which we are going to use for ggplot2. We are also going to touch a little bit on one interactive package called plotly, which allows user interaction. The other data set is Game of Thrones. We are going to analyze the network of the characters in the book. I have two files here. One is IMDB.R. Well, I was going to use IMDB data, but I changed to TMDB data and the name hasn't changed, sorry about that. This has the implementations of the exercises that we are going to go through. The other is the exercise file. It has blanks here and there. If you like, you can follow me and type it yourself. We are going to plot some simple charts, like scatter plots and box plots, to analyze the relationship between the rating, runtime, and release date of the different movies in the data set. After that, we are going to separate the movies into different genres and see what we can find there.
After that, if we still have time, we can plot the network for the Game of Thrones characters. How many people have used R and RStudio before? We are going through this file line by line. I am sure you know the shortcut for executing the current line of code is Command+Enter. I am just going to import the libraries that we are going to use later. If you can import all these libraries, it means that you have installed the packages properly.

Sorry, tidyverse is not available. I installed this in the Docker container from GitHub. I tried the previous version but it is not compatible with 3.4. Besides that, I was able to do everything. It is not available now. So you are installing using devtools and you can't do it? It is not available on CRAN. You have to install it like this? I tried it, it doesn't work now. This doesn't work. I am using 3.4.2, I don't know whether that makes a difference.

Can everyone import the libraries? I am using tidyverse as well as dplyr, and there are some overlapping things. You may see some messages saying that filter and some other functions are already defined in another library. It is just a warning, it is fine. It is a conflict between tidyverse and dplyr, because dplyr is part of tidyverse. The order actually matters. You need to import tidyverse first. I want to use some of the dplyr functions, so I import it after that. For this too? For this too. The others are individual packages, so it is fine.

Because I already have the data set in the data folder, I am going to read all the movies into an object. If you run it, you will see a warning message saying there are some problems reading the data. It is because the data is not clean and has some null values in it. We are going to clean it up a little bit later. It is just a warning, you don't need to worry about it. If you type colnames, it will give you all the names of the columns in the movies object. You should be able to see them.
If you type colnames, you should see budget, genres, homepage and so on. I want to use the rating data from the users. Okay, everything is all right. If you can't... Well, you probably can't. Where is it? Let me find it. tidyjson is not part of the standard packages. You have to install it from the GitHub repo. Before that, you have to install devtools. If you have devtools, you can call install_github and give this name to it. You should be able to do that.

I am just going to rename the vote average column to rating, because it is easier to type and rating is the more common thing to call it. In line 8, I rename the column. Then I call colnames again to see whether the change actually happened. Yes, here it is actually called rating. Then there is a very small change to the data set. If you execute this line, it removes the rows with a null release date. If you use head, you can see the first... tidyjson? It is because the data set gives the genre as a JSON object. I need to gather it and transform it into a longer table. I am using it for data transformation. It is not very relevant to the visualization. It is just a transformation so that we can use the genre properly. The data set actually provides several fields in JSON format. One movie can have different genres. That is JSON, as well as the keywords and directors, if I remember correctly.

The filter is from the dplyr library. This percent, greater-than, percent sign is a pipe. It pipes the output of the thing on the left in as the first argument of the next function. So it is actually saying: filter movies, and if the release date column of movies is not null, keep it. It is filtering out the observations whose release date is null, not available. head is just to check the first several observations in your table. dim is dimension. You can see the dimensions of movies.
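The renaming and filtering steps described here might look roughly like the sketch below. The tiny `movies` table is made up for illustration (the workshop reads the real TMDB file instead), and the column names `vote_average` and `release_date` are assumed from how the talk refers to them.

```r
library(dplyr)

# Made-up stand-in for the TMDB table; the real data has about 20 columns.
movies <- data.frame(
  title        = c("Movie A", "Movie B", "Movie C"),
  vote_average = c(7.2, 6.1, 5.5),
  release_date = as.Date(c("2009-12-10", NA, "2015-10-26"))
)

movies <- movies %>%
  rename(rating = vote_average) %>%   # rename() takes new_name = old_name
  filter(!is.na(release_date))        # keep only rows that have a release date

colnames(movies)  # check that the column is now called "rating"
head(movies)      # first few observations
dim(movies)       # number of rows and columns
```

The pipe `%>%` passes the table on its left as the first argument of the function on its right, so `movies %>% filter(...)` is the same call as `filter(movies, ...)`.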
It has 4802 observations with 20 columns for each record. This is the first chart that we are going to plot together. If you remember the command for checking the column names of movies, let me just run it. colnames, movies: it shows you what columns are available.

Now I am going to plot the relation between rating and runtime. I already imported the tidyverse library, which has ggplot2, so we can just use it freely. If I say ggplot, it will ask me for some arguments. You can say data equals something, or you can leave out the argument name. Here I am going to use the full name: data equals movies. Then I am going to define x and y. This is the mapping-your-data-to-the-chart part, if you remember from the slides. aes stands for aesthetics. If you give it aes and say x equals runtime and y equals rating, those are column names in movies. Because you have defined the data here, you don't have to use the R syntax of movies, dollar sign. You can still use it, but you don't have to. It's the same.

This is the base plot of ggplot. If you just type this and execute this line, nothing will happen. You can see the chart area here, but nothing shows up, because you haven't defined what kind of plot you want to add to the base plot. This is another layer that we are going to add to it. The syntax is plus, geom. I'm going to plot a scatter chart, so it's point. If you run the command again, something shows up. If you click on Zoom, it gives you a bigger window to see what's going on there.

As I said, this is ggplot2, and it's not very interactive. You can see the plot, but you don't know exactly what's happening. There is a very simple and sweet way to convert a ggplot2 chart to an interactive plotly chart: just assign the ggplot to a variable. I'm going to say p equals this. The shortcut to type the assignment arrow is Option and minus on a Mac. I assign it to a variable, then say ggplotly and pass the variable to it.
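A minimal sketch of the base plot plus a point layer, as described above. The four-row `movies` table is invented so the example runs on its own, and `p` is just the variable name the talk appears to use.

```r
library(ggplot2)

# Invented values; the workshop uses the full TMDB table instead.
movies <- data.frame(
  runtime = c(95, 120, 162, 0),
  rating  = c(6.1, 7.4, 7.2, 0)
)

# Base plot: data plus the aesthetic mapping. On its own it draws an empty panel.
base <- ggplot(data = movies, aes(x = runtime, y = rating))

# Adding a geom layer is what actually puts points on the chart.
p <- base + geom_point()
print(p)

# In an interactive session with plotly installed, the same object can be
# handed to ggplotly() to get hover tooltips.
if (interactive() && requireNamespace("plotly", quietly = TRUE)) {
  print(plotly::ggplotly(p))
}
```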
If you rerun these two lines here... I got the wrong line, not here, somewhere above. ggplotly, p, and rerun these two lines. If you click on this, it will show the plot in the browser. When you hover your mouse over a point, it shows you the x and y you defined. This is one way to convert a ggplot2 chart to a plotly chart. Let's continue.

Now we have the chart here. We can see that some of the points don't have a length, and some of the movies have a rating of 0 or 10 points. Let's clean it up. Here we are going to create an object called tidy movies. It basically pipes the whole movie set into filter to filter on the runtime. I'm only interested in movies that have at least some runtime and are less than four hours long. I believe this point and all these points will be gone. Now if I plot it again, cool. The points that were lying here and there are gone now. I also changed the color of the points to blue.

If we look closely at the chart, we can still see some movies with a rating of 0 points or 10 points. Before dealing with that, let's plot a bar chart to see how many people actually voted. If nobody voted for those movies and they still have a rating, we just get rid of them. The rating doesn't mean anything. I'm going to use tidy movies and plot a bar chart here. As we can see, the y-axis is the vote count and the x-axis is the rating. Nobody actually votes for movies below 2.5 or over 9, so we're just going to filter those out. Here is the second exercise: filter out the movies that don't have a meaningful rating. I'll give you a couple of minutes, and maybe try to do it yourself. It's pretty similar to the runtime filter, so you can refer to the code above. After that, if you plot again, you should see no outliers lying on these four edges.

Now we're moving on to something a little more complicated. We have some charts which describe the relation between rating and runtime.
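The two clean-up filters could be sketched as follows. The data is invented, the name `tidy_movies` is a guess at how the script spells it, and the cut-offs (a runtime between 0 and 240 minutes, a rating between 2.5 and 9) are simply the values mentioned in the talk.

```r
library(dplyr)

# Invented rows that include the kinds of outliers discussed above.
movies <- data.frame(
  title      = c("No runtime", "Typical", "Long but fine", "Far too long", "One vote"),
  runtime    = c(0, 95, 130, 338, 101),
  rating     = c(5.0, 6.3, 7.8, 6.9, 10),
  vote_count = c(3, 1200, 5400, 80, 1)
)

tidy_movies <- movies %>%
  filter(runtime > 0, runtime < 240) %>%  # some runtime, under four hours
  filter(rating > 2.5, rating < 9)        # drop ratings almost nobody voted for

tidy_movies
```

Separating the conditions with a comma inside `filter()` combines them with a logical AND.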
I'll add another dimension to it, the release date, so we can see how the trend goes over the years. If you check this piece of code, it has a similar base plot, but the x-axis here is release date, which is another column in the movies data set, and y is rating. I'm going to use the scatter plot again, but I'm associating the color with runtime, which is the third dimension. Then I add another scale layer to say: okay, I have color associated with a dimension, and I want the movies with a shorter runtime to show as yellow and the movies with a long runtime to show as red. I also change the x-axis label and add a title. Then let's run it. You can see a plot like this.

As you can see, the x-axis, release date, runs from somewhere around 1920 to 2017, I believe. The rating is something like this, and the color indicates the runtime. We can see that more movies were made in recent years, and because the number of movies is increasing, the range of ratings is increasing too. The colors are sort of mixed with each other, so we can't really tell much from the color here, but it's okay. We can try to plot with the other dimensions. We can rotate the runtime, release date, and rating dimensions and plot another chart.

So the next exercise is to plot the same color-scale scatter plot with runtime on the y-axis, release date on the x-axis, and the color scale over rating. Let's see what the chart looks like. And if you want to see what the code is like, you can always check the IMDB.R file to see the implementation. In this chart, we can see a little bit of difference between the red and the yellow. There's more yellow in the bottom corner here. We can see there's something there, but we don't know which movies they are or what the actual length of the yellow dots is. So let's convert it with ggplotly and check what those yellow dots actually represent.
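A sketch of the three-dimension scatter plot described above: release date on x, rating on y, and runtime mapped to a yellow-to-red color scale. The rows, the axis labels, and the title text are placeholders, not the workshop's own.

```r
library(ggplot2)

# Placeholder rows standing in for the cleaned movie table.
tidy_movies <- data.frame(
  release_date = as.Date(c("1985-07-03", "1999-03-31", "2014-11-07")),
  rating       = c(8.0, 7.9, 8.1),
  runtime      = c(116, 136, 169)
)

p <- ggplot(tidy_movies, aes(x = release_date, y = rating)) +
  geom_point(aes(color = runtime)) +                     # third dimension as color
  scale_color_gradient(low = "yellow", high = "red") +   # short = yellow, long = red
  labs(x = "Release date", y = "Rating",
       title = "Rating over release date, colored by runtime")
print(p)
```

Swapping `rating` and `runtime` between the `y` and `color` mappings gives the variant asked for in the exercise.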
So after converting it to plotly, we can see that the yellow dots here are around 80 to 100 minutes. So we can sort of tell that in recent years there are a lot of movies of around 90 minutes that don't have a very good rating. And in plotly, if you select an area, you can zoom into it to see more details of the points. A double click takes you back to the original chart.

Okay, so ggplotly takes the variables that you defined in ggplot2 and shows them as a tooltip. It shows you the rating, the release date, and the runtime. But if we want to know which movie it is, what the title of the movie is, we can actually use plotly directly to do that. So I'm going to create a plotly chart. It's plot, underscore, ly. Then you define the data here. In plotly you use a tilde to say that something is a dependent variable. So, okay: data is tidy movies, x is tilde release date, y is tilde runtime, and we want to show the title as text. The tilde is the dependency. It indicates that this variable comes from your data set. Yeah, I forgot the tilde here. Okay, so now you can see the titles printed there as well as the release date. So that was some playing around with the runtime, the release date, and the ratings.

Now let's look closely at the genres of the movies. Before that, I want to check what kind of data is in the movie set. If I just call head and query the first observation's genres, I can see what it is. It shows me a stringified JSON. The first movie has three genres associated with it: one is action, one is adventure, one is sci-fi. So we need to do a little bit of data transformation to convert this JSON into a format that we can use. That's where tidyjson comes into the picture. I'm not going into too much detail about how this is converted, but as an example of the result, take the first record, which has three genres.
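The direct plotly call described above might look like this. The table is invented, and `type = "scatter"` with `mode = "markers"` is spelled out here as an assumption about the intended chart; leaving them out makes plotly pick defaults and print a message.

```r
library(plotly)

# Invented rows; in the workshop this would be the cleaned TMDB table.
tidy_movies <- data.frame(
  title        = c("Movie A", "Movie B", "Movie C"),
  release_date = as.Date(c("1985-07-03", "1999-03-31", "2014-11-07")),
  runtime      = c(116, 136, 169)
)

# The tilde turns each argument into a formula, so plotly looks the name up
# as a column of `data` instead of as a variable in the R session.
fig <- plot_ly(
  data = tidy_movies,
  x    = ~release_date,
  y    = ~runtime,
  text = ~title,          # shown in the hover tooltip
  type = "scatter",
  mode = "markers"
)
fig
```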
It will just gather it and produce, for example, three different records, with the other columns holding the same values and the genre column holding a different value in each. So if I run this line and query the first three records of the movies-by-genre object, you can see it has been translated into three records, and the genre is in string format. Okay, so now we have our data with proper genre information.

The next thing I'm going to do is a box plot of ratings versus genres, basically to see the ratings for each genre. So, ggplot, the same as before, but this time we use movies by genre, and we define x as genre and y as rating. This is our base plot. After that, we define the geometric object: geom, boxplot. Okay, it's a box plot. So now we have a box plot for each genre, but it's a little bit hard to read because the labels down here are overlapping with each other. In RStudio, I think if you zoom in more you can see them, but let's just make it more colorful and make the genres easier to tell apart. In the box plot, I can define my own style. If I say fill with genre and run it again, it will show some color. And yeah, it's clearer which one is which. And it has the legend here.

So this is a box plot. As Yulin just introduced, what does this chart mean? This is the 50th percentile, and these are the 75th and the 25th. This is the min value and this is the max value, not counting the outliers. And the dots here and there are the outliers in the data set. If we don't want to show the outliers, we can define that in the box plot by saying outlier.shape equals NA, and then we plot it again. Now the dots are gone. And we can see that, well, horror movies haven't really had a very good rating over all these years. TV movies don't really have a good rating compared with other genres. Documentaries have a good rating overall. But before coming to any conclusion, we know that some kinds of movies are more popular than others, right?
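A small sketch of the box plot by genre, with the fill color and hidden outliers described above. The long-format table (one row per movie and genre) is made up, and `movies_by_genre` is an assumed spelling of the object name.

```r
library(ggplot2)

# Made-up long-format data: one row per movie/genre combination.
movies_by_genre <- data.frame(
  genre  = rep(c("Action", "Documentary", "Horror"), each = 5),
  rating = c(5.9, 6.2, 6.4, 6.8, 7.9,
             6.5, 6.9, 7.1, 7.4, 7.6,
             4.8, 5.3, 5.5, 5.9, 7.5)
)

p <- ggplot(movies_by_genre, aes(x = genre, y = rating)) +
  geom_boxplot(aes(fill = genre),     # one fill color per genre, plus a legend
               outlier.shape = NA)    # do not draw the outlier points
print(p)
```

Setting `outlier.shape = NA` only hides the points; the boxes and whiskers are computed exactly as before.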
Like, dramas and action movies are more popular. There are far more of those movies compared with, for example, documentaries. So we also want to see what the frequency of each genre is: what was the difference in the number of movies produced from 1920 to recent years? In the next line, I'm going to generate the frequency and use ggplot2. Okay, this is a to-do, but I already have the answer here. So you just feed the frequencies in and get a bar chart. Well, it's not quite obvious here. Let me do it in RStudio. Okay, now it's better. So obviously, drama has a lot of movies. Documentaries, compared with dramas, not very many. Foreign movies, no. TV movies are even fewer. Yep.

For all these charts, if you want to convert one to an interactive plot, you can always assign it to a variable and pass it to ggplotly, and then you can see the numbers. I'm going to convert the box plot with ggplotly. There it is. Wrong, wrong, wrong, wrong. Here, sorry. Now you can see the numbers indicating the different ratings for each genre. Yeah, well, ggplotly doesn't handle everything, so it can't interpret that we are hiding the outliers. So it shows them.

Can you point out the x labels? Sorry? The labels on the x-axis? The labels on the x-axis? Yeah, they're overlapping so much. Yeah, so if you decrease the size, then they show. Yeah, or if you have a bigger screen. Yeah, I think you can do that too. But I'm not sure how to make them vertical. Yeah, I think there is a way to define it in your theme so that, for example, when the text is too long you can make it vertical. I forgot the syntax. You can always Google it.

And okay, one tip. If you want to check something, you can always type a question mark in the console. If you say question mark, geom_boxplot, it will show the help for that function. You can always check things from there too. Okay, yeah, back here. Okay, is it too small? The font? Okay, all right.
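One way to build the genre frequency bar chart, together with one commonly used way to turn the x-axis labels vertical (the syntax the talk could not recall). The data is invented and `n_movies` is a made-up column name.

```r
library(dplyr)
library(ggplot2)

# Invented long-format data: one row per movie/genre combination.
movies_by_genre <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Action", "Action", "Documentary")
)

# Count how many rows fall into each genre.
genre_freq <- movies_by_genre %>%
  count(genre, name = "n_movies")

p <- ggplot(genre_freq, aes(x = genre, y = n_movies)) +
  geom_col() +                                                 # bar height = count
  theme(axis.text.x = element_text(angle = 90, hjust = 1))     # vertical x labels
print(p)

# Help for any function is available from the console, for example:
# ?geom_boxplot
```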
So we plotted the bar chart here, and then let's plot something else. I'm interested in, for example, action movies. So I'm just going to filter the movies-by-genre set and keep only the subset of action movies. Then I'm going to plot the ratings over release date for action movies and see how they go over the years. So this is about action movies. I add a smoother to it, and the gray band here represents the standard error. It doesn't say too much, because recently we have so many more movies than in the early years, so the standard error is smaller. But you can still see that the trend is generally not as good as before.

Then I'm going to compare three genres: one is action, one is sci-fi, and one is animation. I'm going to plot the budget over these three genres of movies. So, the same filter function: I filter out the three subcategories, then plot them. And there is another layer that you can add to ggplot2, called facet grid. Basically, if you don't have this layer, it will plot everything in the same chart, and I plot it in different colors based on genre. If you use facet grid, it will plot three small charts and compare them side by side. The scales are the same for the three charts. So I'm saying that I'm plotting the facet grid based on genre. This tilde here also indicates a variable dependency, and the dot is just a shorthand for the variables you already have in the ggplot before you plot it as a facet grid. So if I reverse it, I'm saying: okay, let's use all the variables before that and have a dependency on genre. It will look like this.

And again, you can use plotly to do that. Similarly, I give it the data. Then I say x equals tilde release date, y equals tilde budget, and text equals tilde title. Color depends on genre. plotly, sorry, typo. Yeah, there are a couple of things that you can define, like the trace and the marker.
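A sketch of the three-genre comparison with `facet_grid()`. All rows and budget figures are invented; only the genre names and the column roles follow the talk.

```r
library(dplyr)
library(ggplot2)

# Invented long-format data, including one genre that gets filtered away.
movies_by_genre <- data.frame(
  genre = c("Action", "Action", "Science Fiction", "Science Fiction",
            "Animation", "Animation", "Drama"),
  release_date = as.Date(c("1999-03-31", "2012-05-04", "1982-06-25", "2014-11-07",
                           "1995-11-22", "2010-06-18", "1994-09-23")),
  budget = c(60e6, 220e6, 30e6, 165e6, 30e6, 200e6, 25e6)
)

three_genres <- movies_by_genre %>%
  filter(genre %in% c("Action", "Science Fiction", "Animation"))

p <- ggplot(three_genres, aes(x = release_date, y = budget, color = genre)) +
  geom_point() +
  facet_grid(. ~ genre)   # one panel per genre, laid out as columns
print(p)

# Reversing the formula lays the panels out as rows instead:
p_rows <- ggplot(three_genres, aes(x = release_date, y = budget, color = genre)) +
  geom_point() +
  facet_grid(genre ~ .)
```

In the `facet_grid()` formula the left side controls the rows and the right side the columns; the dot means "no split along this direction".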
If you don't define them, plotly just goes with the defaults, but it will show you a warning message saying, okay, I'm using the default. Yeah, okay, now I can see that this is the most expensive movie across all three categories. It's 380 million. And Avengers is also expensive. And another Pirates of the Caribbean, and Superman Returns. So it's a lot of superhero movies, isn't it? Cool.

So that's about ggplot2 and a little bit about plotly. I have three exercises down here for you to do at the end of this workshop. I'll give you about 10 minutes to do them. Then let me know if you have any problems. Then we continue with the Game of Thrones network.

Did you all get it? What's the name of the most popular movie across drama and action? [...] in R that you can use for plotting. And there are several interesting interactive libraries that I want to introduce. One is visNetwork. As the name says, it's network visualization. Basically it's for plotting network relations. Leaflet is mostly used for maps, to show different information based on geolocation. And yes, plotly is one of the popular general-purpose libraries. And Shiny is a framework for exporting your interactive plotting tool as a web app. The good thing about plotly is that it works perfectly fine with Shiny. You can just use plotly to generate the graphs and use Shiny to export them as a web app.

So now I'm going to use visNetwork to show something about Game of Thrones. I also used a data set called Game of Thrones from Kaggle. If you go back to RStudio, there is a script called GOTNet. So we import the libraries, tidyverse and visNetwork. visNetwork requires your data set to be in a certain form. It requires the nodes and the edges. If we want a close look at what the data looks like, you can type View. After importing the CSV, just view nodes, maybe the first three columns, the first three observations.
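The node and edge tables that visNetwork expects can be sketched with a toy example. This assumes the usual visNetwork convention of an `id` column for nodes and `from`/`to` columns for edges, so columns named differently in the Kaggle files would need renaming first; the three characters and their links are only an illustration.

```r
library(visNetwork)

# Toy node table: one row per character, with a unique id and a display label.
nodes <- data.frame(
  id    = 1:3,
  label = c("Jon Snow", "Daenerys Targaryen", "Tyrion Lannister")
)

# Toy edge table: each row links two node ids.
edges <- data.frame(
  from = c(1, 2),
  to   = c(3, 3)
)

net <- visNetwork(nodes, edges)
net   # in RStudio this opens the interactive network in the Viewer pane
```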
Okay, well, the third, sorry, this is wrong. First three. It has a name and some attributes for the node: culture and known house in Game of Thrones, labels, titles and super culture. That's some information about the node. And then if we take a look at the edges, there have to be columns called source and target. They basically show the relations that connect the nodes. So we get the edges and nodes and we do some cleanup.

I'm also going to add icons for two of the characters in Game of Thrones. visNetwork only allows you to use photos from the Internet, and I don't have a server to serve the images for every character, so I'm just going to change two of them. One is Jon Snow, one is Daenerys Targaryen. There is an image column, and you just put the value in the image column for a certain node. Then call the function visNetwork and feed in the nodes and edges that you processed. Then pipe it to visNodes, shape properties. I want to use the image with a border, so I define that to be true, and pipe it to the igraph layout. The igraph layout has some algorithm to arrange the nodes in a certain order. I don't know which algorithm it follows, but it basically gives you a nice shape for the network.

And if you open it in a browser, it gives you the two icons of these two guys. A guy and a girl. And if I click on one, you can see what the edges are and what the nodes are. And it has a name. You can definitely do more with visNetwork. This is just a 101 to show what it looks like. Cool. So that's about it. That's today's workshop. I hope you learned something and enjoyed it. Okay, I'm going to pass it to Yuli.