Let's get started with the next session. This is Mr. Bhan Babi. He has come down from Delhi to give us this tutorial and join us for the hack night. So we'll be doing the tutorial right now. Let's get started.

Hi, good afternoon to everyone. I hope nobody's going off to sleep. It's very obvious. Thank you to ASGEE for inviting me. It so happens that they contacted Gaurav at Dixar, with whom I am a trainer for R, and then things followed. I'm basically a freelance consultant. I work on projects in R, and that's my bread and butter. I also do some training. Before we get into this, I would like to first understand how many of you are actually R users. Three, four of us here. So that's going to be a bit of a challenge for me, but that's OK. So what kind of things have you used R for?

I use R for statistical analysis; I'm a biologist. We use R to analyze our data, plus some visualizations that go with the analysis.

Bioconductor? Sorry? Do you use Bioconductor? No, I don't use that. And you?

I use ggplot, mostly the plotting stuff for exploring data, and other kinds of data transformation stuff like reshape and plyr. Mainly Hadley Wickham's stuff, for exploring data.

OK. So are you a data scientist?

Yeah, I used to work with an ad firm. We used it for optimizations. Initially we used it for data exploration to find outliers, for finding fraud and that kind of stuff. I have tried to automate some of those parts, but it's mostly for data exploration and finding out where the anomalies are. Once we came up with the algorithm, we would code it back in SQL or Java to make it more efficient, because none of us was sophisticated enough to integrate R into the products.
So let's turn to what can be done. I guess I'll have to go much slower than I had planned. What I have is a fairly large amount of material, which can probably consume three or four hours. So we'll decide the pace: maybe after 15 minutes you tell me whether the pace is good, slow, et cetera, and maybe I can speed it up. I can skip a lot of things and go over to some slightly esoteric things as well, if you like that.

I'm a statistician to start with, so I like to chop things up as univariate continuous data, or discrete data, and so on. That's what appeals to me better. Let's talk about univariate continuous data to start with. Here we have something on a continuous scale that has been measured, for example birth weight: the birth weights of some 2700 children born at some particular hospital, or something like that. So there's a data set, bw, and length of bw gives me the number of data points in that data set. Obviously, what one would do if one is purely looking at the numbers is to look at the summary of that data set, which gives you the minimum, maximum, median, mean, and so on. But that still doesn't give you much of a picture.

So the simplest way of visualizing it is to look at the box plot of our univariate data. The box plot is drawn by calling the boxplot function. The argument that you pass is the vector bw. Then I want it horizontal instead of vertical, which is the default, so I say horizontal is TRUE. notch is TRUE will give me this notch here, which shows much better where the median is. The width of the box, as you would know, is the distance from Q1 to Q3, the first and the third quartile. Then there are whiskers, which extend up to one and a half times the interquartile range. This can be changed: you can specify what you want the end points of the whiskers to be, and so on. And these things that you see here are the outliers.
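As a rough sketch of the summary and box plot calls just described: the talk's bw data is not available here, so the vector below is a simulated stand-in (a mixture of three normal populations, which is an assumption made purely for illustration).

```r
# Simulated stand-in for the talk's `bw` vector (the real data is not available).
set.seed(1)
bw <- c(rnorm(1800, mean = 12, sd = 2),
        rnorm(600,  mean = 18, sd = 2),
        rnorm(300,  mean = 26, sd = 3))

length(bw)    # number of data points
summary(bw)   # minimum, quartiles, median, mean, maximum

# Horizontal box plot; notch = TRUE draws a notch around the median.
# range = 1.5 (the default) is the whisker length in units of the IQR.
b <- boxplot(bw, horizontal = TRUE, notch = TRUE, range = 1.5)
b$stats       # lower whisker, Q1 hinge, median, Q3 hinge, upper whisker
b$out         # the points drawn individually as outliers
```

Changing the range argument is one way to move the whisker end points, which in turn changes which points are flagged as outliers.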
So it's a very obvious and simple way of identifying outliers. While we are on the box plot, there are three different types of box plots that are commonly looked at. The first one is the box plot that you have seen. Before I get into the three types of box plots, there's something called par. par is a low-level function which prepares the plotting area. What do you do in base R with plotting? It's like ink on paper. You tell it, I want to plot this, and it puts it there. And once ink is put on the paper, you can't erase it. You can always say that you want to prepare things in a certain manner and not put them there yet; that is also doable. But the default is that it goes on the paper, and after that you can't change it. So you start by saying, I want a plotting area: make a frame row-wise with three rows and one column, mfrow is c(3, 1). c is the combine operator. So I essentially have a frame with three rows. In fact, this presentation has been created in TeX, with all the figures created as PDFs and imported into the presentation. On this side, of course, I'm showing you the code, but the entire figure that you see here is created as one figure. I'm not adding two or three figures to one slide. So how do you put these three figures together in one plotting area? This is how you do it. Then, of course, the box plot that you saw. The second one is an adjusted box plot, which is available in the robustbase package. The adjusted box plot is designed to look at not only the median and the quartiles, but also the skewness of the data. Therefore, it is better. You can see that the number of outliers identified using the adjusted box plot is different and, in fact, it is now showing an outlier on the lower side of the range as well. So which tool do you use?
That will determine what kind of conclusions you are going to draw from the data. So you had better know what tools are available and which is the appropriate tool for the given data. The next one is a violin plot, which is available in the vioplot library. The function is also vioplot. Again, you pass the same vector, the colour and so on. What you get is not only the box plot, but the distribution of the data drawn around the box itself. So you can see that there's a long tail on the right and so on.

Of course, the next one is the most obvious way of visualizing univariate data, which is a histogram. This is a histogram with labels shown on the top. It can be created using the function hist. The xlab and ylab arguments give me the labels on the axes. The scales on the axes are created by default, but you can always tweak them. You can specify everything: how many breaks you want, what range you want, and so on. And labels gives me, as labels, the height of each bar. The histogram essentially gives you the frequency of each bin. There are several binning algorithms, and you can choose which one you want. This is the default. On top of that I add a curve, a density curve.

Sorry, do we have the code available?

I can give you the code if you like.

No, I mean, if it was on GitHub or something, then I could just load it. Because I'm typing it in, and I'm rather slow.

I can give you the code. No, that's fine. I don't go on GitHub, I'm sorry.

So, anyway, here's the histogram, and then you can add lines. You can add as many lines as you want. You can add another histogram on top of this, and so on. So I have added lines created by the density function of bw. Now, earlier we had some inkling that the data had a long right tail.
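A minimal sketch of the histogram with a density curve on top, again on a simulated stand-in for bw. Using freq = FALSE is an assumption made here so that the bars and the density curve share one vertical scale; the slide's exact arguments are not known.

```r
set.seed(1)
# Simulated stand-in for `bw`; not the data used in the talk.
bw <- c(rnorm(1800, 12, 2), rnorm(600, 18, 2), rnorm(300, 26, 3))

# labels = TRUE prints each bar's height above it; xlab/ylab label the axes.
# freq = FALSE plots densities instead of counts so the curve is comparable.
h <- hist(bw, freq = FALSE, labels = TRUE,
          xlab = "bw", ylab = "Density", main = "Histogram of bw")

# Ink on paper: lines() adds to the plot that is already drawn.
lines(density(bw), col = "red", lwd = 2)
```

The object returned by hist holds the breaks, counts and densities, so the binning that was actually used can be inspected after the fact.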
Now, this histogram, along with the density function, is actually showing you that there is probably another population which is being mixed into this data, right? That's the kind of information you can get if you look at the same data differently. Now I increase the bins. Earlier I had whatever the default bins were; now I've added more breaks, and it's very clear that there is a different population which has been mixed into the data. I go ahead and show you that, in fact, what I have is a mixture of three populations. There is one population sitting right here. If you see, there is a lack of symmetry in this data, therefore there's another population which has been mixed in here. And I have a third one, which is here. All this is very empirical, though, right? I mean, I know because I've done it before, so it's easy for me to tell you that this is how it is. So this is a mixture of densities.

How do I show all this? Along with this, I'm now showing you an adjusted box plot as well, using a single, common x-axis. Another way of creating a layout is using a matrix. So I define a matrix which has a single column and two rows. Layout works the other way around. Then I specify a layout as per this matrix; the matrix is filled column-wise. The heights and the widths can then be distributed between the rows and columns of the matrix. In this case there are only heights, so I'm saying 80% of my plot area is the histogram and 20% is the box plot. You can play with all these things. Then I'm changing the margins of the plot area. I'm saying the margin should be zero at the bottom for the histogram, four on the left so that I have room for the axis, the tick labels, the axis label, the ticks and so on.
So I have four lines on the left, three on the top, one on the right, and the box type, bty, is "n", so I don't want any box around the histogram. Otherwise, by default, you get a box around the plot in most cases. So I've got rid of that. Then I put in the histogram. In fact, when I plot the histogram, I say that axes is FALSE, so no axes are actually drawn. As I said, it's ink on paper, and you can specify what you don't want to put down. Then I say that I want to plot the y-axis; I don't want to draw the x-axis as yet. Then I go ahead and do the box plot. The box plot has its own x-axis, and that takes care of it. Of course, I have done some work on scaling so that the histogram and the box plot have a common scale. In the histogram I have specified that I want it to range from zero to 35, so that in both cases the same scale is used. That's how you can combine plots.

And this is what a statistician would like to see, right? A qqnorm, the normal probability plot. It clearly tells you that this is a mixture of data, because it is deviating from the expected distribution. So you have qqnorm, which plots the points; there are too many points, so that is all we are seeing; and then there is a red line which I have added, the qqline, which is what one would expect if the data were normally distributed. The two moving apart indicates that either the data comes from a different distribution or it's a mixture of distributions. And of course there are tests for actually establishing that it's a mixture of distributions.

Having talked about continuous univariate data, let us look at discrete data. Let's look at univariate discrete data first.
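The layout steps for the continuous data (more breaks, a two-row layout with 80/20 heights, custom margins, a shared x scale, then the normal probability plot) might look roughly like this. It uses the plain boxplot rather than the adjusted box plot from the talk, and a simulated stand-in for bw, so treat it as an approximation of the slide rather than its actual code.

```r
set.seed(1)
# Simulated stand-in for `bw`; not the data used in the talk.
bw <- c(rnorm(1800, 12, 2), rnorm(600, 18, 2), rnorm(300, 26, 3))
xl <- range(pretty(bw))                   # one x range shared by both panels

op <- par(no.readonly = TRUE)             # remember settings to restore later
layout(matrix(1:2, nrow = 2), heights = c(0.8, 0.2))

par(mar = c(0, 4, 3, 1), bty = "n")       # bottom, left, top, right; no box
hist(bw, breaks = 50, xlim = xl, axes = FALSE, main = "bw", xlab = "")
axis(2)                                   # draw only the y axis

par(mar = c(4, 4, 0, 1))
# For a horizontal box plot, ylim still controls the data axis.
boxplot(bw, horizontal = TRUE, ylim = xl, frame.plot = FALSE, xlab = "bw")

par(op)                                   # back to a single default panel
layout(1)
qqnorm(bw)                                # normal probability plot
qqline(bw, col = "red")                   # reference line under normality
```

Because both panels use the same left and right margins and the same range, the box plot's axis serves as the x-axis for the histogram above it.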
Now, these are the topics most visited on English Wikipedia on 31st May 2013, the day after Rituparno Ghosh died. So obviously the views on him jumped up, and you have him right at the second position. Interestingly, Cat anatomy is also very popular. And so on. If you don't have too many categories and you give this kind of data to R, you get a bar chart like this, with N bars for the N categories in the data. So I create a bar plot of the data. Now wikipedia, as I have it in my R session, is actually a vector with the names of the pages visited, or the topics visited, as its names. So those labels automatically appear in the bar plot. Again, I have specified that I want it horizontal; otherwise by default it would have been vertical. Then look at this: added colours. There are various colour palettes available by default in R. One is rainbow, which takes you across the entire spectrum. Quick and dirty: if you just say rainbow(5), you get five colours across the spectrum. But useful. topo.colors is the set of colours typically used in topographic sheets. There are topo.colors, cm.colors, and several other such palettes. If you look up the help for one of these, rainbow or topo.colors, you'll get all of them. Then I've added text as the labels, and that gives me these labels on the bars. So the actual number of views has to be added as text in the case of barplot. In the case of hist, you can get the labels directly at the top of the bars. In the case of barplot that is not readily available, but you can easily do it by adding text.

Another very popular way of showing this data is to add the bars and a cumulative view, right? Cumulative percentage and so on. So you want two axes: on one axis the frequency, on the other axis the cumulative proportion.
So I just create a vector which is the cumulative sum of the proportions of the sorted data. And then I go ahead and use twoord.plot. In twoord.plot you have left and right x and y; you specify those and you get the plot. Am I running too fast? Are you okay? Good, thank you. I hope to finish something.

Then of course you have the popular pie chart for this kind of data, which I hate, being a statistician. You really don't know what is what, or how much more one is than the other. You also have the so-called 3D exploded pie chart; that's available as pie3D, and you can use that. And then there's something called a dot chart. A dot chart is useful when you have a huge number of categories. All that you show is a dot against each category. If there are a large number of categories, they compress into a small area and you actually see a kind of curve or distribution. Here I used the rainbow colours, but they don't show up very well on the blue background.

The next data that I want to talk about is bivariate categorical data. The data is on hair and eye colours of 500 students at a particular university. This is a data set available in R, called HairEyeColor. It also has a third dimension, gender. If I have not specified a package, it's available in the vanilla installation of R. So there are several hair colours and eye colours, and what is of interest is to find out the association between these colours, right? You pass this data to barplot and you get a stacked bar. If I do just barplot, I get this kind of stacked bar plot. Of course, I have spiced it up a bit by adding colours. In mycolors I have tried to match them up with the eye colour, brown, blue, hazel, green, and then I have the hair colours as the four bars.
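The bar chart with text labels, the cumulative proportion and the dot chart described for the univariate categorical data can be sketched in base R as follows. The Wikipedia page-view counts are not available here, so R's built-in precip vector (a named numeric vector) is used as a stand-in, and the variable name views is illustrative only.

```r
# Stand-in data: the five largest values in R's built-in `precip` vector.
views <- sort(precip, decreasing = TRUE)[1:5]

cols <- rainbow(length(views))     # alternatives: topo.colors(), cm.colors()
mids <- barplot(views, horiz = TRUE, col = cols, las = 1,
                xlim = c(0, max(views) * 1.15))
# barplot() returns the bar midpoints, which is where the labels are placed.
text(x = views, y = mids, labels = views, pos = 4)

# Cumulative proportion of the sorted values, as used for the second axis.
cum_prop <- cumsum(views) / sum(views)

# Dot chart: one dot per category, useful when there are many categories.
dotchart(sort(precip), cex = 0.5)
```

The two-axis chart in the talk used twoord.plot from an add-on package; the sketch above only computes the cumulative series that such a chart would plot.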
Then I added a legend at the top right. This is how you add a legend. You can specify the x and y coordinates, or you could just say "topright", "topleft" and so on, and it is put there. The legend text comes from the attributes of HairEyeColor. From the attributes I am extracting the dimension names. One dimension is of course hair colour, the other is eye colour. I am extracting only the eye colour dimension names, which gives me the four eye colours, and then colouring them using mycolors. mycolors is just a vector which contains the four colours corresponding to brown, blue, hazel and green. If you want to show the same thing as a grouped bar chart, which is again a very popular way of showing it, all that I do is change the barplot call and add beside is TRUE.

There's something else available in R to look at bivariate categorical data, and that is the mosaic plot. Bivariate, or in fact multivariate, so you can add levels to this. Now this plot is very informative, in the sense that it's very clear that blondes come with blue eyes, right? Most of the blondes have blue eyes. Brown eyes go with brown hair, and brown eyes go with black hair. It's very obvious, right? So a mosaic plot is a good way of looking at categorical data, and you can have multiple categories, multiple levels, but you have to be careful about how you choose colours and make things look good. Otherwise it's best to leave them as empty boxes and compare the areas of the boxes, their heights and widths. On each axis you also get to see the relative proportions of each of the dimensions in the categorical data. The function is simply mosaicplot.

Let's now look at some pretty old car data, from 1973 to 1974 models: fuel consumption and 10 aspects of automobile design and performance for 32 different automobiles, from Motor Trend. The data set is called mtcars.
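A sketch of the stacked bar, legend, grouped bar and mosaic plot for HairEyeColor. The colour names in mycolors are a guess at something close to the eye colours, and collapsing the third dimension with margin.table is an assumption about how the slide's table was prepared.

```r
# Collapse the third dimension (Sex) to get a Hair x Eye table of counts.
he <- margin.table(HairEyeColor, c(1, 2))

# Eye colours in table order; the colour choices are illustrative.
eye <- dimnames(he)$Eye
mycolors <- c("saddlebrown", "steelblue", "darkkhaki", "seagreen")

# barplot() stacks the rows of a matrix within each bar, so eye colour
# goes in the rows and each bar is one hair colour.
barplot(t(he), col = mycolors, xlab = "Hair colour")
legend("topright", legend = eye, fill = mycolors, title = "Eye colour")

# Same data as grouped bars.
barplot(t(he), col = mycolors, beside = TRUE, xlab = "Hair colour")
legend("topright", legend = eye, fill = mycolors, title = "Eye colour")

# Mosaic plot: tile areas are proportional to the cell counts.
mosaicplot(he, color = mycolors, main = "Hair and eye colour")
```

Passing no colours to mosaicplot leaves the tiles as plain boxes, which is the safer default the talk recommends when a good palette is hard to choose.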
It is again available in the standard R datasets, so you can play around with it very easily. In the case of bivariate data, continuous versus categorical, this is probably the way of visualizing the data that I like. More often than not the question I am trying to address is: what are the right kinds of plots for a given kind of data? So, well, here it is. If you want car mileage as a function of cylinders, four, six or eight cylinders, then you just have a box plot of mpg as a function of cyl using the data mtcars, and here you get it. Of course, I have added colours, but that's okay. You get the box plot. There's probably one car which is really very bad: it has eight cylinders, lots of power, but no fuel efficiency.

However, if I had two continuous variables, say car mileage as a function of weight, then the appropriate plot is a scatter plot, and that's what you get. I'm just showing you a slightly different way of calling or structuring functions in R, using with: with mtcars, the different columns in the data frame can be accessed inside the function directly by their names. So I don't have to say mtcars dollar wt, mtcars dollar mpg and so on. It makes life easy. I just say with, enclose everything within that, and just say plot wt against mpg and specify those variables. pch lets you choose one of the 30 or 40 different symbols that are available. You can also replace that with any character, any glyph that you want to use. Colour you can of course choose for the plotted glyphs, on any of the plots in fact.

There are three functions that you can use; this is just a pointer to them. identify: once you say identify, you can go and click on points, and it will tell you what exactly that point is in terms of its x and y values. So it will plot.
It will mark it there, and it will also give you that information on the console. So some interactivity of this kind is available. Then there's something called locator, which can be used to place text on the graph. If you want to write some text at a specific spot, you say locator, click here, and it is put here. So you can manually add things to the plotted area, typically the kind of thing you do in Excel: you want to add something over here, you just take a text box and put it there. Something like that. And then there's grid, which adds grid lines. Again, something popular but not always required.

Okay, on to the next. Again, some specifics. You have this kind of plot, and you can add a regression line to it with a simple function, abline. You pass it lm, the least squares fit of y versus x, in the colour red. So a straight line will be added to this kind of plot. If you want to add lowess, which is a locally smoothed line through the data, you can do that using lines: pass the lowess result to it and it plots the fitted line. You can keep adding things to a plot. If you want, you can add a histogram of car weights here itself, a histogram of mileage there, and so on. All that can be done.

There's a package called car which does all these things in one go. There's a function scatterplot there. You just say mpg as a function of wt, the data is mtcars, and you have the scatter plot. You get a smooth fit, you also get the linear fit, and you also get box plots on either side: something that we have already constructed step by step in different cases. So all that is canned and readily available for you.

Similar to the box plot, there is also a bivariate box plot, which is called the bagplot. Not really very interesting, but if you are interested, here is the bagplot of weight and mpg. It essentially creates envelopes around the inner 25, 50 and 75% of the data.
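The mtcars steps (box plot by cylinder count, scatter plot inside with, grid lines, a least-squares line and a lowess curve) can be put together roughly as below. The colours and axis labels are illustrative choices; identify and locator are left out because they wait for mouse clicks on an interactive device.

```r
# Continuous versus categorical: one box per cylinder count.
boxplot(mpg ~ cyl, data = mtcars, col = c("gold", "orange", "tomato"),
        xlab = "Cylinders", ylab = "Miles per gallon")

# Least-squares fit of mileage on weight.
fit <- lm(mpg ~ wt, data = mtcars)

# Two continuous variables: columns are referenced by name inside with().
with(mtcars, {
  plot(wt, mpg, pch = 19, col = "steelblue",
       xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
  grid()                                              # optional grid lines
  abline(fit, col = "red")                            # straight regression line
  lines(lowess(wt, mpg), col = "darkgreen", lty = 2)  # locally smoothed curve
})
```

Each call after plot only adds ink to the existing figure, which is why the order of the calls matters.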
And also, you can see a red dot here. With a scatter plot, if you have a large number of data points, it just becomes a dirty dark cloud. The outline of the shape of the cloud is all that you really get to know. But you would also be interested in knowing the density of the points within that cloud. A very simple way of doing that is to play with alpha transparency. In your plot, all you say is that the colour has a very low alpha. So every point is a pretty light disc, but as the number of points keeps increasing, the density goes up. That is a very simple way of doing it. Quick and dirty.

Here is a better way of doing it. There is something called hexagonal binning. Your two-dimensional data is converted into hexagons. Each hexagon has a colour density, or actually a colour, which tells you how many points there are in that bin. There is a package called hexbin. You can say hexbin of x versus y; I want 50 bins. So it is in a way also a histogram: each bin has a certain frequency, and then you just plot. When you say plot of the bin object, you get this kind of plot.

Now, this plot function is something that I introduced once before, when we plotted the scatter plot. R is completely object-oriented, so depending on what object you pass to the function, the function behaves differently. That is one beauty of, say, the plot function. Hereafter, what we are going to look at is how the plot function behaves if you pass different objects to it. So over the next couple of slides we will look at that. First we saw that if you pass two continuous vectors to the plot function, you get a scatter plot. Next we are looking at a bin. A bin is by itself a data structure; its class would be hexbin, in fact.
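A small sketch of the two overplotting remedies, on simulated points since the slide's data is not known. The hexbin part runs only if that add-on package happens to be installed.

```r
set.seed(42)
n <- 20000
x <- rnorm(n)
y <- x + rnorm(n)                     # simulated cloud of points

# rgb(..., alpha = ) builds a semi-transparent colour; where points overlap,
# the ink accumulates and dense regions look darker.
faint_blue <- rgb(0, 0, 1, alpha = 0.05)
plot(x, y, pch = 16, col = faint_blue)

# Hexagonal binning, only if the hexbin package is available.
if (requireNamespace("hexbin", quietly = TRUE)) {
  library(hexbin)
  bin <- hexbin(x, y, xbins = 50)     # 50 bins across the x range
  plot(bin)                           # dispatches on the hexbin class
}
```

Transparency needs a graphics device that supports it; the PDF and most on-screen devices do.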
A time series is by itself a different data structure, a different data class, where you have time information associated with the series. I have picked up the El Niño sea surface temperature data, which is available in the tseries package. This window function just pulls out a certain subset of the data. nino3 contains data for, I think, the last 50 years or so. I am looking at just about 10 years of data, so I extracted a window from 1992 to 2000, and I plot it. It is a time series object, so it is plotted against time; otherwise I would not have got time on the x-axis. Next, what I have done is identify the peaks by clicking there. That helps me identify the frequency of the cycles, and you will see that in 1997-98 the cycle got disturbed. Then of course you can add text also. If I did not want to use identify, I could have used a function to find the local maxima and marked them with text.

Next is decomposition. The decompose function decomposes the time series into the trend, the seasonal effect and the random component. If I say I want to plot the decomposed series, I get four plots together; that is what plot does with a decomposed object. It plots four different things: the observed signal itself, the trend, the seasonality and the random component.

If I have multivariate data: iris is probably one of the most commonly used demo data sets in R. The data has its own history. R. A. Fisher is supposed to have collected this data on iris plants of three different species, setosa, versicolor and virginica. The data set contains some 150 points, 50 on each of the species, and there are four measurements: sepal length, sepal width, petal length and petal width. So there are these three different species, and if you just pass the data set iris to the plot function, what you get is this kind of matrix of plots.
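The time-series steps (take a window, plot it, decompose it) can be tried with a series that ships with R. The built-in nottem data (monthly Nottingham temperatures, 1920 to 1939) is used here as a stand-in for the sea surface temperature series in the talk, and the object names are illustrative.

```r
# Ten years pulled out of the built-in monthly series `nottem`.
tt <- window(nottem, start = c(1930, 1), end = c(1939, 12))
class(tt)                             # a time series object

# plot() on a ts object puts time on the x-axis automatically.
plot(tt, ylab = "Temperature (F)")

# decompose() splits the series into trend, seasonal and random parts
# (additive by default); plotting the result draws four stacked panels.
dc <- decompose(tt)
plot(dc)
```

The seasonal component returned by decompose repeats the same twelve monthly values every year, which is what makes the cycle length easy to read off.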
Every single plot is a pair; for example, this one is petal length versus petal width. To make it very obvious, what I have added is colour: col is iris dollar Species. So every point takes the colour of its species. It is very clear that this is setosa, because I know. So this is setosa, and it stands out very well on petal length versus petal width. It has distinctly smaller petal lengths and petal widths. Between these two there is some kind of continuity, right? So there is a lot more similarity between virginica and versicolor than with setosa. This data set has been used for demonstrating the capabilities of whatever: logistic regression, decision trees, clustering and so on. Any classification algorithm gets tested on this, and nobody can really separate those two species cleanly on the basis of the four measurements.

Next is a linear model. A lot of you might have used linear regression. So you have a simple linear model of mpg as a function of weight. If you pass the linear model object to the plot function, you get these four plots. Normally you see one plot at a time and you have to keep pressing Enter. Instead of doing that, I have created a 4 by 4, or rather a 2 by 2 grid. So the plot area is a 2 by 2 grid, and it plots all four in a single go. So different objects are passed to the plot method and you get different kinds of plots. Here you get all the necessary things that you would want to see about a regression object.

I have done clustering of the mtcars data, and the next thing I have done is plot the cluster object. You get a dendrogram. Entirely different objects, entirely different plots, using the same plot function. So what kind of plot, what kind of graph should I have? Use the plot method and you will know; at least, that is what I would say to do.
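To see the same plot call behave differently for different classes, something like the following can be run. Hierarchical clustering with hclust on scaled data is an assumption here; the talk only says that the clustering result plots as a dendrogram.

```r
# Data frame: plot() draws a matrix of pairwise scatter plots.
plot(iris[1:4], col = as.integer(iris$Species))   # one colour per species

# Linear model: plot() draws four diagnostic plots; a 2 x 2 grid shows them
# together instead of one at a time.
fit <- lm(mpg ~ wt, data = mtcars)
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Hierarchical clustering: plot() draws a dendrogram.
hc <- hclust(dist(scale(mtcars)))
plot(hc)

# The class of each object decides which plot method is used.
classes <- sapply(list(iris, fit, hc), function(obj) class(obj)[1])
classes
```

Calling methods(plot) in an R session lists the classes for which a plot method is currently registered.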
Decision trees, again the same thing. I use rpart to create the decision tree, and if I pass the rpart object to the plot function I get the decision tree. What I have done here is use the rpart.plot package, which gives you the prp function, which gives a much better looking tree. Otherwise you get a very plain looking tree using the plot function, which is of course good enough for exploratory work.

Next you have financial time series. It is multivariate data, because you have the open, low, high, close series, and it is time series data again, so plotted against time. This kind of data can be fetched and worked upon using the quantmod package. quantmod is a kind of wrapper package over all the typical functions that one uses when handling financial data. You can start with getSymbols, get the symbol for, in this case, the Yahoo stock from a certain date, and then look at it. There is a function called chartSeries. You get what is typically shown: the prices as candlesticks, which are the open, low, high, close, and in the lower pane the volume of stock which was traded. It fetches this from, I think the default is finance.yahoo.com, but there are several sources that you can choose from.

I saw there was a manual method where you scrape the tables off finance.yahoo.com. How can you do that?

You can always do that. Use curl, take whatever you want to take, and then you clean it. But there are calls to the APIs, so you can do that instead.

So you have multivariate data in mixed mode and you want to visualize that. The plot functions available in base are probably not enough, so that is when you should start looking at two of the most commonly used packages. One is lattice, which is essentially based on the idea of conditioning. So you have a data series: given a particular variable, what would the rest of the variables look like?
So it is conditioning on the values taken by one or more of the variables in the dataset. The commonly used methods in lattice are xyplot, levelplot and various panel functions; we will look at those. The other one is ggplot2. The package is ggplot2, which people typically refer to as just ggplot, and it is an implementation of the grammar of graphics in R. In essence this is the ultimate in what you can do with plotting, because it has everything separated out and you know exactly what you are working with when you do a particular thing. So when you were presenting about D3.js, it was very easy to relate each activity that you do there to the component you would be touching in the grammar of graphics: whether it is the data; the geometry, whether it is a point or a line and so on; the statistical operation, whether it is a box plot or a histogram or binning; the scales, what kind of scales, logarithmic etc., that you want to use; the facets, meaning the conditioning that I talked about in lattice, the same thing; the coordinates, whether you want a Euclidean coordinate system or a, what is it called?
Radial, basically, and so on. And then there are certain options that you also always have available to play with.

So this is your xyplot in the lattice package. This is a slightly different data set, on car resale price as a function of mileage and model. There are cars which go to a large reseller, and he has to set the price of each old car that he buys and wants to resell. What is the right resale price to set on every old car that he wants to sell? That is the problem we are working on. He has past data on several models. The price is on the y-axis, and each individual panel is a plot for one model. The scatter plot in almost all cases shows you that mileage versus price is all that you really need to look at. Maybe the year of make would be another dimension, but it is hardly needed; mileage probably takes care of the problem.

So how do you plot that? You just say require lattice, so that you start using the package, and then xyplot of price as a function of mileage given the model. It is a typical way of saying it in statistics: price on mileage given model, and "given model" gives you a panel for each model. The next step is to add regression lines for each. How do I do that? Add another function. For each panel I am saying add a function. The first thing is just the xyplot of x versus y, the scatter plot, which adds the blue dots or bullets for a model. Then span equal to 1 means that each regression line is going to span a single model; you can actually change the layout and say that I want this regression line to span across two plots, and so on. That can also be done. And then I am colouring it and so on.

Next, if you have continuous versus several categories, so on my x-axis also I have categories: the variety of barley. This is one famous barley experiment. It's a design-of-experiments kind of example, where barley production was to be increased.
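A conditioned scatter plot with a fitted line in every panel might be written along these lines. The resale data from the talk is not available, so mtcars stands in for it (mileage against weight, one panel per cylinder count), and panel.lmline is used for a straight least-squares line, which may differ from the smoother shown on the slide.

```r
library(lattice)   # a recommended package shipped with standard R installs

# y ~ x | g reads as "y as a function of x, given g": one panel per level of g.
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            layout = c(3, 1),
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)           # the scatter points
              panel.lmline(x, y, col = "red")   # least-squares line per panel
            },
            xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

print(p)           # lattice plots are objects; printing draws them
```

Unlike base graphics, nothing is drawn until the trellis object is printed, so the plot can be stored, modified with update, and drawn later.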
obviously, and so they experimented with several varieties at several different sites. So each plot is for a site, and so on and so forth, OK? It is a dot plot for each site, and for each variety I have just plotted a single point, OK? Now obviously, if you were to plot a scatter of all of them together, a very interesting fact would not become apparent. If you look at the colors of the points, you will see that there is something going on at Morris. Another dimension here was, of course, the year, 1932. What you will observe is that only at Morris are the blue dots higher up than the pink rectangles; in all of the other cases the rectangles are higher, right? So that could very well be a data issue, but it becomes very apparent if you use the right plot.

So what I do is this. The first part is yield as a function of variety given site; the data is barley. Now on top of that I am saying groups is year. When you say groups is year, another level of panels is not created; what is done is that two different series are plotted within each panel. This again could be a combination of several fields in the data, so that many series could be plotted, and then those are identified here. I want them stacked, so I have the graphs stacked one over the other; otherwise, typically, it could have been a 3 by 2 kind of grid that would have been created. Then I have added the key to the right, so auto.key is this. You can give a lot of things in the arguments to auto.key; here it says use the space on the right. And then, to rotate the x axis labels, I am using the rotate option.

One of the easier functions in ggplot would look like this. There is a function called qplot. What I pass to it is just Sepal.Length, Petal.Length, the data iris, and colour by Species. So I am plotting only one panel out of the grid that we have seen, just the
petal length versus the sepal length, and I am coloring the points by Species from the iris data. I am also specifying that the size of the bullets should be in proportion to the petal width, therefore the bullet sizes vary. And then I have added an alpha for transparency in the bullets, because once you start having bigger bullets they start merging into each other. On the right side you automatically get the scale, the identifier, and so on. In fact, what I have used for the size is the log of petal width, not petal width; if I use petal width the bullets are much lighter.

If you have several time series to be plotted together, again using the qplot function, this is what you get. There is a data set called Orange, which ships with R: the growth of orange trees, the annual growth and the circumference of the tree stem. That is what is being plotted, for 5 different trees, so you get these 5 lines. This is the default plotting of such data in ggplot using qplot.

But that is not really all. Another qplot: there is another data set called diamonds. In the diamonds data set you have price, carat, cut and several other fields. Price is typically considered a function of the carat and the cut. The cuts are graded as, whatever: ideal, premium, very good, fair. So the fair-cut diamonds can have large carats, but the ideal cut usually do not have large carats in this data, yet the prices are high, which is what people who buy diamonds will know. I won't, but that is in the data. And what I have done here is specify the geometry. Earlier we had something called geometry too, but here I have not left it to qplot to decide what geometry to plot: I have said that I want points and a smooth curve. So it has fitted a smooth curve for each of the subsets of the data, so I get 5 different lines, which are curves, and you see how the prices change.

But that is not the entire story. As I said, ggplot gives you control over all the 6 or 7 things that I mentioned; those 7 things are the defaults that
are already available. Then layers: each thing is a layer that I am stacking onto the plot. And then of course scales and the coordinate system. Each layer by itself is defined by the data, the mapping that you want (what against what), the geometry that will be used, which we have looked at a bit, then the stat, and the position it is going to have on the entire plot area. We will take one example where I am going to expand one particular ggplot into each of these components and demonstrate. I am running a bit short on time if I want to cover plotting on maps and showing some interactive things, but let's see how I split it up.

The aim is to create a plot something like this, which is the human development index versus the corruption index. This is data available from UNDP; you can download it for every year. This is 2011, HDI versus CPI, country-wise. So for each country you have the human development index and the corruption perception index. And of course every country belongs to a certain region, whether it is the Americas, Asia Pacific, Africa and so on. And then you want to add some kind of fitted curve on that. So let's go ahead and do it.

First thing: a plot in ggplot, as well as in lattice, is actually an object that you create, and then finally you say that I want to print this object, OK? So I am creating this object as pc1, as just a ggplot of what data? The data is stored in a data frame called dat, so that's what I am plotting. Now aes stands for aesthetics. So what are the aesthetics that I want? aes means whatever is being shown, OK? The rest you can supply, and it will do the calculations for it. On the x axis I want CPI, on the y axis I want HDI, and I want to color the points by the region to which they belong. Very similar to what we have done earlier with the plot function, nothing new. Now this is where it starts looking different: I add to that a point geometry, which is
using a particular shape, similar to the pch argument in plot, and there are various shapes available; you can also add your own.

Next, I want to specifically add labels for certain countries here which are of interest or appear different. So I create a vector of labels: I have got Afghanistan, Nigeria, Bhutan and so on. And then I again go ahead and add to pc1 a geometry called text, which is what we would have done using the text function in the case of the base plot function. Again aes, because I want to put text on the plot with the countries as labels; that is what it will do. I want it in black color, of size 3, and then the adjustment: relative to the point, where is it that you want to place the text? That is what I am specifying. And for the data I have used only the subset of countries in the labels list; otherwise it will try to label every point.

Then finally I am adding a smooth line geometry, which is essentially a polynomial fit. The method is a linear model, so here you can specify what kind of method to use; you can actually create your own function and pass it to that. And the formula used is a polynomial of second order. As I said, you can create your own as well.

And then finally, by default what you would have got is the typical gray background with grid lines on it that you have seen in the earlier qplots. Instead I want black and white. There are different themes available, so I use theme_bw. Then there is the scale: with scale_x_continuous you can say that I want the x axis label to be "Corruption Perception Index, 2011, 10 is least", and so on, and the same thing for the y axis label. And then the legend: in the earlier plots, if you remember, the legend was appearing on the right. This I want to move to the top, so the legend position is top and the direction is horizontal; otherwise it will keep the entries stacked as a block on the top. That is how you can play around.

Next, what I want to show is the igraph
package. The igraph package is used for analyzing graphs, in the graph theory sense: graphs, networks and so on, OK? There is an excellent igraph demo function available in the igraph package itself. I don't think I will be able to do anything more than that here, so just run the demo and look at that. There are different types and structures of graphs that you can look at, and it is a very, very nice demo. This particular one is the social network of friendships between 34 members of a karate club in the US. It tries to show the cohesion in the team.

The next thing that I wanted to show is RgoogleMaps, OK? With RgoogleMaps you can fetch a Google map: give it the lat/long and the zoom level. You can also specify the boundaries, and you can specify the type of map that you want. And then that map you save as a PNG file. NEWMAP = FALSE essentially says that if you have downloaded it once, don't download it again. I have a data frame called QLOC; the data frame has latitude and longitude. This particular case was actually a tiny piece of work that I had to do for an ecologist. They do surveys at various locations; this is a typical activity which is done these days before a project is approved on the ecological aspect. So they have to survey the area from the ecological perspective, and what is plotted right now here is what is called Shannon's diversity index. Every point was a survey site, and the attribute plotted was the Shannon's diversity index for that particular point. And again, for the plotted points, the size as well as the color is a function of the index itself. The function used is PlotOnStaticMap; the arguments are very similar to any other plot function.

There is also a package called sp for plotting spatial data. That's a very powerful package. You can fetch map data at various granularities from every geological survey or geographical survey, so the GIS maps that are available are now
available in this format, so the R data format itself is being offered by most of the surveys, and you can pick it up from them directly. This is the URL, and they in fact go and point to the individual countries' own survey services. So this is obviously a map of Switzerland, showing which area is predominantly of which language. There is a vector of languages for each county or district, whatever it is called in the case of Switzerland. I am just using spplot on this gadm object, and you get nicely colored areas.

A beautiful, much better implementation of that is this. This is unemployment data; this is called a choropleth. Every district in the US has been colored using that. It is just 15 lines of code. The person who created it has used a separate colour scheme and colour packages; that is not necessary, you can do it just with sp and your base colour palettes. So this kind of representation might be interesting.

Then, for interactive graphs, there are packages available. One is gmovie, but that is used more as a teaching tool; I would have liked to show it, but now we are running short of time. Another thing is called R plus high plots; this has a Java interface for plots. Then there are two other products. One is rApache, which can be used to create an R web service. The other one is shiny. Shiny has basically two scripts: one is the UI script and the other is the server script. The server script is where you do the analysis; the UI script is where you just create the interface. And you just call runApp for that directory and that's it. I can probably show that. How much time are you giving me? 5 minutes? Then I won't run it; you can do it later on.

Then there is also R plus Google Charts, using googleVis, so essentially the Google motion chart and so on can be called. So what happens is, you do your analysis, finally plot it using the motion chart, and it is rendered in the browser.

Then I will show you a couple of posters, which would probably be a good kind of
target output of your hack night. All the plots out here, including the plotting on the maps, are done using ggplot. This is in fact, he mentioned, the next one. This one is again created by the creator of ggplot, Hadley Wickham, who was involved in this project. That's essentially all that I have. I will leave you with this, the Facebook activity graph, which was created a couple of years ago by someone at Facebook itself. For a similar graph I can show you a photo for it.

First question: given that ggplot2 has a lot of grammar for the representation of graphs (the vertices, nodes, the lines, everything), and given that d3.js also has a lot of grammar of its own form, what exactly is the difference between how ggplot represents it and how d3 represents it? Are there any efforts to use the ggplot grammar and then bring the graph up in d3.js? Because I would rather like a dynamic, interactive chart than a static one.

Your question is too big to be answered, but what I can tell you is that, for example, people must have seen the d3 that was created in the last elections. I have created something similar for a certain other problem, where we create the JSON and then pass it on to this. So you do the analysis, you get a JSON.

That is fine, but why not use the grammar that is there within ggplot in order to represent it in d3.js? What does that take?

Why do you want to then go out of ggplot? I mean, it's a language for representation; many of you can take it.

So the other question is, we discussed the classification of iris classes. If there are these three, the petal length, sepal length and the petal, this is not enough for a representation of the classification. What is the actual basis for the classification?

Obviously, right, you go back to the domain people, saying that this data can give you this level of classification; if you want more, give me more data which is meaningful, relevant, etc., from the domain point of view.

The reason the question came up was, maybe the classification
of a biologist is wrong. If there is a natural clustering at a certain point and the biologist says this belongs to this, maybe the biologist is wrong.

Maybe. In fact that's how people actually say that there is a third or fourth species. That's how people do that; that's a research methodology.

All right, so on the last one, the spatial data. Getting a JSON and representing it on spatial data is not a big problem, but where is the data available for storing all of these, you know, the boundaries, the districts of the United States for example, whatever you displayed earlier? Where is the data available for this representation, that this part should be green, this part should be red? Is how they are represented also a part of the package itself, or do we have to get it explicitly? And if you want to do it for a country other than the United States, where is a similar form of data available?

So the language vector is actually a vector corresponding to the districts, OK? I would have a data frame with the district in one column and the attribute that you want to plot in the second column, OK? So that's the data frame, and I look up the district and color the district accordingly. That's what spplot does.

Does spplot have built-in information? How does it understand that this part is Washington, this part is New York?

See, the gadm object that has been loaded from the URL is a list data structure, OK? Apart from the identifier, which is the district number or whatever it is, it also contains information on the boundaries of that district, essentially the x, y coordinates. There are several ways in which they can be represented. So have you heard of shapefiles? Right, so it's a shapefile.

The shapefiles are built into the data structure?

No, the shapefile is held as a data
structure, as a data structure in gadm. So there's a shapefile, and then there are the identifiers for each shape, and against those identifiers I have the attribute.

Are shapefiles available for the other parts of the world?

Yes, they are, and up to a pretty good level. You can have it, I think, even at the district level, like in India. Any other questions?

Excuse me, for plotting: if I want to take data out of something like Gremlin or a graph database and plot it using igraph, is there a way of normalizing the data output of something like Gremlin or Neo4j, for example? If I want to take data out of a graph database, because it all comes in the form of vertices, is there an interface that R provides for normalizing that kind of data?

The igraph package is more about analyzing and doing all these things than about plotting. Plotting is of course one aspect of it, but it's just one aspect of it.

So for relational data, where you have relationships between objects, what is a very good package to go for as a common visualizer? Let's say there are two objects and there are two-way relationships between them, and I want the same kind of visualizer, because what happens in a JavaScript library like, let's say, arbor.js, where you try to plot out a graph with complexity, is that it's very hard to filter stuff out.

Try igraph. Try igraph. Any other questions? If that's fine enough I want to propose for one