the recording. So also welcome everyone on Moodle, if you're watching this later or weren't able to watch it live. So welcome, welcome. Today we will have a guest speaker, like I just said to the people that are watching live. Misha will be talking. So I shortened today's lecture a little bit, but I hope that all of the points come across. It will be fine, it will be fine. And I hope we won't run into many technical issues. "Stream will start soon." Nope, we already started. All right, so this is mess-up number one: I opened the wrong PowerPoint. Oh my god. All right, that's the first delay that we run into. Let me see. We're at lecture number five now, right? So, bloop. No, that's the wrong button. That's the wrong button. Don't get nervous. Just be cool. Be cool, yeah. "Stream will start soon"? No, we already started. All right, so today, yeah, so I'm professional. I know, I know, I know. I'm not a professional streamer, although I do get paid for it, so that makes me kind of professional, right? But today we will be talking about plots, plots and more plots, and this fits in really nicely with our guest speaker's theme. So let's just start. I'm really excited.
Before we start, I wanted to make an announcement, because I mailed the Prüfungsbüro and the exam date is still unknown. But since last year lecture number five also had the information about the exam, I do want to notify you guys about what you need. You will be required to have a webcam. You will be required to have audio and a quiet room, that's the most important thing. You need a mobile phone to take a photo, and pen and paper. The way that we are going to do it is that there will be a form in which you agree to online examination, and that will be mailed to you. You sign it and you send that form back to me via physical mail. I will probably just put the form on Moodle, which is probably the easiest. The way that it works is that I put the exam questions on Moodle and they will be hidden.
And then when the exam starts, we will have a Zoom meeting. Everyone who wants to do the exam joins the Zoom meeting and then you show your student ID. Once everyone has shown their student ID and I can validate that you are who you say you are, which is always difficult online, we will open up the exam questions, and the idea is that you just write the exam on paper. I will monitor you during the exam, and of course you have to keep your webcam and your audio on, so I know that you're not sneakily having someone behind your camera giving you the answers. After the exam is done, you photograph the pieces of paper that you used and you send me an email before the deadline. I don't know exactly how long the exam will be, that's something that I still have to discuss with the Prüfungsbüro. You send in the photos, and then of course you also have to send in the physical exam on paper. That's just because it's Germany, so I have to have the physical exams in my hands before I can give you a grade. This worked out really well last year and also in the winter semester, so that's how we will do it. And of course when I know the exam dates, when they are finalized, I will tell you guys. We might also want to do a poll on when, because I hate making re-exams. I will think about making a poll on Moodle so that people can vote and we get the most people on a single day.
All right, some general remarks. I already mentioned this in the email that I sent around because we moved, but please attend the Zoom meetings. Even if you have no questions, it's always good to hear what questions other people have, so you can always learn from that. And of course, ask as many questions as possible. And send me an email if you run into any problems while you're doing the assignments.
Of course, try them first. But if you get stuck, then send me an email with your question. I think people are not using this option enough. It could also be that the assignments are way too easy, and then let me know as well. If you're finished with them in like 15 minutes, that's good feedback for me that I should put in some additional, harder questions. Also, if you take like five hours to do them, let me know, right? It's feedback for me and it just helps me make the course better for next year, so all feedback is appreciated. I also wanted to mention that there are more assignments available if you want to practice more, because I know from experience that people only become good at programming by practicing. So you should practice a lot. If you want more assignments, that's no issue. I can make more assignments for you guys, and that will be good, because practicing more means that you become a better programmer quicker.
All right, so off to the lecture. Five minutes, good. I will be giving you an introduction to plotting in R, and I will try to explain what the difference is between S3 and S4 objects, how they relate to plotting, and how you can create functions and objects that auto-plot. If you have a matrix and you call image on it, then R figures out what to plot. But you can also create your own custom plotting functions for your own types. In R you can define your own type and put types of data together. Once R recognizes that you have given it this type of data, it can call a custom plot function to make nifty plots which are not generally possible. And of course we will be talking about some custom plots, like point plots and scatter plots.
The thing that I always like to introduce to people is chromosome plots. Since we're working in biology, we often have the chromosome as the thing that anchors our information, right? If you're talking about genes or proteins or microRNAs or all of these kinds of things, they all come with a genomic location, which means that when you plot them, the chromosomes are the thing that we anchor all of our plots on. I will talk a little bit about circle plots and about colors, and that's what we're going to talk about today.
So in R, objects that you create are called S language objects. If you just create a variable, then this variable has a type, right? And we talked about types a lot. You have numeric, logical, character and these kinds of types, but you can also come up with your own type. So we can set up custom classes, and this doesn't just hold for the plot function, it also holds for the print and summary functions. The print and summary functions are functions that are called when R detects a certain kind of object, and using the class R knows which function to call. During the entire lecture we will use myobject as the name of the object that we are going to define.
So let's first make a simple object. Here we are making an object, and this object is a list. The basic type of the object is list, and the list in our case contains in the first position something called chr, which stands for chromosome. In this case there are two things in there: an element with the name 1, which has 15 stored in it, and an element with the name 2, which has 10 stored in it. Then in the second position in this list we have something called genotypes. Genotypes are something that you use a lot as a geneticist, so it's things like single nucleotide polymorphisms or Lycor markers or other things.
But in this case I'm just creating a matrix in the second position of the list, and I put in 250 random numbers which I round, and it's going to have 10 rows and 25 columns. I think everyone is by now familiar with how to create a matrix. If I do this and store it in myobject, then of course R doesn't know that this is a special kind of object. I could make not just one of them but 10 of them, which are all a list where the first position is called chr and the second position is called genotypes.
So the thing that we're going to do to really make it an object, so that R also understands that it's an object, is to say that the class of myobject is a vector which contains list, because we don't want to lose the list functions; a lot of built-in functions in R already work on lists. But we are going to add a second class, and this is called myclass. So I now have an object called myobject which has a second class, myclass, next to the class that it already had, which is list.
Now when I type myobject, R looks for a print function for the class myclass, and it doesn't find one, so it falls back to the standard way of printing a list instead of using anything class-specific. It will look something like this: chr with 1 and 2, containing 15 and 10, and then the genotypes. But at the bottom it will print that this object has an attribute called class, so it has an additional hidden attribute, and that means that it's a list but that it's also of type myclass.
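The object described above can be sketched as follows. The seed, the use of round(runif(250)) for the random genotypes, and the order of the class vector are assumptions made here for illustration; S3 tries the classes from left to right, so "myclass" is placed first.

```r
set.seed(1)  # assumption: fixed seed so the random genotypes are reproducible

# A list with marker counts per chromosome and a 10 x 25 genotype matrix.
myobject <- list(
  chr = c("1" = 15, "2" = 10),
  genotypes = matrix(round(runif(250)), nrow = 10, ncol = 25)
)

# Keep "list" so list-based functions still apply, and add the custom class.
class(myobject) <- c("myclass", "list")

class(myobject)                # both class names
inherits(myobject, "myclass")  # TRUE
is.list(myobject)              # still a list underneath
dim(myobject$genotypes)        # 10 rows, 25 columns
```

Typing myobject at this point still prints the full list, followed by the class attribute, because no print method for "myclass" exists yet.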
Normally, if you store genetic data, you have hundreds of thousands of single nucleotide polymorphisms on, for example, 10,000 animals. Just using the list print function, it will continuously scroll down your screen, and it will take a long time for R to print all of the data inside this object. So what we can do is make a custom print function. We define a function called print.myclass, and this is the definition that the function should have, right? It's a variadic function definition like we talked about in the section on functions, so this function gets something called x, which it has to have.
"Denny Shelley, you sound very nice in English." Thank you, thank you. I talk English a lot, so I hope that it's understandable and that not too much of a Dutch accent shines through.
So you create a function which gets an object called x, and then you have the variadic argument, which is dot, dot, dot. The dot, dot, dot means that you can pass more parameters to the function. In this function I define custom printing: every time something of type myclass gets typed into R, I want this custom print function to be called, and I want it to print out that it is a myclass object, then a new line, and then the content, but in a very summarized way. The way that it does that is that I sum x$chr, because what I put in chr is the number of markers per chromosome. Then there's a second piece of content, because the list has two elements and the second element is the genotypes, and for that I just say nrow of x$genotypes.
So now when I type myobject, R finds the print.myclass function, and since print.myclass is more specialized than the standard list printing, it will use the print function that we just made. It will just print: myclass object, content 25 markers, content 10 individuals. So it's not going to scroll down the screen, and even if you put in a genotype matrix which has 100,000 rows, the printout is only three lines. You can't really overload R anymore, and this is really useful when you start writing code for other people, because you get a summary of your main object. A lot of packages use this to define custom objects.
Of course we can also overload the plot function. If I say plot(myobject), it will error out, because R does not know how to plot this object; there's no standard way to plot a list in R. What we can do is again define an overload of the plot function. I'm saying plot.myclass is a function, and it takes the same input: x, which is the object that has the class, and then again dot, dot, dot for the variadic arguments. In this case, what I want is that when someone calls plot on myobject, an image appears, and this image has to be made from the second element stored in myobject, the thing that we call genotypes. Then I'm taking the dot, dot, dot and giving it to the image function. That means that when people plot the object by saying plot(myobject), it will make this plotting window, but because of the dot, dot, dot, they can still change the color if they want to. You can add the col argument, so you can say plot(myobject, col = c("gray", "white")) or something, and then it will use the colors that the user supplied.
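A minimal sketch of the two overloads described above. The exact wording of the printed lines is reconstructed from the description, and returning invisible(x) is a common convention added here, not something stated in the lecture.

```r
myobject <- list(
  chr = c("1" = 15, "2" = 10),
  genotypes = matrix(round(runif(250)), nrow = 10, ncol = 25)
)
class(myobject) <- c("myclass", "list")  # order is an assumption, see above

# Short summary instead of dumping the whole list to the screen.
print.myclass <- function(x, ...) {
  cat("myclass object\n")
  cat("Content:", sum(x$chr), "markers\n")
  cat("Content:", nrow(x$genotypes), "individuals\n")
  invisible(x)
}

# Draw the genotype matrix; extra arguments in ... are passed on to image().
plot.myclass <- function(x, ...) {
  image(x$genotypes, ...)
  box()  # image() leaves the top and right border open
  invisible(x)
}

print(myobject)                           # three summary lines
plot(myobject)                            # default image colors
plot(myobject, col = c("gray", "white"))  # user-supplied colors via ...
```

Because the extra arguments are handed straight to image(), anything image() understands (col, main, axes and so on) can be supplied by the caller without changing plot.myclass.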
So that's the way that you do it. I use the box function because normally, when you use the image function, it won't draw the black lines on the top and on the right side, so I just say box to put a box around it. Of course you can make very nice-looking custom functions, and this is just a way of showing you guys that it's possible to do this. We can also overload the image function. If I say image(myobject), again, image does not know how to make a graphic of a list. So we could make the same overload that we did before and now call it image.myclass. When I call image afterwards, it won't produce an error but will just call my custom plotting function and show the user the same picture as before.
So the philosophy behind plotting in R is that it uses the artist's palette model, and that means that you start with a blank canvas and you build up from there. It's like a painter looking at a scene. The first thing the painter does is take an empty canvas and paint the background. I think everyone's seen Bob Ross, and the way that Bob Ross does it is that he takes the big brush and makes it blue where he wants the sky to be, and he takes green and colors the other stuff green. The same thing works in R. You start off with a plot function to create a kind of empty plot, then you plot the data on top of it, and then you can use things like annotation functions to modify or add to the plot. But the thing is that once it's on the canvas, you can't easily remove it, and you then have to start over, right? That's the same way that an artist works: they just paint from the background to the foreground, and that's also how R works. So everything that you plot will be on top of the thing that is already there.
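The background-to-foreground idea can be sketched like this. The data values and the shaded band are made up purely for illustration.

```r
# Hypothetical example data
x <- 1:10
y <- c(2, 4, 3, 6, 5, 8, 7, 9, 8, 10)

# 1. Empty canvas: type = "n" sets up the axes but draws no data.
plot(x, y, type = "n", xlab = "x", ylab = "y")

# 2. Background first: a shaded band, painted before the data.
rect(xleft = 3, ybottom = 0, xright = 7, ytop = 11,
     col = "lightblue", border = NA)

# 3. The data, drawn on top of the background.
points(x, y, pch = 19)

# 4. Annotation last, so it sits on top of everything else.
text(5, 9.5, "drawn last, so it is on top")
```

Swapping steps 2 and 3 would hide the points between x = 3 and x = 7 behind the band, which is the "can't easily remove it" drawback in practice.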
And of course the annotation functions like text and lines and points and axis, these are just things to make your plots look a little bit nicer and a little bit better. In my mind this is very convenient, because it matches how most people think about building plots and analyzing data. You want to, for example, first show the data that you measured on body weight, and then you want to overlay, so in the same plot you want to add, for example, BMI measurements. The drawback of course is that you can't go back once a plot has started. You can't adjust the margins later on; you have to make sure that the margins are proper to begin with. And the problem there is that it's difficult to translate to others once a new plot has been created, so there's no real graphical language. A plot in R is just a series of R commands which start with making an empty window and then adding to it one by one.
This is very different from the philosophy of, for example, ggplot2. ggplot2 is a plotting package in R which is used by a lot of people, and it has a bunch of beautiful plots. But here the plot is not built up from the back to the front. No, you have a command and then another command and another command, and these are more or less added together to create one plot. So you can take parts out and you can put parts in. It's just a different philosophy when you use ggplot2 compared to the basic R plotting routines. During this lecture we will focus on the basic R plotting routines, and it might be that I can convince Paola to give you a ggplot2 lecture in case there are a lot of people that want to learn how to use ggplot2. I can't teach you that, because it's not the way that my mind works. I'm really stuck with this artist's palette model, and the way that I make plots is different from how other people might do it. But I just want to show you guys during this lecture the way that I do it.
And if there's a lot of demand for it, then we can have a ggplot2 lecture as well. Although we are running kind of short on lectures, but we'll have to see how it works out in the end.
So a very, very small example. We load the library datasets, which has a whole bunch of nice data sets for us, and then we load the data into R using data(cars). So we are looking at the cars data frame. It's a small data set about cars: it has the speed of each car and the distance that it took to stop. Here I want to introduce you to the with function. The with function takes a data frame, and then in the second part of the with statement you can use any column directly by name. That's why I can say plot(speed, dist), because speed is a column in the data frame cars and dist is also a column in the data frame cars, right? So you don't have to say cars$speed, or cars, square bracket open, comma, speed between air quotes, square bracket close. It's just a convenience function which allows you to very quickly address columns in a data frame.
And this is what R produces. Of course, this is a very ugly plot. However, for a first data analysis this plot is really good, because you can see that there is a relationship between the speed of the car and the distance that it needs to stop.
Some of the important parameters when you're plotting, and we've already seen some of these: many basic plotting functions in R share a very limited set of parameters. Some of them we might have already seen, but I just wanted to introduce them again. They are things like pch.
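The small example can be written out as below. Note that in the cars data that ships with R the two columns are named speed and dist.

```r
library(datasets)
data(cars)

# Inside with(), the columns of the data frame can be used by name.
with(cars, plot(speed, dist))

# The same plot with explicit column selection, for comparison.
plot(cars$speed, cars[, "dist"])
```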
So pch stands for the plotting symbol, or the plotting character. That's how I always remember it: p-char, plotting character. The default, like you've seen, is just an open circle, and you can say pch equals one, equals two or equals three, but you can also give it a character. If you want to use the letter A, you can. You say pch is, then the double quotes with the letter A in there, and then it uses the A as the plotting symbol.
If you are doing a line plot, you can specify lty, which is the line type. The default here is a solid line, but it can be dashed, dotted, dot-dash and so on. You have about seven different line types that you can choose from, and you can specify them using an integer, so one, two, three, four, or by name. We have lwd, which is the line width, and this is for when you want to make one line much more prominent than the other ones, or much fatter than the other ones. It's specified as a multiple of the default width, so a line width of one is the default, and when you go to a line width of two, it's double the size.
col stands for the plotting color. You specify a number, a string, or a hex code. If you want to know which colors are available, you can call colors with nothing in there. You just call the colors function, and it gives you a vector of all the named colors that there are in R, because colors have names as well. So you can say col equals, double quote open, blue, double quote close, and that's just the blue color, but you could also specify the blue color by saying col is, between quotes, #0000FF. Right, so as an RGB color.
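A short sketch of these parameters in use. The data and the y-axis range are arbitrary choices made for the illustration.

```r
x <- 1:10

# Plotting symbols and colors
plot(x, x, pch = 1, col = "blue", ylim = c(-2, 12))  # open circle, named color
points(x, x + 1, pch = 2, col = "#0000FF")  # triangle, the same blue as hex code
points(x, x - 1, pch = "A", col = 2)        # a letter as symbol, color by number

# Line types and widths
lines(x, x + 0.5, lty = 1, lwd = 1)  # solid, default width
lines(x, x - 0.5, lty = 2, lwd = 2)  # dashed, double width
lines(x, x - 1.5, lty = 3, lwd = 3)  # dotted, triple width

head(colors())  # the first few built-in color names
```

col2rgb("blue") and col2rgb("#0000FF") give the same red, green and blue values, which is a quick way to check that a name and a hex code refer to the same color.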
So there are many ways to specify colors, and there are even more beautiful colors if you use, for example, the RColorBrewer package. But the default colors, you can list them all using colors. Also, you can Google it. There are enough images out there which show all of the different R colors with their names. xlab stands for the label on the x axis and ylab stands for the character string which is put on the y axis. These are parameters to the function that you are using, so you can say plot(dataset, pch = 5).
The par function allows you to set global graphical parameters. During the assignments we already discussed it a little bit, because the assignments were making some plots. These parameters affect all of the plots in an R session, and they can be overridden with specific arguments to a specific plotting function. So they go beyond the standard plot function, because they don't apply to a single plot but to all of the plots that you will be making during the same session.
las is the orientation of the axis labels on the plot. An las of one means that the labels are always horizontal, and an las of two means that they are perpendicular to the axis. You can set the background color of the plot with bg. This is especially useful when you're making plots that you want to use in a PowerPoint presentation, because then you generally want the background not to be white but transparent, so that you can have something behind the plot, but you can set any background color that you want. You have mar, which is the margin size, and this is the margin size within the plot, and then you have oma, which is the outer margin size, the size outside of the plot. This has to do with the fact that you can have multiple plots next to each other.
So if you have two plots next to each other, then you have a mar, which is how far the plot is from the theoretical window, and then the oma is the size which is between the two different plots. But these are just parameters that you have to play around with to make nice plots. We have already seen mfrow and mfcol, which both take the number of rows and columns of plots: with mfrow the plots are filled in row-wise, and with mfcol they are filled in column-wise. Here I always mess it up, so I say mfrow one, two and then I actually meant mfrow two, one, but R is good at fast prototyping, so you can just change it whenever you need.
You can also get the current settings. You can use the par function to, for example, get the current line type. If you say par("lty"), it tells you that the line type is currently set to solid, because you can change it of course. The default color for plotting is always black. The default pch is one, which is the open circle. The par of the background is white, because normally you have a white background when you plot. The standard margins are 5.1 on the bottom.
"Do you need to set each parameter individually?" You need to override them individually. All of them have a default value, right? The background color defaults to white, so if you want a different background color you just say par(bg = "blue") and then you get a blue background. The par function is a variadic function, so you can specify multiple parameters at the same time. You can say par(bg = "white", lty = "dotted", col = "blue"). You can stack them all in one call, so you only need one par call to change five or 10 parameters. You don't have to do par bg is blue, then par mar is, and so on. You can specify multiple in a single go.
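Getting and setting parameters with par can be sketched as follows. Saving the return value of par() and restoring it afterwards is a common habit added here; the particular values are arbitrary.

```r
# Query current settings
par("lty")  # current line type
par("pch")  # current plotting symbol
par("mar")  # current margins: bottom, left, top, right

# Set several parameters in one call. par() returns the previous values
# of everything it changes, so they can be restored later.
old <- par(mfrow = c(1, 2), las = 1, mar = c(4, 4, 2, 1))

plot(1:10, main = "left panel")   # first plot goes into the left slot
plot(10:1, main = "right panel")  # second plot fills the row

par(old)  # back to the previous settings
```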
So you can get them and you can also set them, and we will see how to set them in the rest of the examples.
All right, so the basic plotting function in R is of course the plot function, and the plot function makes a scatter plot. It uses the data that you give it to make a scatter plot, so it has an x and a y component and it expects you to give it two vectors. You can also specify the type. You can use plot to make a dot plot, so with points, but you can also make lines in the plot directly by saying type = "l".
If you want to add lines to a plot, then you can use the lines function. Imagine that I have a plot and I want to highlight something, or I want to put a line through the plot somewhere; then I can just use the lines function. The lines function adds a line to the plot, and you give it a vector of x values and the corresponding vector of y values, or you can specify this using a two-column matrix where the first column is the x and the second column is the y. The function just connects the dots and makes a line through all of the different dots. points is used to add points to a plot.
You can use text to add a text label to a plot, and you specify the x and y coordinates. Imagine in the car example that we have that I want to put the name of this car there. I can use the text function and say: put a text at 10, 40, and then "put the text here", and it will use that as the x coordinate and the y coordinate. By default the text is centered on that coordinate; if you want the string to begin there instead, you can left-align it.
title is the annotation for the x and y labels, the title, the subtitle and the outer margins. So if you don't want to plot inside the plot window but something outside, on the top, on the left, on the right or on the bottom, then you use title.
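A sketch of these annotation functions on the cars example. The coordinates and label strings are arbitrary, and adj = 0 is used so that the label begins at the given coordinate instead of being centered on it.

```r
library(datasets)
data(cars)

plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")

# lines(): connect the points given as x and y vectors
lines(c(5, 25), c(20, 100), lty = 2)

# points(): add extra points, here the mean of the data
points(mean(cars$speed), mean(cars$dist), pch = 19, col = "red")

# text(): label at x = 10, y = 40; adj = 0 left-aligns the string there
text(10, 40, "put the text here", adj = 0)

# title(): main title and subtitle outside the plotting region
title(main = "cars", sub = "annotated with lines, points and text")
```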
mtext can be used to add arbitrary text to the margins, so inside the inner margins and inside the outer margins of a plot. title and mtext are slightly different, because title always works on the outside of the plot, but mtext can also be used to write across the margins. It's not used a lot, but I just wanted to show you guys. axis is for adding axis ticks and labels. I think during the discussion of the assignments I showed how to use axis by putting "bird is the word" on different parts of the axis.
All right, so for things like heat maps we generally want to use color ranges. So not just a single color, but a nice range. There are different built-in color ranges; R has like five of them. The rainbow colors of course go from red to yellow to green to blue to kind of purplish and back to red. So it wraps around. Yes, it's kind of a circle, right? The highest color is the same as the lowest color, so it really has no real beginning and end.
Generally for heat maps you use heat.colors, which go from red at the low end to white at the top. And generally if you take the heat colors you want to flip them around, because you want the low values to be white and the high values to be red, and the default is the other way around. But you can just flip them by using the rev function to reverse the colors.
You have terrain.colors, which gives you terrain colors that go from a really nice green all the way up to white, and it's brownish in the middle. This is used when you make topographical maps, right? The zero level is generally the ground level, so there's grass growing there, and the higher you get up the mountain, the more grayish brownish it gets, until you get to the top of the mountain, which generally tends to be whitish.
You have topo.colors, which are more like topographical colors. It's very similar to the terrain colors, but it adds a blue area for things like sea level, right? So you start off underwater, then you get near the surface, then you have the green grass, then you have a little bit of yellowish, like the sunset, and then it goes up.
And the one that I like a lot is cm.colors. I just like it because it's bluish with a little bit of pinkish, and it's white in the middle, which is also really nice. The cm.colors are really nice when you do things with correlations, because correlations range from minus one to positive one, and in correlations you're generally interested in the extreme values and not in the correlations that are around zero. So I find cm.colors really useful, or really beautiful, if I'm plotting a correlation matrix.
Another handy color function that you can use is gray. It's not a range, but you can get a specific gray tone. The number runs from zero, which is black, to one, which is white. So gray(0.1) gives you a gray which is almost black, gray(0.5) is the gray in the middle, and gray(0.9) is a very light gray, so it's almost white. It's useful when you want to use gray tones, for example when you don't want to pay for color figures in a publication.
Of course, we can also add transparency to our plots. Here you see two histograms that are plotted on top of each other. You can define a color using the rgb function. You have to give the red channel, the green channel and the blue channel, but if you use the rgb function you can also give the alpha. The alpha is how transparent something is, right? Normally, if I plotted these two and set the alpha to one, then the red one would be completely opaque, so you would not be able to see this green part here, where the green one goes behind the red one.
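The built-in color ranges, gray and transparent colors can be tried out as below. The mtcars data, the number of colors and the simulated histogram data are assumptions chosen only for the illustration.

```r
n <- 10
rainbow(n)
rev(heat.colors(n))  # reversed: low values white, high values red
terrain.colors(n)
topo.colors(n)
cm.colors(n)

# A correlation matrix drawn with cm.colors, fixed to the range -1 to 1
m <- cor(mtcars)
image(m, col = cm.colors(20), zlim = c(-1, 1))

# Gray tones: 0 is black, 1 is white
gray(c(0.1, 0.5, 0.9))

# Two overlapping histograms with semi-transparent fill colors
set.seed(1)
a <- rnorm(500, mean = 0)
b <- rnorm(500, mean = 1)
hist(a, col = rgb(1, 0, 0, alpha = 0.4), xlim = c(-4, 5),
     main = "", xlab = "value")
hist(b, col = rgb(0, 1, 0, alpha = 0.4), add = TRUE)
```

Fixing zlim to c(-1, 1) keeps the white middle of cm.colors at a correlation of zero, even when the observed correlations are not symmetric around zero.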
But by setting the alpha to 0.8, or to 0.4 or 0.2, you can see that it becomes clearer, and you can see the one distribution which is behind the other one. So you can use alpha colors as well, to make things which overlap still be visible to the person who's looking at your plot.
There are a lot more colors in R, and if you really want to have high-quality, good-looking, professional plots, then you can use the RColorBrewer package. It defines three different types of color sets. You have the qualitative sets, which are meant for categories, where the colors just have to be easy to tell apart. Then you have the sequential sets, which are very good when you're dealing with data that has a range, for example between zero and 100, and that runs from low to high. And then the diverging ones are the ones which I use a lot, and that is because they have the most contrast between the two ends.
So here there are some examples of sets; I think these are all qualitative sets. You can see here, for example, that you have Paired. So you have first blue, so light blue, dark blue, then light green, dark green. These colors have been chosen for the maximum difference in contrast. So if you have data which is very close to each other but you still want to show the difference, then you can use these sets. There are many, many of them. These are just a couple of examples of the sets that you have.
If you install the RColorBrewer package, then you can load it using the library function. Then you can say brewer.pal, and the n here is the number of colors that you want to take from the set. Sets generally have between seven and 12 unique colors that are specially chosen for a certain property. You also have sets for people who are colorblind.
For example, if you know that you're going to present for 200 people, then half of those people will be male, and out of those something like 10% will be red-green colorblind. So you want to make your plot in such a way that people who are colorblind don't just have to look and guess. What you can do is say brewer.pal, and then you specify the number of colors that you need. So you might need three colors or you might need seven colors. And then you give the palette name, for example I want to select from the Accent colors, or I want to select from Set1 or Pastel2. And then you just store the colors in a variable called mycolor and you can select from them. They have many, many different ones. So these are actually the qualitative sets. These are the sequential sets. The sequential sets go from one color to another color with colors in between, so they are chosen to show a clear order from low to high. And we also have the diverging colors, which generally tend to have a middle color. They are very good when you have data which has positive and negative values, so they show the most difference between the two ends. So you have, for example, the brown, blue, green one, BrBG, and this is white in the middle. So again, these are really good when you are dealing with data like correlation matrices, which range from minus one to positive one. And the Spectral set is one that I also use a lot because I just like the look of it. What if you need more than nine colors? For example, the Blues set in here. Yeah, so you have the Greens, the Oranges and the Purples, and there's also Blues. And the Blues set only has up to nine colors in there. So if you need more than nine, what do you do? Because it might be that you wanna have 50 colors or a hundred, right? Or you want to have a very, very smooth transition.
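The brewer.pal call described above can be sketched as follows. This assumes the RColorBrewer package is installed; the fallback to rainbow is my own addition so that the sketch still runs without the package, and the choice of seven colors from Set1 is just an example.

```r
# Assumes RColorBrewer is available: install.packages("RColorBrewer")
if (requireNamespace("RColorBrewer", quietly = TRUE)) {
  library(RColorBrewer)
  # n = number of colors to take, name = which palette to take them from.
  mycolor <- brewer.pal(n = 7, name = "Set1")
  # The package ships a table that flags the colorblind-friendly palettes.
  cb_friendly <- rownames(brewer.pal.info)[brewer.pal.info$colorblind]
} else {
  # Fallback (not part of the lecture) so the rest of the sketch still works.
  mycolor <- rainbow(7)
  cb_friendly <- character(0)
}

# Show the selected colors as seven equally sized bars.
barplot(rep(1, length(mycolor)), col = mycolor, axes = FALSE,
        main = "Seven colors stored in mycolor")
```

Individual colors can then be picked by position, for example mycolor[1] for the first group and mycolor[2] for the second.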
So if you just take the nine blue colors that are specified and you plot them, then it looks like this. So there's no real gradient that you can make with them. But what you can do is store your palette and then use it in the colorRampPalette function. The colorRampPalette function takes a palette and then you can say: extend this to a hundred unique colors. It will take all of these colors and interpolate colors in between. And then when you plot it, so here we just plot from zero to nine, and here we also plot from zero to nine, but you can see that here you have a really nice gradient transition. So this is really nice if you wanna have very smooth looking heat maps or images. All right, so the with function. I already talked about it a little bit; it is a convenience function. It turns the columns of a data frame into variables, and that is why read.table has check.names, right? Because if you read in a table, then R demands that the column names follow a certain structure, like a column name is not allowed to start with a number. And this is because of the with function. Like I said, it takes a data frame and makes a variable from every column. So instead of having to select, for example, the Ozone column from my data, I can just say with my data, and then I can even define a block using curly brackets, and within the block I can use the word Ozone and R will automatically know that it needs to take the Ozone column from my data. So it saves you a lot of square brackets and quotes, just using the with function, and I will be using it a lot. So also during the assignments, use the with function. It will save you a lot of typing. It will save you a lot of square bracket frustration and strings which are not ended. So it's just a convenience function.
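Both ideas from this part, extending a palette with colorRampPalette and using with to refer to columns by name, can be sketched in a few lines. The nine hex values below are typed in by hand as a stand-in for the nine-color Blues set so that the example runs without RColorBrewer; treat them as an approximation of that palette rather than the real thing.

```r
# Hand-typed stand-in for brewer.pal(9, "Blues") (an assumption, see above).
blues <- c("#F7FBFF", "#DEEBF7", "#C6DBEF", "#9ECAE1", "#6BAED6",
           "#4292C6", "#2171B5", "#08519C", "#08306B")

# colorRampPalette returns a function; calling it with 100 interpolates
# 100 colors that pass through the nine originals.
blue_ramp <- colorRampPalette(blues)
blues100  <- blue_ramp(100)

op <- par(mfrow = c(2, 1), mar = c(1, 1, 2, 1))
image(matrix(1:9),   col = blues,    axes = FALSE, main = "9 colors: visible steps")
image(matrix(1:100), col = blues100, axes = FALSE, main = "100 colors: smooth gradient")
par(op)

# with(): columns of the data frame become variables inside the block.
with(airquality, {
  plot(Wind, Ozone, main = "Ozone and wind in New York City")
})

# The same plot without with() needs the data frame name every time.
plot(airquality[, "Wind"], airquality[, "Ozone"],
     xlab = "Wind", ylab = "Ozone", main = "Ozone and wind in New York City")
```

The second plot call is only there to show how much typing with saves; both calls draw the same points.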
So let me show you guys how I build up a plot. For example, we look at the airquality data set. We already saw it a couple of times, but it contains things like wind and ozone and temperature. So I can say: with airquality, plot the wind on the x-axis, plot the ozone on the y-axis, and then give it a title saying ozone and wind in New York City. And of course I made it a little bit smaller here, otherwise it doesn't fit the slide very well. And then, for example, I can use the with function again. So I can say: with the subset of airquality where the month is five, take the points of wind and ozone and color them blue. So I use the subset function to take out a single month, and then I use the points function to make them blue in the plot. You can barely see it here on the slide, but this is the way that it comes out of R by default. So I just took R, did the plot and pasted it in here, but it's not a very good plot for a PowerPoint presentation. It's very hard to see this on a screen unless you're looking at it on a 60 inch monitor like I am, because then it's still pretty visible, but for normal people and on a beamer this would not be good enough. So let's improve the plot step by step. The first thing that you do is add a legend. So I add a legend on the top right, and the other thing that I did here is increase the contrast, right? Because blue and black are very close together, so it's very difficult for people to see the difference, but the difference between blue and red is really clear for a lot of people. Even if you're red-green colorblind you can still see the difference between red and blue. So I'm saying here with airquality, and the first thing that I'm doing is disabling the plotting of the points, because I want to plot the points myself, right?
So the thing that I'm doing is saying: with airquality, plot the wind versus the ozone, add the title, and then say type is n, so don't plot the points. It doesn't plot the points to begin with, and then I'm going to plot the points manually. So I'm saying with, and now I'm subsetting airquality for May, right? So the fifth month, and I'm going to color these in blue. And with the airquality subset where the month is not month number five, so when it's not May, plot the other ones in red. And then I add the legend, and for the legend I say: put it on the top right, use pch is one because those are the points that we're using, give the colors blue and red, and then the legend text is May for blue and other months for red. The plot still doesn't look that good, but it's already better visible, right? You can now see which points are blue and which points are red, more or less. So you can still do a lot to improve it, right? We can make it more visible. I gave you a list here of all of the default plotting symbols in R. So instead of using the default open circle, you can use a closed circle, which is pch 19, not 18. No, 18 is the diamond. So there's a little error on the slide, but pch 19 is the closed circle, right? And you see that going from open circles to closed circles already looks a little bit better, right? You can more clearly see what's going on. So using a different plotting symbol can make all of the difference in the world. Furthermore, since this is a PowerPoint presentation, I want the dots to be very, very visible. So I just set a global parameter to increase everything. I just say par, cex, so the magnification, and put it to 1.5, right? And you can see that the title becomes bigger, the y-axis becomes bigger, and all of the points also become much, much more visible in the plot. So that's already a lot better, right?
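The steps so far (an empty plot, blue points for May, red points for the other months, a legend, closed circles, and a global magnification) can be sketched like this. It is a reconstruction from the description, not the code on the slide, and the variable names may and other are my own.

```r
# Global magnification of 1.5 so titles, axes and points are readable on a beamer.
op <- par(cex = 1.5)

# Set up axes and title, but draw no points yet (type = "n").
with(airquality, plot(Wind, Ozone, type = "n",
                      main = "Ozone and wind in New York City"))

# Split the data: May versus all other months.
may   <- subset(airquality, Month == 5)
other <- subset(airquality, Month != 5)

# pch = 19 is the closed circle; blue versus red gives more contrast than blue versus black.
with(may,   points(Wind, Ozone, col = "blue", pch = 19))
with(other, points(Wind, Ozone, col = "red",  pch = 19))

legend("topright", pch = 19, col = c("blue", "red"),
       legend = c("May", "Other months"))

par(op)  # restore the previous graphical settings
```

Storing the result of par in op and restoring it at the end keeps the magnification from leaking into later plots.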
You can now kind of see what's going on. There are more things that you can do now, right? You can also say, well, I wanna highlight this point, right? So I want people that are looking at my presentation to see that this value here is a value that I think is important. For that I can take an arrow. I have to manually figure out how to point the arrow, but in this case the arrow starts at x is 3.4 and y is 140, and it points to x is 3.4 and y is 165, so it points to this circle. And of course I can give it any color, but I decided to have a red arrow here. One of the things that a lot of people want to show is that there's a relationship between the x and the y. So there's actually a very, very easy way in R to add the best fit line, right? So more or less the regression line. What I can do is say I have a straight line, right? I want to draw a straight line through my data, and the mathematical definition of a straight line is y, so the y position of the line, is a plus b times x. So a is the intercept: it's the point at which the line crosses the y-axis, so where x equals zero, right? Because zero times b is zero, so that part of the equation disappears and you're only left with the a. So a is called the intercept, and b is the slope of the line, so it determines how quickly it drops or how quickly it rises. To find the a and the b parameter for my data set, I can make a linear model. So I can say: with the subset of the airquality data set where the month is five, do a linear model. I model the ozone by the wind, right? Because ozone is my y and wind is my x. And now I can add this line to the plot by just saying abline. So an abline is a line which follows this a, b structure, and I specify a being the first parameter of the model and b being the second parameter of the model.
So the model contains a lot more, and the a and the b of the model are within the first list element. So I'm just saying: model, take the first element from the list, and then the first value in this vector is the a component and the second one is the b component. Color it red, which is wrong because month number five is blue, so this line should actually have been blue, and give it a line width of two so that it's more visible, otherwise it would be very, very thin, right? So I can just add the best fit line, and that's very easy to do in R. So the question is, are we done at this point with our plot? Well, not really, because there are still a lot of things that we need to do. We still need to make the axes look beautiful. Like the numbers here, they are facing the wrong way around, right? No one's going to tilt their head like this to read the plot. There are no units. Like, is this wind in kilometers per second, miles per second, light years per second? Same for the ozone concentration. The arrow needs to be defined in the legend, because people need to know why there's an arrow. Can you delete the line of the legend? You mean the line around the legend, this one here? Yes. Yes, the legend function also has a massive amount of parameters. I think this is the box type, so you can say legend, comma, bty is "n". Ooh, moderator, could you ban this P3NG? I do want to become famous, but I don't want to buy my followers. I'd rather have you guys just love me and show up because you like me a lot, instead of just buying them. All right, so the arrow needs to be in the legend, and the best fit regression line needs to be in the legend. The legend might be a little bit smaller; you can see that there's a lot of white space here. And of course, you can remove the line around the legend as well. You can also make the background color of the legend a little bit different to make it stand out more. So we're not done yet, right?
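Here is a minimal sketch of the arrow, the best fit line and a legend without a box, as described above. The arrow coordinates are the ones mentioned in the lecture; the legend layout (which entries get a point and which get a line) is my own assumption, and the line is drawn in blue to match the May points, as the lecture says it should have been.

```r
with(airquality, plot(Wind, Ozone, type = "n",
                      main = "Ozone and wind in New York City"))
with(subset(airquality, Month == 5), points(Wind, Ozone, col = "blue", pch = 19))
with(subset(airquality, Month != 5), points(Wind, Ozone, col = "red",  pch = 19))

# arrows(x0, y0, x1, y1): from (3.4, 140) up to (3.4, 165).
arrows(3.4, 140, 3.4, 165, col = "red")

# Linear model for May only: Ozone (y) explained by Wind (x).
model <- lm(Ozone ~ Wind, data = subset(airquality, Month == 5))

# The first list element of the model holds the coefficients:
# element 1 is the intercept (a), element 2 is the slope (b).
a <- model[[1]][1]
b <- model[[1]][2]
abline(a, b, col = "blue", lwd = 2)

# bty = "n" removes the box around the legend.
legend("topright", bty = "n",
       legend = c("May", "Other months", "Best fit line (May)"),
       col = c("blue", "red", "blue"),
       pch = c(19, 19, NA), lty = c(NA, NA, 1), lwd = c(NA, NA, 2))
```

Instead of indexing the list by position, coef(model) returns the same two numbers and is a bit easier to read.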
I've been talking for 49 minutes, so I need to speed up a little bit to give Misha all the time that he wants. But if I want to, for example, make multiple base plots, I can also use the with function, right? So first I set a parameter, mfrow, because I want to have three plots next to each other. I want to have a margin and I want to have an outer margin, because I want to have a little bit of space between them. So I'm saying here the mfrow is one, three, so one row with three plots next to each other. The margins surrounding each plot, so mar, should be four, four, two and one. So this is four, four, this is two and this is one. And then I'm specifying oma, the outer margin around the whole set of plots, because I want a little bit of extra space there; otherwise the title here would be stuck to the plots. So that is why the oma is there and the margin is there. And then I'm going to say with airquality, and now I'm going to define a big block, because I want to do all three plots in one go. And of course I can't do that in one statement, so I need to define a block using the curly brackets, just like we do in a function. And within this whole block I can just say: plot the wind versus the ozone, plot the solar radiation, which is another column, versus the ozone, and then plot the temperature versus the ozone. And then add an mtext. Yes, so the mtext is here in the middle. I'm saying ozone and weather in the city of New York, outer is true, and it will automatically put it in the outer margin. So the outer margin that I set with oma is what gets used here. It puts a kind of title above the titles of the individual plots. So mtext can be used to put an additional title or an additional piece of text on the plot. Pie charts. One of these things that people love looking at, and I think that they're really nice to make, and they're really easy to make in R as well.
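Before moving on to the pie charts, the three-panel layout described above can be sketched as follows. The mfrow and mar values come from the lecture; the oma values are not mentioned, so c(0, 0, 2, 0) is an assumption that leaves two lines of space at the top for the shared title. The variable layout_used is only there to show which layout was active.

```r
# One row, three columns; inner margins 4, 4, 2, 1 (bottom, left, top, right);
# outer margin with 2 lines at the top (assumed) for the shared title.
op <- par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
layout_used <- par("mfrow")

with(airquality, {
  plot(Wind, Ozone, main = "Ozone and wind")
  plot(Solar.R, Ozone, main = "Ozone and solar radiation")
  plot(Temp, Ozone, main = "Ozone and temperature")
  # outer = TRUE writes into the outer margin, above all three panels.
  mtext("Ozone and weather in New York City", outer = TRUE)
})

par(op)  # back to a single plot per page
```

Without the curly brackets, with would only evaluate a single expression, so the block is what lets all three plots and the mtext share the same data frame.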
So imagine that I have a pie chart and I want to add percentages to it. This is more or less the standard code that I use. So the slices are the sizes of the individual slices, and they don't have to add up to a hundred. That's the nice thing about a pie chart in R, because R will automatically scale them to a hundred percent. So here I'm just going to say that the US has 10, the UK has 12, Australia has four, Germany has 16 and France has eight. I don't know what it measures, it's just to have some values for a pie chart. So I define the relative sizes that I want for each of the slices, or the raw measurements that I have. And then I'm just going to define something which is called labels. Pie charts, delicious? Yeah, they are delicious, and they work really well. Like, if you can present your data in a pie chart, you should present your data in a pie chart, because people eat that shit up, and that's really true. So that's why pie charts are really nice. The nice thing is that you can add the percentages to the labels very easily, because you can just calculate the percentages by saying: give me the slices, divided by the sum of the slices, times 100. So I'm going to calculate my own percentages and then round them, and round just means that I don't want like 16 digits behind the comma. Then I'm going to change the labels that I have by saying: paste the percentages to the labels, and then paste the percent sign to the labels that I had, and don't use a separator, right? So here it puts a space, because paste puts a space in between by default, and here I could have just used paste0. So I'm just going to update my labels, and then I'm just going to say pie, slices, these are the labels that belong to it, use the rainbow colors, and then set a main, so a title. So just a little bit of code.
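The pie chart recipe just described can be written out as follows. The slice values and countries are the ones from the lecture; the variable names lbls and pct and the title text are my own choices.

```r
# Raw values per slice; they do not need to add up to 100.
slices <- c(10, 12, 4, 16, 8)
lbls   <- c("US", "UK", "Australia", "Germany", "France")

# Percentages: each slice divided by the total, times 100, rounded to whole numbers.
pct <- round(slices / sum(slices) * 100)

# paste() puts a space between label and number by default.
lbls <- paste(lbls, pct)
# Add the percent sign without a separator; paste0(lbls, "%") would do the same.
lbls <- paste(lbls, "%", sep = "")

pie(slices, labels = lbls, col = rainbow(length(lbls)),
    main = "Pie chart of countries")
```

Because the five values sum to 50, every percentage is exactly twice the raw value, which makes it easy to check the labels by eye.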
So the pie chart is made using the pie function. You just provide it with slices and it will automatically scale them to 100%, and you can just add the labels. And this part is just where we calculate the percentages and add them to the labels. You could have done this in one go, but I think it's clearer to do it in three lines of code, to show you that we calculate the percentages, add them to the label, then add the percent sign after the label without using any separator, and then we use pie to plot it. So dendrograms are also used a lot. A dendrogram is a kind of tree structure, and for that we need our data clustered, right? So I'm taking here the car data set, which I called mtcars, because I removed a couple of columns which were not numeric, and of course we can only make dendrograms based on numeric data. It's the same as calculating a correlation: you need numerical data. So here what we're saying is that we're going to calculate a distance matrix based on the car matrix. So the first thing that it will do is calculate for each car how far away it is from every other car. Then we're going to use the hclust function to make a clustering, and then we can just plot the clustering using the plot function, and then it looks like this, right? So it's very easy in R to make things like clustering plots. The thing that you have to remember is that you use the dist function, because the clustering function always needs a distance matrix and not a matrix of raw measurements. It needs a square matrix where every car is on the columns and every car is also on the rows, and each entry in the matrix is how far away one car is from another car. Now, here you see that the Maserati Bora sits kind of high up, and maybe you want to pull all of the labels to the same level.
If you just want to pull them all down and have the labels on the bottom, because you're working for example with animals, and none of the animals are extinct but you are calculating the evolutionary distance, then you can use hang is minus one. What it will do is take the labels and put all of the labels at minus one. You want to have them at minus one and not at zero, because otherwise they start overlapping with the lines, and now you have a little bit of additional space there. So hang is minus one puts the labels all at the same level and kind of aligns them together. If you want to show some more fancy dendrograms, like this one, which is called the triangle type, right? So this is a standard dendrogram, and this is a triangular dendrogram with a single origin. You can just take your clustering object, right? So the object that you used hclust on, and then you can say as.dendrogram. So I call it an hclust dendrogram, and then you have many more ways to plot it. You can say plot this hclust dendrogram, say type is triangle, then it looks like this, and you can make it look cool. So this is just a bunch of code on a single slide, right? I just put this here for you guys, and I'm going to very quickly explain what it does, because it walks through the tree, and then based on which group a certain thing belongs to it colors the different things. So this is to make an hclust dendrogram look cool, because the way that it looks here, right, it's not beautiful. You wanna use colors to kind of highlight things. But you can just get the code off the slide in case you ever need to cluster something or you need to color a dendrogram. So first you define the label colors. I'm going to define four groups in my plot. Why four? Just because I want to have four groups, right? So I'm just gonna take four colors. Then I'm going to say cluster members is: cut the tree into four clusters.
So you have cutree, I take my clustering object and I just say four, cut it into four groups, right? You can picture a line going down the plot, and as soon as that line splits the tree into four clusters, it uses that height to cut the tree. And then you have this dendrapply function, which you can apply to a dendrogram. This function will just walk through the entire tree, and for each node of the tree it will run the function which you provide. So together with the dendrogram I provide this color label function that I wrote. The color label function is a function which takes a node as input. So n is a node, so a point in the dendrogram. The first thing that I need to do is ask if the node that we're looking at is a leaf node, so are we at the bottom? Because this is a node as well, that is a node, that is a node. But in our case we want to color the leaf nodes, right? Because the leaf nodes are the ones that have the text. So if it is a leaf, get the attributes of this node, and then I want to know, using the cluster members, so the cutting that I did, in which of the four clusters this node falls. So I'm going to say cluster members, which names of cluster members is a$label. So I'm just asking the question: in which cluster does the label fall? Does it fall into the first, the second, the third or the fourth cluster? Then I'm going to use the label colors and select, based on the cluster, the color that I want, and I call this the label color. And then I have to set something which is called the node parameter, nodePar. So I'm going to say: the attribute of n, so the node that I got as input, set the node parameter. And what do I want to set the node parameter to? Well, I want to set it to what it was, because I don't want to change everything. I just want to change one thing, and I'm going to add a label color.
So lab.col, and this is going to be the color that I just got from my label colors. And then I close the if statement and then I return the node. I have to close the if statement first because I want to return every node, not just the leaves. I also want to return the intermediate ones. So I can then just use the dendrapply function, and it will take this function that I made and apply it to every node in the dendrogram. Then I have a clustered dendrogram, or whatever you want to call it, and I'm just going to plot this thing using the triangle type. And then it looks like this. So this already looks a lot cooler, because now you can see that it grouped the different groups together and gave them different colors. So that's a really simple way to go through and just color the dendrogram. All right, I'm just going to stop the recording and take a little break and contact Misha. I have 15, 16 slides left. So it might be that after we come back from the break I am still talking for like 10 minutes, but then we will switch to Misha and he can tell you all about cool, closed ecological systems and about plots and Arduinos and these kinds of things. So I will stop the recording.
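For reference, the dendrogram steps walked through before the break can be sketched as below. This is a reconstruction from the spoken description, not the slide itself: the function and variable names (colLab, clusMember, labelColors, clusDendro) and the four colors are assumptions, and the built-in mtcars is used as is, since all of its columns are already numeric.

```r
# Distance matrix between cars, then hierarchical clustering on it.
hc <- hclust(dist(mtcars))

# Standard dendrogram with all labels pulled down to the same level.
plot(hc, hang = -1)

# Convert to a dendrogram object to get more plotting options.
hcd <- as.dendrogram(hc)
plot(hcd, type = "triangle")

# Four groups, so four label colors (arbitrary choices).
labelColors <- c("firebrick", "darkgreen", "darkorange", "steelblue")

# Cut the tree into four clusters: a named vector, car name -> cluster number.
clusMember <- cutree(hc, 4)

# Function applied to every node: only leaves get a label color.
colLab <- function(n) {
  if (is.leaf(n)) {
    a <- attributes(n)
    # Which cluster does this leaf's label fall into?
    labCol <- labelColors[clusMember[which(names(clusMember) == a$label)]]
    # Keep the existing node parameters and add lab.col.
    attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
  }
  n  # return every node, leaves and intermediate nodes alike
}

# Walk the whole tree and plot the colored result.
clusDendro <- dendrapply(hcd, colLab)
plot(clusDendro, type = "triangle", main = "Car clusters")
```

Changing the 4 in cutree, together with the number of colors in labelColors, is all it takes to color a different number of groups.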