recording directly. So welcome everyone, also if you're watching this on Moodle; the recording is going. I think we should just jump in, because we have some things to discuss. First off, the assignments from last week: plots, plots, plots, making nice plots. After that I want to give you guys an example of what you can do, because up until now all of these topics have been more or less disjoint from each other, and the assignments are mostly about their own lecture. So I thought it would be good to show you what you can do with what you have learned already, which is a lot. I took some things from different lectures, and I want to show you how you can visualize the COVID-19 pandemic on a world map. That should be something you are able to do by now, more or less, with the tools that we've been discussing. All right, so let's switch to Notepad++ and let me show you guys my answers. Is it big enough? Can you read this? I've been fiddling a little bit with the setup because I had a presentation yesterday for some guys in America, so it might be that the font size has changed and that it's not readable. If you can't see it, just let me know. I actually didn't even open up an R window yet, so I didn't scale that one. Let me do that now: get my R window, get my Notepad++, zoom in a little bit. One more zoom level, and then I will take the R window and fit it to where it should be. All right, so let's go back. In the assignments, we first had to read in three different files. It's a little bit bigger now, so I think everyone should be able to see it. Of course, all my scripts start off the same way, so this is the script for lecture number five. Someone asks in chat: "Could you upload the PPT?" Yeah, I forgot that.
Thank you for reminding me. Let me directly log into Moodle and give you guys the PDF so that you can write along on the side. Data analysis using R, turn editing on, get myself a window, documents. No, this is PowerPoint; I want the R course and then the PDFs. I'm just going to throw it on Moodle without a title or a header for now, so it should be all the way at the bottom: there should be something called lecture six, statistical testing. All right, with that out of the way: loading in the three files. I hope everyone was able to load them. I had one question during the Tuesday question round from someone who couldn't load in the map. That was because they were using the read.csv functions and specifying all of the column types, which didn't work because the map is a little bit strange. Let me just load it in for you guys, so let's move to R. When you load in the map, you can see that it has four columns but only three headers. That's of course because the first column contains the row names. So when you want to specify colClasses, you have to specify four column classes. The first column class is character, because it contains the names, but it doesn't have its own header; there's no ID or something like that above it. In the file, the first line has three fields separated by tabs (chr, mb, cM, then a newline), and the next line contains four elements. That messes things up when you try to load it with an explicit list of three column classes. But this time I made the files in such a way that you can just use read.table. The only thing you have to set is check.names to FALSE for the genotypes, because the genotypes have, I think, numerical names, so the column names do not start with a letter. All right, so then the first assignment, load in the different data sets. So we did that.
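To make the row-name behaviour concrete, here is a minimal sketch that does not use the course files. It writes two tiny stand-in files to a temporary location; the marker names, column names (chr, mb, cM) and values are made up for illustration.

```r
# Stand-in for map.txt: 3 header fields, but 4 fields in every data row.
mapfile <- tempfile(fileext = ".txt")
writeLines(c("chr\tmb\tcM",
             "rs1\t1\t3.2\t1.5",
             "rs2\tX\t10.4\t5.0"), mapfile)

# Because the header has one field fewer than the data rows, read.table()
# uses the first column as row names. colClasses is given per column in
# the file, so it needs four entries, the first one for the row names.
map <- read.table(mapfile, sep = "\t", header = TRUE,
                  colClasses = c("character", "character", "numeric", "numeric"))

# Stand-in for genotypes.txt: the individuals have numeric IDs as column names.
genofile <- tempfile(fileext = ".txt")
writeLines(c("1001\t1002",
             "rs1\tA\tH",
             "rs2\tB\tA"), genofile)

# check.names = FALSE keeps "1001" as is; the default would turn it into "X1001".
genotypes <- read.table(genofile, sep = "\t", header = TRUE,
                        check.names = FALSE, colClasses = "character")
```

Printing map and genotypes afterwards shows the marker names as row names in both tables.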
Okay, and now some basic curves. So let's analyze some data, starting with the phenotypes. The columns named d plus a number hold the weight measured on that day, and the column wg2 is the litter size after two days. So the assignment explains what the data is, how you can load it in and what you can do with it. The first thing we wanted to do is create a global growth curve showing the mean weight for all animals combined. Try out different plotting types and select the one that you like best, because plots are a personal thing. I can tell you how to make plots, but in the end you have to use them, and you have to say: this looks good to me and I want to use this. The way that I did this is I first define a variable called days, which contains the names of the columns that hold measurements, because there are also columns in the phenotypes that are not weights. This gives me an easy way to say: phenotypes, and then select the columns in days. Days contains the different measurement points, from 21 days to 70 days. I take that subset of the phenotype matrix and store it in a new variable called only weights, which just has the weights in there. Then I use the apply function, because apply allows us to do something for each row or each column. What we're saying here is: for the only weights matrix that we just defined, go through the columns and calculate the mean. This calculates the global mean over the population at each time point. Then I say type is "l" because I want a line, I put weight in grams on the y-axis, and the x label is time in days. I tell it not to plot an x-axis because I will do that myself, and I set las equal to 2 so that the axis labels are drawn perpendicular to the axis, which makes the numbers on the y-axis read horizontally. So when we go to R, it makes a plot which kind of looks like this, right?
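As a rough sketch of the global growth curve just described: the data here are simulated, not the course phenotypes, and the variable name onlyweights and the simulated numbers are assumptions made for illustration. The axis() call at the end is the custom x-axis that replaces the one suppressed with xaxt = "n".

```r
if (!interactive()) pdf(NULL)  # when run as a script, draw to a null device

set.seed(1)
dayn <- seq(21, 70, by = 7)    # measured every week: 21, 28, ..., 70
days <- paste0("d", dayn)      # column names d21 ... d70

# Simulated phenotypes: a litter size column plus one weight column per day.
phenotypes <- data.frame(wg2 = sample(6:14, 20, replace = TRUE))
for (d in dayn) {
  phenotypes[[paste0("d", d)]] <- rnorm(20, mean = 10 + 0.6 * (d - 21), sd = 2)
}

onlyweights <- phenotypes[, days]      # keep only the weight columns
means <- apply(onlyweights, 2, mean)   # MARGIN = 2: one mean per column

plot(means, type = "l", xlab = "Time (days)", ylab = "Weight (g)",
     xaxt = "n", las = 2)
axis(1, at = seq_along(dayn), labels = dayn)  # positions 1..8 labelled 21..70
```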
So it's not a very pretty plot, but at least it shows you that mice start off at 21 days being around 10 grams, and at the end of the experiment they are around 40 grams on average, which are pretty big mice. So you can see the plot and you can learn something from it. Of course I want to add my own x-axis, so I call axis for axis number one. Because I plotted only weights, the first column gets plotted at one, the second column at two and so on, so there are eight different positions. But I want to have the days on the x-axis, so I say: at 1 to 8, put the same days that I just selected, so 21 to 70 by 7, because we measure them every week. Let's copy-paste this into R, and then it shows 21 days up until 70 days, although internally in R this is one and this is eight. So we just put our own x-axis on top of the standard coordinates in R. This was the first assignment, creating an overall weight graph. All right, the next thing we want to do in 2b is add things like the mean and the standard deviation. I'm doing the same thing as before and now storing it in a new variable called means. I get the standard deviations by applying the function sd to the columns of only weights, and I save them as sds. Now I need to set up my plot a little bit differently. Normally, if I would just plot the means, R would pick the y-range from the means, so the maximum mean would be the top of the plot. But if we want to add the standard deviation, then at the last time point the mean plus the standard deviation would fall outside the plot. To prevent that, I'm setting up my own plot window. My x range is from one to the number of means that I have, so the length of means, and the y range goes from zero grams to the maximum of the means plus the standard deviations, right?
So that it all fits within the plot. The x label and the y label are going to be the same as before. Don't plot the x-axis, because I want to overlay the time points there. And here I say type equals "n", because I don't want these two points plotted. If I would not say this, there would be a point at (1, 0) and another point at the length of means, so at 8, versus the maximum value. Again las equals 2 to rotate the axis labels. And then I'm going to draw three times: the means using a line, the means plus the standard deviations using a line, and the means minus the standard deviations using a line. So let's go to R and show you guys how that looks. In the middle we see the mean, and around it we see the standard deviation. We see that on average the mice are still 10 grams when they're 21 days old, and that there's some variance. At later time points the variance becomes bigger, which is logical: some mice grow much heavier than other mice, so the longer you wait, the bigger the difference in body weight between different mice. So this was the first type of plot in the assignments. The second type of plot is a little bit harder, and there's actually a sneaky way of getting these error bars. Let me show you how I did this. I use type is "h". Type "h" draws histogram-like vertical lines: instead of making a line that goes through the data points, it draws a vertical line from zero up to the first data point, then at the next x position from zero up to the second data point, and so on. So it just draws one vertical bar per point. I do the same setup again, so I create an empty plot. Then I plot my mean in the middle, right? So this is the average.
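The mean plus and minus standard deviation plot from 2b, as described in this passage, could look roughly like the sketch below. The weights are simulated, and the matrix name w and the dashed line style are assumptions of this example, not taken from the lecture script.

```r
if (!interactive()) pdf(NULL)  # when run as a script, draw to a null device

set.seed(2)
dayn <- seq(21, 70, by = 7)
# Simulated weights: 20 animals (rows) by 8 time points (columns);
# the spread grows with age, as in the lecture plot.
w <- sapply(dayn, function(d) rnorm(20, mean = 10 + 0.6 * (d - 21),
                                    sd = 1 + (d - 21) / 15))
colnames(w) <- paste0("d", dayn)

means <- apply(w, 2, mean)
sds   <- apply(w, 2, sd)

# Empty plot window: type = "n" sets up the axes without drawing the two
# corner points. The y range is chosen so that mean + sd fits.
plot(c(1, length(means)), c(0, max(means + sds)), type = "n",
     xlab = "Time (days)", ylab = "Weight (g)", xaxt = "n", las = 2)
lines(means, lwd = 2)          # the mean in the middle
lines(means + sds, lty = 2)    # upper band
lines(means - sds, lty = 2)    # lower band
axis(1, at = seq_along(dayn), labels = dayn)
```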
So this is just a standard line like I normally do. Now I plot the mean plus the standard deviation in the "h" way, and then I do the same thing for the mean minus the standard deviation, but I make it white so that it paints over the lower part. Let's go through this step by step in R. First we set up an empty plot window like this. Let me move this one out of the way a little bit. So here we have the empty plot window. Then I add the first line, which is just the means. Then I add the means plus the standard deviation, which will look like this. And now I blank out the lower part by saying: everything from zero to the mean minus the standard deviation, put a white line on top of the black line. That will then make it look like this, and ta-da, we have our plot with the standard deviations drawn the other way. Of course, we could still add the other plot on top to make it a little bit nicer, to show something like this. So we can do a lot of things and be flexible with plotting. Is it clear that you can use this trick, where you ask for the vertical "h" lines and then put white vertical lines over them? You have a black line from zero to the mean plus the standard deviation, and then a white line from zero to the mean minus the standard deviation, which hides the lower part of the black line behind it. I hope everyone was able to figure that out. There are other ways of doing this; you can use lines and other functions. But I think it's a nice little trick, because R graphics follow a painter's model: you can draw something and then take another paintbrush and paint over it. So remember that this is always an option. If you don't know exactly how to do something, you can just say, well, I know how to do this.
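A minimal sketch of the overpainting trick, again on simulated weights. The matrix name w is an assumption, and the slightly thicker white line is a choice made in this sketch so that the black bar underneath is fully covered.

```r
if (!interactive()) pdf(NULL)  # when run as a script, draw to a null device

set.seed(2)
dayn <- seq(21, 70, by = 7)
w <- sapply(dayn, function(d) rnorm(20, mean = 10 + 0.6 * (d - 21),
                                    sd = 1 + (d - 21) / 15))
means <- apply(w, 2, mean)
sds   <- apply(w, 2, sd)

plot(c(1, length(means)), c(0, max(means + sds)), type = "n",
     xlab = "Time (days)", ylab = "Weight (g)", xaxt = "n", las = 2)
lines(means)                                   # the mean curve
# type = "h": one vertical bar per point, from 0 up to mean + sd
lines(means + sds, type = "h")
# paint over the part from 0 up to mean - sd in white, leaving an error bar
lines(means - sds, type = "h", col = "white", lwd = 3)
axis(1, at = seq_along(dayn), labels = dayn)
```

What stays visible of each black bar is the segment between mean minus sd and mean plus sd.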
And then I'm just going to paint it out and pretend that I didn't plot the other things. All right, so the next assignment is assignment number three. We're now going to split up into different litter size (Wurfgröße) groups. That means that, based on the number of pups that a mother had, we want to split the animals into different groups. The assumption is that if you come from a very small family of mice, there's more than enough milk for you, so you can grow a lot quicker than if you're from a very large mouse family. If you have to compete with 10 brothers and sisters to get milk, you're not going to have a lot of energy to put into your body weight. If you only have two brothers and sisters, there's not that much competition and there's milk aplenty for you, so you can get much fatter. So: create a multiplot that shows the growth curve per group, either in the style of 2b or in the style of 2c. You're free to choose whatever you want. I don't remember what I chose. Let me see. I think I just made a little function called plot as 2c. Wait, I'm missing something. Why not the medians? Oh, right, sorry, I'm skipping one assignment. Create a new line plot that shows the median at each measurement date and, in light gray, all the measurements for the different individuals. That is 2c, and I totally forgot to show it. So again the same thing: I take my only weights, the subset which only has the weights in there, and across the columns I now apply the median and store the result. Let's go to R and show you guys. Now we have the medians at each time point. It's not that different from the mean, as long as you have enough measurements. And now the trick is that we can basically use a standard for loop.
So from here to here we say: go through all of the rows of only weights, because every row contains a single individual, and plot that individual. We set up the plot with x from one to the length of medians and y from c(0, max of only weights), and then we add lines. Wait, this doesn't seem right; this plot call should be outside the loop. So we first make an empty plot, then we go through each of the rows and add the lines, and after we've drawn the lines for each individual, we overplot the median on top using a thicker black line. So we calculate the medians, go to R, and make an empty plot like we did before. Now I go through all of the rows of the matrix and plot all of the lines, so here you see the growth curves for each individual. After plotting all of them, I put the medians on top, and this is then the median growth curve. You can see that some mice get up to 60 grams. Some mice more or less get sick halfway through the experiment and start losing weight, and the smallest mouse ends up at around 20 grams. And of course we want to plot the x-axis as well, so we want to have the days at the bottom. So here we visualize every individual using a thin gray line, and then we plot the population median on top using a black line. That was question 2c. All right, so now we're at question 3a, where I was talking about the difference between mice that come from a small family and mice that come from a big family. The thing is, if you want to make multiple plots, it's always smart to take a plot that already worked and make a function around it. So the way that I did that is I just took this plot, right?
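For reference, the 2c plot (every individual in light gray, the median on top) can be sketched as follows. The data are simulated, and the matrix name w is an assumption of this sketch.

```r
if (!interactive()) pdf(NULL)  # when run as a script, draw to a null device

set.seed(3)
dayn <- seq(21, 70, by = 7)
# Simulated weights: one row per individual, one column per time point.
w <- sapply(dayn, function(d) rnorm(20, mean = 10 + 0.6 * (d - 21),
                                    sd = 1 + (d - 21) / 15))
medians <- apply(w, 2, median)

# The empty plot is created once, outside the loop.
plot(c(1, length(medians)), c(0, max(w)), type = "n",
     xlab = "Time (days)", ylab = "Weight (g)", xaxt = "n", las = 2)
for (i in 1:nrow(w)) {
  lines(1:ncol(w), w[i, ], col = "lightgray")  # one gray curve per animal
}
lines(medians, lwd = 3)                        # population median on top
axis(1, at = seq_along(dayn), labels = dayn)
```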
So that's the plot from 2c, going through all of the individuals. It's the exact same code, but instead of having the code work on the variable called only weights, I define a function called plot as 2c, which takes a weight subset. Every occurrence of only weights in the 2c code I now replace by weight subset, so that's the subset of animals that we have selected. And then I have a function. Now I can easily take the only weights where the phenotype wg2, the litter size at two days, is smaller than 10, and then the same thing where wg2 is larger than or equal to 10. So I take the subset that I already had and divide it into two: individuals which come from a small family and individuals which come from a big family. I'm using par to set up the plotting parameters, so two plots next to each other. Then I call plot as 2c on WGS, the small litter sizes, and plot as 2c on WGL, the large litter sizes, so the large families. All right, let me show you guys how this looks in R. It uses the same plotting schema, so we just plot the different individuals. Let's make it a little bit bigger. It's the same plotting code, but now split out into two groups. The only thing I forgot in this plot is a label, so let's add one. I give the function another argument, main, which is the title of the plot. Inside the function, in the plot call, I can now say main equals the main that was passed in, and when I call the function I can say "small" and "large", just so that from the plot I can see which one is which without having to look at the code. So let's go into R and replot it. And now we can indeed see it: in the left panel are the individuals coming from a small family.
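Wrapping the 2c plot in a function and calling it once per group might look like the sketch below. The function name plotAs2c, the argument name weightsubset and the group variables wgs and wgl follow the spoken names but their exact spelling is an assumption; the data are simulated with a built-in litter size effect.

```r
if (!interactive()) pdf(NULL)  # when run as a script, draw to a null device

set.seed(4)
dayn <- seq(21, 70, by = 7)
wg2 <- rep(c(6, 12), each = 10)   # simulated litter sizes: 10 small, 10 large
w <- sapply(dayn, function(d) rnorm(20, mean = 10 + 0.6 * (d - 21) -
                                      2 * (wg2 >= 10), sd = 2))

# Same code as the 2c plot, but working on whatever subset is passed in.
plotAs2c <- function(weightsubset, main = "") {
  medians <- apply(weightsubset, 2, median)
  plot(c(1, length(medians)), c(0, max(weightsubset)), type = "n",
       main = main, xlab = "Time (days)", ylab = "Weight (g)",
       xaxt = "n", las = 2)
  for (i in 1:nrow(weightsubset)) {
    lines(1:ncol(weightsubset), weightsubset[i, ], col = "lightgray")
  }
  lines(medians, lwd = 3)
  axis(1, at = seq_along(medians), labels = dayn)
}

wgs <- w[wg2 < 10, ]    # small litters
wgl <- w[wg2 >= 10, ]   # large litters

op <- par(mfrow = c(1, 2))   # two plots next to each other
plotAs2c(wgs, main = "small")
plotAs2c(wgl, main = "large")
par(op)                      # restore the previous layout
```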
And in the right panel are the individuals coming from a large family. We can more or less directly see that the individuals start off being relatively similar, although the small-family animals are slightly above 10 grams and the large-family animals are more or less at 10 grams. But when they are more or less fully grown at 70 days, you see that there's a big difference: the animals that come from small families end up at around 40 grams, while the others end up at around 35 grams. So the number of brothers and sisters you have has a big influence on the body weight at the end of the experiment. We can also see that the biggest mouse from the large family group is around 52 grams, while the biggest mouse from the small family group ends up at 60 grams. So we can just observe that, and it teaches us something about the influence that the number of brothers and sisters has on your body weight at 70 days. All right. In 3b, we want to create four times two notched box plots, one for each time point, showing the combined data of the individuals in the two groups, with the groups labelled correctly. What do we learn about the effect of litter size on body weight? From the previous plot we have the hypothesis that individuals from large families stay smaller. But is this really statistically significant? For that, we can use the boxplot function and notch it. If the notches of two boxes overlap, we cannot say that the medians differ. If the notches do not overlap, that is strong evidence that the medians are different. So in 3b I'm again setting up the layout, this time two times four plots, and then I say for day in days. This is nice because I already defined my days variable all the way at the top, so I can just go through each of the days independently.
And then I just say: make a box plot where you take the small litter sizes on this day and compare them to the large litter sizes on the same day. For the main title, I take the day variable and use gsub to replace the "d" by "day" so that it looks a little bit better. I notch them so that we have these nice notches on the side. Then I put litter size on the x-axis and weight in grams on the y-axis, just so that I know what I'm looking at. And I plot the x-axis myself, because I have to show on the plot that at one we have the small litter sizes, so the small families, and at two we have the large families. So here we have small and large. How does this look? Well, let's copy-paste this into R, and this is more or less how it looks. You see at day 21 that the individuals coming from larger families are indeed already significantly smaller than the ones which come from small families. Although when you look at the range, you can see that the range for the small families is much bigger: the smallest animal at this time point was actually coming from a small family and not from a large family. But you can see that across all of the time points there is a significant difference. That is because at no point in any of these plots do the two notches touch each other; they're always separated, which means that there is a very significant influence of the size of your family on your eventual body weight, and already on your starting body weight at 21 days. All right, next is the image. Any questions so far? If you have any remarks, just throw them in chat. If you had a different strategy or answered it in a different way, then just tell me.
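The notched box plots from 3b could be sketched like this, on simulated data with a built-in litter size effect. The names w, wgs, wgl and titles are assumptions of this sketch; titles only collects the panel titles so they can be inspected afterwards.

```r
if (!interactive()) pdf(NULL)  # when run as a script, draw to a null device

set.seed(5)
dayn <- seq(21, 70, by = 7)
days <- paste0("d", dayn)
wg2 <- rep(c(6, 12), each = 15)   # simulated litter sizes
w <- sapply(dayn, function(d) rnorm(30, mean = 10 + 0.6 * (d - 21) -
                                      3 * (wg2 >= 10), sd = 2))
colnames(w) <- days
wgs <- w[wg2 < 10, ]    # small litters
wgl <- w[wg2 >= 10, ]   # large litters

op <- par(mfrow = c(2, 4))        # 2 rows x 4 columns, one panel per day
titles <- character(0)
for (day in days) {
  main <- gsub("d", "day ", day)  # "d21" becomes "day 21"
  boxplot(wgs[, day], wgl[, day], notch = TRUE, main = main,
          xlab = "Litter size", ylab = "Weight (g)", xaxt = "n")
  axis(1, at = 1:2, labels = c("small", "large"))  # label the two groups
  titles <- c(titles, main)
}
par(op)
```

With few animals per group, R may warn that some notches went outside the hinges; the plot is still drawn.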
And although Twitch chat is not that good for code, if you want to send me code to say "I did it this way", then just put it in my mailbox. All right, so question number four: use the data from genotypes.txt, and first use the data from map.txt to subset the genotypes for a single chromosome. After you've created the subset, e.g. for chromosome one, create an image for the chromosome showing the different SNPs in the order in which they occur in the map. So this means: sort the map on base pair location. Let's go through it. For 4a, I am going to use the sort function. I say: from the map, take the position column, which is mb, the megabase position based on this genome build, and sort it. But I'm not that interested in getting the sorted values back; I'm interested in the ordering. So I set index.return to TRUE, and this gives me back the indexes. Let me show you what I mean in R. If we look at the first 10 elements of the map, we can see that it is completely unordered: the base pair positions are not sorted, and even the chromosomes are not sorted. I can use the sort function on the position column of the map, but when I use it directly like this, it just gives me the sorted positions. I'm not really interested in the sorted positions; I want to know how I should reorder the rows. So I add index.return equals TRUE, and now I get a list with two elements. The first one is still the same as before: x, the sorted values. And I also get an element called ix, which tells me which ordering I need to use. So it tells me that the smallest element in this column was at position 414. The second smallest element was at 1208, and the next one was at 2819.
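A small illustration of what sort() returns with index.return = TRUE, using four made-up positions instead of the real map:

```r
# Made-up positions, deliberately out of order.
mb <- c(30.5, 2.1, 17.8, 9.9)

ordering <- sort(mb, index.return = TRUE)
ordering$x    # the sorted values: 2.1 9.9 17.8 30.5
ordering$ix   # where each sorted value came from: 2 4 3 1

# Using the indexes to reorder gives the sorted vector back.
reordered <- mb[ordering$ix]
```

For a plain numeric vector like this one, order(mb) gives the same indexes as ordering$ix, so it is a common alternative when only the ordering is needed.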
So ix holds the indexes, and I can just use the indexes to reorder. When I store this in a variable called ordering, then ordering has two elements: x, which is the ordered values, and ix, which contains the indexes. Now I can say: take the map, use the ordering that we just figured out as the row index, and show the result. When I let it scroll, you can see that now at least the positions are ordered from lowest to highest, so it's just an increasing order. Of course, the chromosomes have not been ordered, so I still have to order the chromosomes. The problem here is that you can't really use sort for the chromosomes, because chromosomes are not numerical values: you can have chromosome X, chromosome Y, and MT for the mitochondrial chromosome. The way I generally get around that is: I first reorder the map based on the base pair positions. Then I create a variable called new map, go through the chromosomes, pull out one chromosome from the map, and rbind it to the new map. Because everything is already ordered by position, pulling out chromosome one gives a matrix for chromosome one which is also ordered. Then I take these markers, put them on the new map, and go to the next chromosome. In this way, I can use a vector to determine the exact chromosome ordering that I want. The default ordering for character values goes 1, 10, 11, 12, all the way up to 19, and only then 2, 20 and so on. That's not the ordering we want when we order things by chromosome. So let's just order the map, right?
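The two-step ordering (first by position, then chromosome by chromosome) can be sketched on a toy map. The marker names, column names and positions are invented; the chromosome vector assumes a mouse-like set of 19 autosomes plus X, Y and MT.

```r
# Toy map: rows are markers, deliberately unordered.
map <- data.frame(chr = c("2", "X", "1", "10", "1", "2"),
                  mb  = c(5.0, 3.3, 8.1, 1.2, 0.4, 2.2),
                  row.names = paste0("m", 1:6),
                  stringsAsFactors = FALSE)

# Step 1: order all markers by position.
ordering <- sort(map$mb, index.return = TRUE)
map <- map[ordering$ix, ]

# Step 2: walk through the chromosomes in the order we want and stack
# the per-chromosome pieces; each piece is already sorted by position.
chromosomes <- c(1:19, "X", "Y", "MT")
newmap <- NULL
for (chr in chromosomes) {
  newmap <- rbind(newmap, map[map$chr == chr, ])
}
map <- newmap   # overwrite the old map with the ordered one
```

After this, chromosome 10 comes after chromosome 2 rather than directly after chromosome 1, which is what a plain character sort would have produced.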
So we go through the individual chromosomes, take them out and then recombine them. Then I say that the new map, the one that we just created, overwrites the old map. Now when we look at the first 50 elements of map, it looks like this: we have chromosome one, and the markers are in the correct order, from smallest to largest position. All right, so then we have to do all of the other steps, because we now want to make an image. The first thing I have to do is go back to one plot, because R is still set up to create two by four plots instead of one by one. Then I go through all of the chromosomes. I define on chromosome: I ask which entries of the chromosome column of the map matrix are equal to the current chromosome, subset on that, and ask for the row names. So on chromosome will contain the names of the markers which are on, for example, chromosome one. Then I take my genotypes, and because I have the names of the markers on chromosome one, I can use those names to subset my genotype matrix as well. Here again I have to do a little trick, because I can't plot character values. Let me go to R. When we look at the genotypes, they look like this: the markers are in the rows and the individuals are in the columns. And of course we can't plot this, because R doesn't know how to plot an A, a B, or an H. So we need to convert this from A, H and B coding into zero, one, two coding or one, two, three coding, because we can plot numbers but we can't plot characters. The way to do this is kind of standard; I use this a lot. So I say take, so across this sub matrix, right?
So across the markers on this chromosome, apply to the columns a function of x. In that function I first take my character values and make them into a factor, and then I convert the factor to numeric. This works because there are only three possible states at each marker: you're either A, H, or B. So going from a character, we can go to a factor, and from a factor we can then make a numeric value. But we have to go through the factor, because a character like "A" cannot be converted directly to a numeric value. In the process of converting to numeric, I lose the row names, so I put the row names back. (A short pause here to help a viewer who could not see the chat; that is solved now.) So yes, I'm losing my row names when I do this transformation, and I just put them back. Then I use the image function and say: take this thing as a matrix, and I'm just going to plot 20 by 20, because otherwise the image gets really big. And this is a for loop, so it will plot all of the chromosomes one after another. To show you guys how this works, I'm going to do it for chromosome number one only, because that's the one that we want. First I have to switch back to one plot, because we have eight panels and we only want one. And then what am I going to do? Well, I'm going to compute on chromosome, so I ask which markers are on this chromosome, and I get the names of those markers. And I can use this to then subset my genotypes, right?
So I say: genotypes, subset by on chromosome, so this is just the markers that are on chromosome one. Then I use the apply function to make them numeric. When we look at the first 10 by 10 of the genotypes on this chromosome, it looks like this. And when we now run the conversion (that's not the best variable name ever), you see that it has converted them: the A is converted into a one, the B into a two, and the H into a three, because factor levels are assigned in alphabetical order. If this is the coding that you want, then that's all fine. You could change the coding as well, because you can set the levels that you are expecting. But this is perfectly fine, because the values are now numerical, so we can use the image function to plot them. Of course, you see here that we lost the row names, so we have to put the row names back on this thing. Now when we look, we see that we have the row names back, which is really good. And now we can use the image function to plot it, and it looks like this. So this is chromosome one: the individuals are on the y-axis and the markers are on the x-axis. We could actually plot the whole thing. I don't know exactly how many markers and how many individuals there are, probably a lot. You can see that there's a bunch of individuals and not so many markers. And then what do we want to do? Well, we want to plot the names on it as well. So here, in very, very small print, we see the names of the markers, and here we see the names of the different individuals. This is way too small to read here. But when you write it out as a PNG, you can just make a very big PNG: say you want one with 3000 pixels in the horizontal direction and 2000 pixels in the vertical, and then it would be readable again.
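A minimal sketch of the A/H/B conversion and the image call, on a made-up 3-marker by 4-individual genotype matrix. The explicit levels argument is a choice made in this sketch: it keeps the coding A = 1, B = 2, H = 3 stable even for an individual in which one of the three states does not occur, which the default levels would not guarantee.

```r
if (!interactive()) pdf(NULL)  # when run as a script, draw to a null device

# Made-up genotypes: markers in rows, individuals (numeric IDs) in columns.
geno <- matrix(c("A", "H", "B",
                 "B", "A", "H",
                 "H", "H", "A",
                 "B", "B", "A"), nrow = 3,
               dimnames = list(c("m1", "m2", "m3"),
                               c("1001", "1002", "1003", "1004")))

# character -> factor -> numeric, one column (individual) at a time
numgeno <- apply(geno, 2, function(x) {
  as.numeric(factor(x, levels = c("A", "B", "H")))
})
lostnames <- is.null(rownames(numgeno))  # apply() dropped the marker names
rownames(numgeno) <- rownames(geno)      # so put them back

# image() puts matrix rows (markers) on x and columns (individuals) on y.
image(1:nrow(numgeno), 1:ncol(numgeno), numgeno,
      xaxt = "n", yaxt = "n", xlab = "Marker", ylab = "Individual")
axis(1, at = 1:nrow(numgeno), labels = rownames(numgeno), las = 2)
axis(2, at = 1:ncol(numgeno), labels = colnames(numgeno), las = 2)
```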
So that's the way that it works. In the answers, I already have a png call and a dev.off here, commented out. In case I want to save the plot to a PNG file, I say png and then the file name, like plot.png or whatever I want to call it. Then I uncomment this, give the margins a little bit of extra room, and at the end say dev.off. That closes the device and writes the PNG. If I want to make it bigger, I could say the width is something like 10,000 pixels, and it would make a very, very wide plot. There are some issues with that, but that's the way it works. All right, so that's more or less assignment 4: open a plotting device, PNG or PDF, and these kinds of things. And now the chromosome plot. I think this was the hardest one, although the code was more or less given in the lecture. From the data, create a chromosome plot showing the locations of the markers on the chromosomes, using the code that was given during the lecture. It's more or less the exact same thing as what we had before. What am I doing? I'm loading the chromosomes, so which chromosomes do we have? Then I'm making my own chromosome info structure, because I need to know how long every chromosome is. I say: make a matrix and fill it with NAs. The number of rows is the number of chromosomes, and there's only one column. The row names are the chromosomes, and the column is called length. So I'm making a very small chromosome info structure. Then I go through all of the chromosomes and put the maximum marker position into this little matrix, because for every chromosome I need to know how long it is if I want to plot it. Then I define a variable called m length, and this is the length of the longest chromosome.
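The chromosome info structure could be built as in the sketch below. The toy map, its column name pos (positions in base pairs) and the names chrinfo and mlength are assumptions of this example; the lengths are chosen to echo the numbers mentioned in the lecture.

```r
# Toy map with positions in base pairs.
map <- data.frame(chr = c("1", "1", "2", "X", "X"),
                  pos = c(5e6, 194e6, 180e6, 20e6, 153e6),
                  stringsAsFactors = FALSE)
chromosomes <- c("1", "2", "X")

# One row per chromosome, a single column "Length", filled with NA first.
chrinfo <- matrix(NA, nrow = length(chromosomes), ncol = 1,
                  dimnames = list(chromosomes, "Length"))

# The position of the last marker is used as the chromosome length.
for (chr in chromosomes) {
  chrinfo[chr, "Length"] <- max(map[map$chr == chr, "pos"])
}

mlength <- max(chrinfo[, "Length"])   # length of the longest chromosome
```

Note that this takes the last marker position as the length, so it slightly underestimates the true chromosome length if no marker sits at the very end.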
Because I need to know how long every chromosome is, but I also need to know how long the longest chromosome is, because I need to set up my plot window. And the plot window is going to range on the y axis from zero to the maximum length, and on the x axis, of course, from one to the number of rows in chromosome info, so the number of chromosomes that we have. Don't plot anything. Don't make an x axis. Don't make a y axis. And don't put any labels there. Just make a completely empty plot.
All right. So how does this look in R? When we do this, I can show you my chromosome info structure as well. And this is nothing more than: chromosome one has a length of 194 million, while chromosome X is 153 million. So that's just what it does. And it also computes mlength. And mlength is of course the length of chromosome one, because by definition chromosome one is the longest chromosome. All right, then we have this empty plot. And now I can just switch you guys back to here.
So now I have my empty plot, right. And now I can just use the abline function, say that I want to have horizontal lines, which go from zero to the maximum length, and make them blue and dotted. So I'm just going to add some lines every 10 megabases, which is 1e7; 1e6 is a million, so this is 10 million. So every 10 million, I'm going to make a line and add it to the plot. So here I have my really nice blue dotted lines. And now I can kind of see where I am on the plot. And then I'm just going to plot all of the different chromosomes. And how am I going to plot the chromosomes? Well, I'm just going to define a variable called count. And again, this is because chromosomes can be characters. So they can be X, they can be Y, they can be MT. And of course, if I want to plot something at x position MT, that is not going to work in R. So I'm just going to say, well, I'm going to apply to my chromosome info, comma one.
And of course, you can do this with a for loop as well, right? You can say for x in one to the number of rows of chromosome info, but I'm just using the apply function because it's a little bit quicker. So I'm going to apply to this chromosome info structure, through the rows, a function, and x will contain the row. And then I'm going to say, well, give me a lines. The x position is count, count, right? So at one, one. Take the length and make the line go from zero to the length in x, type is line, make it black, use a solid line, and use a line width of two. And then what do I have to do? Well, I have to update my count. And you see here that this is the first time that I'm using this double arrow, right, to assign outside of the function. Because every time that this loop runs, I need to go one further into the plot. So this is the only time in my entire programming career of like 16 years that I use the double arrow, so where I assign outside of a function. And this is not because I'm lazy, this is just because I need to update this count variable, because otherwise I don't know where in the plot I am.
So I'm just going to copy paste it in and show you guys how it works. So we're going to just go and plot. So the x position is one, one, and the line goes from zero to the length of chromosome one. So here we see the different chromosomes. Of course, this plot is not done yet. What do we still have to do? Well, we still have to add an axis, and we have to add the markers. So I'm going to use the same structure again here for my chromosome plot. So I'm going to say apply to every row of the map a function of x. So what am I going to do? Well, I'm going to match the chromosome that I'm currently on to the chromosomes, right, this chromosomes structure, which just contains one, two, three, four, five, six, and so on, then X, Y, and MT. And I'm going to match them.
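The step that draws one vertical line per chromosome, with `count` updated through the double arrow, could look roughly like this. It is a self-contained sketch: the three chromosome lengths in `chr.info` are made-up stand-ins for the values computed from the real map.

```r
# Made-up chromosome lengths standing in for the real chromosome info matrix
chr.info <- matrix(c(194e6, 180e6, 153e6), 3, 1,
                   dimnames = list(c("1", "2", "X"), "Length"))
mlength <- max(chr.info[, "Length"])

# Completely empty plot: no points, no axes, no labels
plot(c(1, nrow(chr.info)), c(0, mlength), type = "n",
     xaxt = "n", yaxt = "n", xlab = "", ylab = "")

# Dotted blue helper lines every 10 megabases (1e7 base pairs)
abline(h = seq(0, mlength, 1e7), col = "blue", lty = 3)

# Chromosome names such as "X" cannot be used as x positions,
# so a counter keeps track of the current column in the plot
count <- 1
invisible(apply(chr.info, 1, function(x) {
  # x holds the single Length value of the current row
  lines(x = c(count, count), y = c(0, x[1]),
        type = "l", col = "black", lty = 1, lwd = 2)
  count <<- count + 1  # assign outside the function so the next call moves on
}))
```

A plain `for` loop over `1:nrow(chr.info)` would avoid the `<<-` entirely, since the loop index can serve directly as the x position.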
And match will give me the numeric position of the current chromosome in the list of chromosomes. So if I'm looking at X, it will know that, oh, if X is matched to this vector, then it is 20, because chromosome X is at position 20. The y location, so y loc, the height where I want to plot my straight line, is of course given by the position of the marker, which is the megabase position in x. And then I'm just going to use points. I'm going to say at x comma y, use a straight line as the symbol, make it blue, and make it twice as big as normal. And then, of course, when we do this, we will start plotting all of the different markers on the map. And that looks like this. Right, so now we can see that, oh, on chromosome one we have a little gap here. We don't know from where to where it is, so we have to add the axes. So I'm just using the axis function to put the axes on the one side and on the other side. It just doesn't fit entirely. But if I look at it like this, now we can see that on our genetic map, we don't have any markers from like 22 megabases all the way to 40 megabases. So this area is an area that we have no information on. But for example, this area we have a lot of information on, because there are a lot of markers which are very close together. All right, so this is how you do the chromosome plot. I hope everyone was able to do it, or at least tried to do it.
I didn't get as many questions as I had expected, but I got one question that I actually want to talk to you guys about. And that is using the par function. So we know that when I use par with mfrow, right, I can set the number of plots that I have, for example 2 comma 1, and then I just do plot 1 to 10, and another. So now I have two plots on top of each other. If you use an output device, you have to set the parameters for the output device, and they have to be set inside of the output device. So let me show you how that is done.
So if I would do something like png, right, my png dot png, and I would do a plot 1 to 10, and then I do another plot 1 to 10. And I now want to have these things on top of each other, right? Of course, I have to do dev.off to close my device. Then if I would do my par mfrow 2 comma 1 on top, before the png, then I'm changing the plotting window in R. If I move this one down here, inside of the png, right, so inside of this output device, now this parameter is applied to the current device, and the current device is now the png device. And this is something that goes wrong often, and one of the students had an issue with it. So I want to mention it very clearly: this one means the parameter applies to the R plotting window, the current window which is open. This one means the parameter applies to the png. That's because par always pertains to the active device. And inside of the png, right, so between png and dev.off, the png device is the active device. But here, when I have no png and dev.off structure, then of course the active device is the R plot window. So just so that you guys know, as a little example. Let me remove it. And I hope that's clear.
So if you want to change, like, your font size, and you do your par, and you do that before you do the png, then the text in the png will not be bigger. And this is usually a source of, like, where is the error? Because I'm changing my font size, I'm making my font size three times as big, and I see nothing changing in my image. Where is it going wrong? And 99 times out of 100, it will be that you're setting the par before you're opening the output device. So if you want to set parameters, always do that after opening the output device. All right. That's it for the assignments. I'm just looking at my list. But yeah, that was it for the assignments. So I wanted to show you guys something else as well. And last year, I went through it in great detail.
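The rule about setting par after opening the output device can be sketched like this. The file name and the image size are hypothetical, and the example assumes that the R installation has PNG support.

```r
# Hypothetical output file in the session's temporary directory
outfile <- file.path(tempdir(), "two_plots.png")

png(outfile, width = 800, height = 800)  # the PNG is now the active device
par(mfrow = c(2, 1))                     # set AFTER png(), so it applies to the PNG
layout.in.png <- par("mfrow")            # query the layout of the active device
plot(1:10, main = "First plot")
plot(10:1, main = "Second plot")
dev.off()                                # close the device, which writes the file
```

If the `par(mfrow = c(2, 1))` line were placed before `png()`, it would change the R plot window instead, and the PNG would contain only the last plot at full size.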
And tell me if you want that, but I just wanted to show you, like, in the PowerPoint I promised that, right? So I'm going to give you an example of how you can plot data on a world map, for example COVID-19 data, or flu data, or whatever data you want. And this is kind of showing you all of the different things that we discussed during the last five lectures, to show you that you can actually make really good looking plots already if you combine all of the things that we've learned. So I wanted to show you something like this.
And the first thing that I need, if I want to plot something on a genetic map or on a world map, is a picture of a world map, right? So I took my picture of the world map from here, from solarsystemscope.com. They have really nice textures, like for Earth, the Moon, Mars, Venus, all of the planets in our solar system. And I just downloaded the Earth texture and I scaled it down to 50%, because they are really, really high definition. So that means that there are literally millions and millions of pixels on there, and I didn't want to have that many. So I scaled it down to 50%.
The COVID-19 data I get from CSSEGISandData, which is the main data provider from Johns Hopkins University. And you can actually download the data directly from GitHub. So the data is located at this position. So it's called githubusercontent, and then you have the provider, COVID-19, data. And then I want to get the time series data. And they provide data not only on the number of infected people, but also on the people that died and on the people that recovered. But we're just going to get the confirmed cases globally, because I'm not just interested in the UK. I want to have all of the numbers. So the thing that I can do, and I told you guys that that's possible, is that you can directly read from a URL. So I'm just going to do read.csv on the URL. And I'm going to read it in.
And of course, I have to set the separator, say that there's a header and these kinds of things. And I'm going to replace the missing values by zero. So if a value is missing, I assume that there were no infections that day, which of course may or may not be true. All right, so let's copy paste this in and have it download so that we don't have to wait too long. So it's just going to read in. So if we look at infected, the matrix that I just downloaded, this is the matrix that you can get online, right? So it has first the province or state, then the country or region, it has a latitude and a longitude, it has the dates, and it has the number of people who were infected at each date. And I think it's the cumulative number. So every day the number goes up and the number never goes down, because the number going down would be the number in the recovered column of the other data going up. So that was the first thing, right? So this is the data that we want to visualize.
And we've already seen this, reading binary data from a BMP file, right? So it's just a standard BMP file that I created. It's 1024 by 512 pixels. I just got it from this web address. So I'm just going to use the code that we had to get the red, the green, and the blue channel. And then I'm going to plot this image using this pointillism style, which we had in a previous lecture, I think lecture three or four, when we were looking at binary data and reading in data. So of course, I'm just going to read in this image, and then I'm just going to create my plot. So I'm going to plot, and then I'm going to use points to put the color onto the plot. So let me show you guys how this works. So I'm reading in the binary data, and then it's relatively slow because it has to plot every point. And then it just plots the world map into an image in R. So that looks like this. Let me make it a little bit bigger for you guys.
And so what it will do is it will just plot every pixel and give it the right color. So it will take the color that you had in the original BMP image, and it will just put it here. So we see Antarctica, Greenland, Europe, and these kinds of things. Let me make the plot a little bit better, right? Because the world map that you normally see in the Mercator projection is wider than it is high. So here we can see a very good reflection of the world map.
So now the next thing is I want to select some countries, right? Because I'm not going to plot all of the countries in one go, because then the whole map would be covered. Well, not at the beginning of the outbreak, but currently there's a lot of COVID going around. So we should select a couple of countries. So I have selected France, Germany, the Netherlands, Australia, Portugal, US, and China. And then the thing that I'm going to do is, I have the infected matrix, which contains the number of infected, and I'm just going to ask which ones of the countries slash regions are in my selected countries and regions. So I'm just going to make a subset like we've been doing a lot of times. All right. So let's just do that, show you guys in R, and let's make a subset. So now when we look at infected, and we look at the first 10 columns, we see that we have like Australia, China, France, and the Netherlands. And of course, some of these countries are split up into multiple regions. That's because the Netherlands still has some of its old colonies, just like France has, and China of course has its data separated by each of the different provinces, which is perfectly fine.
All right. So let's go back to Notepad++. So the next thing that I'm going to do is load the RColorBrewer library, because I want to use very fancy colours which are optimised for people who are colour blind, so that they can still see something.
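The country subset described above can be sketched on a tiny made-up table. The column name `Country.Region` is an assumption about how the header of the downloaded file ends up being named in R, the case numbers are invented, and `selected` is a shortened list.

```r
# Tiny stand-in for the downloaded table (made-up rows and numbers)
infected <- data.frame(
  Province.State = c("", "Hubei", "", "Aruba", ""),
  Country.Region = c("Germany", "China", "Brazil", "Netherlands", "US"),
  Cases = c(100, 200, 300, 5, 400),
  stringsAsFactors = FALSE
)

# Shortened selection of countries for the example
selected <- c("Germany", "Netherlands", "US", "China")

# Keep only the rows whose country is in the selection
infected <- infected[which(infected[, "Country.Region"] %in% selected), ]
```

Countries that report per province or territory, such as the Aruba row for the Netherlands here, stay in the subset as separate rows.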
So I'm going to select, so I have one, two, three, four, five, six, seven countries. So I want to have seven colours, so the length of selected. And I want to get them from the Accent set, which is one of the prepared colour sets that is available in RColorBrewer. And then I'm going to say, well, I store this as cold.countries, and then I'm going to give the colours names, and these names are of course the names of the countries that I have selected. So let me show you how that looks in R. So when I look at cold.countries, I now have for each country a certain colour that has been assigned to it. And of course, I think I made sure that the Netherlands is actually orange, because that's our colour.
All right. So the next step is to get the number of infected people on a certain date. So last time that I did this, we were at 7, 9, 2020. So that's October, 7th of October, I think, 2020. Of course, we could change this to like today, which would be five. Well, let's take a couple of days back, because the dataset might not be accurate up until today. So let's take the 24th of May, and then look at the number of infected. So now if we look at the number of infected, it shows us the number of infected people for each of the different regions that we have been looking at. Of course, I need to scale this, because I, oh wait, you couldn't see the R window, right? So here we see the number of infected people for each of these regions that we have selected. So there's a lot, a lot of infections in the last country, which is the US. And since the US is not split, all of the current infections in the US are listed in one number. But for every country which is split into regions, we just get the number of infected people per region at like three days ago. All right. So what do I want to do then? Well, the thing is that this is the big trick. And this is going from having the latitude and the longitude.
So hey, latitude and longitude are based on a globe, right, which is a ball. So what we are going to do is we're going to say, well, let me look here, right? So if we look at this map, then 0, 0 in latitude and longitude is actually more or less here, right? Because the zero meridian runs through London all the way to here. And then we have the equator, and the equator runs around here, right? So the 0, 0 point on this map is around here. But our map in R starts here. This is 0, 0. And this is 1024 by 512, right? Because this is just all the pixels that I had in my image, and my image is 1024 pixels by 512. So latitude and longitude speaking, this is 0, 0. But for R, this is not 0, 0. For R, this is 512 by 256. So we have to modify our latitude and longitude so that they correspond to the map. Otherwise, everything will be shifted.
So the way to do this and to get rid of this shift is to say, well, I take the latitude that I have, and then add 90, and then divide by 180, right? This is a transformation where you say, well, I take my point, the latitude of 0, then I add 90 to it, and then I divide by 180. Because latitudes span 180 degrees, from minus 90 to 90, while longitudes span 360 degrees, from minus 180 to 180. Right? So I'm just moving the range to fit our image. So I do plus 90, divided by 180, and now I have my latitude ranging from 0 to 1. And then I'm just going to multiply this by 512. So this is actually the height. And then I'm going to do the same thing for the longitude. For the longitude I add 180 and divide by 360, so now the longitude goes from 0 to 1. But the width of my picture is 1024 pixels, so I multiply by 1024. So I'm just going to take the latitude and the longitude from the matrix for the selected countries, and then I'm going to calculate the real positions on the map. And then I plot the data on the map using the standard plot. So I'm just going to say, get my points x and y. And then, hey, the point size is the number of infected divided by two.
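The coordinate transformation described above, written out as a small helper. The function name `to.pixels` is hypothetical, and the sketch assumes a 1024 by 512 map that covers the whole globe with the origin of the plot in the bottom-left corner.

```r
# Map latitude and longitude (in degrees) onto pixel positions of the image.
# Latitude runs from -90 to 90, longitude from -180 to 180.
to.pixels <- function(lat, lon, width = 1024, height = 512) {
  x <- (lon + 180) / 360 * width   # shift to 0..360, scale to 0..1, then to pixels
  y <- (lat + 90) / 180 * height   # shift to 0..180, scale to 0..1, then to pixels
  cbind(x = x, y = y)
}

# Latitude 0, longitude 0 ends up in the middle of the image
to.pixels(0, 0)
```

For a map of a single country the same idea applies, but the shift and the divisor have to be replaced by the latitude and longitude range that the map actually covers.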
And that's because we have to scale it in some way. Because I can't use a point size of 5 million in R, because that would cover the whole image. So hey, I take the log 10 and do some rounding. And then I just make sure that the biggest circle is not that big. But the circles are relative to each other, so they make sense and they are consistent. All right, so let's just do this. So we're going to get the number of infected people, then modify the longitude and the latitude, and then we're just going to put points on the map. So here we see the total number of infected people up until the 24th of May this year. And we can see the different countries. So we can see, for example, that Holland is orange. And we see here the Dutch colonies. Furthermore, here we have France. I think France is green. And you can see that France still has some colonies near Australia, they have some near Madagascar, and they have some other colonies. And the USA is here. And here we see China, where the outbreak originally began. And here we have Australia, and the Australians have divided it into one number for each of the provinces.
Right, so these are all things that you have already learned, and it's just putting them together in such a way that you can visualize whatever you want. And of course, we add a nice legend so that we know what we are looking at. And we could add the date on top. And this doesn't only work for a world map, you could do this for Germany as well. Or you could do this for the Netherlands. Just get a map of the Netherlands and then move your latitude and longitude positions so that you are in the range of where the map is plotted. And in our case, it's relatively easy because we have, well, this is zero. Right, so we have negative and positive longitudes based on zero. And we have latitudes which are also zero here and then go up and go down.
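One possible way to turn case counts into point sizes, in the spirit of the log 10 scaling mentioned above. The exact formula in the lecture script is not shown, so dividing the log 10 by two is an assumption, and the counts, pixel positions, and colours are made up.

```r
# Made-up case counts, pixel positions on a 1024 x 512 map, and colours
n.infected <- c(Netherlands = 1.6e6, US = 3.3e7, Aruba = 1.1e4)
x <- c(527, 240, 313)
y <- c(404, 364, 291)
cols <- c(Netherlands = "orange", US = "purple", Aruba = "orange")

# Assumed scaling: log10 compresses the huge range, halving keeps circles small
point.size <- round(log10(n.infected + 1) / 2, 1)

# Empty canvas with the size of the map, then one filled circle per region
plot(c(0, 1024), c(0, 512), type = "n", xlab = "", ylab = "", xaxt = "n", yaxt = "n")
points(x, y, pch = 19, cex = point.size, col = cols)
legend("bottomleft", legend = c("Netherlands", "US"),
       col = c("orange", "purple"), pch = 19)
```

With a log scale, a region with a thousand times more cases gets a circle that is only moderately bigger, so the sizes stay comparable without covering the map.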
So we first scale it from zero to one. And then we just multiply it by the size of our image, which is 1024 pixels by 512. So we'll put this script on Moodle for you guys as well, so that you can look at it and perhaps use it later on. Perhaps you have your own data, not COVID data, that you want to plot on a world map. Perhaps you don't want to use a world map. But for example, you have fish measured all over Germany and you want to plot the different lakes. The nice thing is you can also plot pie charts, right? So instead of having one circle or one big dot, we could show pie charts of the number of people who are infected, who are recovered, and the number of people who died. And so you can still use all of the available plotting tools that you have. It's just that the background is now a world map, or a map of Germany, or a map of whatever you want. All right. And with that, I am going to stop the record.