So, almost two. Let's do a sound check. One, two, three, check. One, two, three, check. I hope people can hear me. If you can hear me, then write something in chat. Make me happy, chat. "Sound is check." Good, good, good. Thanks, Misha, for checking my sound. Welcome. So, today we're going to have an interesting lecture, I think. It's a combination of several different lectures. Now, you should be able to see me as well. So, we will talk about logistic regression, common idioms, and I will probably explain how to make an R package, but that depends a little bit on the time that we have and the amount of questions that there will be. "Got visual." Very good, very good. So, then everything's working. That's perfectly fine. So, I've been getting a lot of emails about the exam, which apparently is not in Agnes yet, or at least some people can't see it. I've been getting some mentions about the Prüfungsbüro not answering emails, and the big issue is that I unfortunately cannot do anything about it, because I can only do one thing in Agnes, and that is input grades. I can fill in grades, and that's where my power ends, unfortunately. So, for everyone who is having issues with registering for the exam, please just send an email to the Prüfungsbüro, because they are the only ones who can do that, and I know it is difficult. I just looked on the website, and it seems that no one is there. Of the four full-time employees that should be working there, two of them are sick and two of them are on holiday, which is not uncommon. It's more or less the de facto standard. So, when the exam period starts, all of the people responsible for managing this are not there. But just send them emails. Look on the website to see who their bosses are, and if you don't get a response within a week, start emailing their bosses as well, because that's the only way that they will ever start replying.
One of the other tips that I can give you is: go to the Institutsrat. So, figure out when the Institutsrat meets, and then just show up with five or ten students and start loudly complaining about the fact that you cannot register for exams and you cannot get your paperwork, and that they should fix it. That's the only way this works: if no one complains, no one changes anything, and the Institute needs to be made aware of it. Anyway, with that out of the way, welcome, welcome, everyone. Logistic regression and common idioms. So, let's just show the first slide. So, the exam will be on the 28th of July. It will be via Zoom or Moodle, depending on the number of people that register. And the re-exam will be on the 23rd of September, again at two, and either via Zoom or Moodle. So, sign up via Agnes, if you can see it, if it's there. Let me actually quickly check my Agnes, because I can log in to see if anyone has already signed up. Right. So, no, my course does not seem to be in Agnes yet, because I can only see the other course from the other person, for which I can fill in grades. They already have five people who signed up, but my course is not there. So, I will also start mailing the Prüfungsbüro, and I will also start calling them to make sure that they put it into Agnes. Anyway, with that out of the way, try and sign up for the exam. There's nothing really that I can do about it, and that's just a pain, because if they would just give me the rights, then I could put my own exam in there and register you guys, but for some reason they don't want to do that. So, we'll just have to live with that. Good. So, first off: I asked you guys to read the PDF from Bodo Winter. We had the lecture last week about repeated measurement models, so that part of the more or less linear model equation.
The week before, we talked about general linear models, so standard linear models, for when you have individuals measured one time and you're just interested in covariates and covariate structure, and then we talked about repeated measurement models last time. I got one question via YouTube, I got one question via the comments, and I answered that. And I think it's very important to understand that you only need a linear mixed model if you have relatedness between your individuals, or if you have measured individuals multiple times, right? Because the standard linear model has the underlying assumption that all of your measurements are independent, so that they are done on independent animals, or independent humans, or independent days, or whatever, right? So, you can always use a general linear model when everything is independent. However, you often run into a situation where this is not the case. For example, you can think about human research in the hospital, right? Generally, when you look at, say, 150 people from a region, there will be some relatedness, right? Some people are more closely related to each other than others. And then you need to account for this fact. We generally call this population structure. So, when you have to deal with a more complex population structure, where individuals are not independent because they share a father, they share a mother, or they share a great-great-granduncle, then this has to be told to the ANOVA. And the thing that compensating for population structure does is that you are better able to estimate the exact number of independent samples that you have, right? So, you have k, which is the number of parameters that you are estimating, and you have n, which is the number of measurements that you have. And the bigger n is relative to k, the more power you have to do an association. Good.
So, I wanted to make that clear, because I think that from the last lecture it was not entirely clear when you should use a linear mixed model, and when things are a random effect versus a fixed effect. So, if you are confused, or if you don't know what to do, then my standard advice is: first put up a model using fixed effects, because a fixed effect is always valid. Including something as a fixed effect just means that you get an unbiased estimate of the beta, the regression coefficient, which tells you that with a certain increase of x, you get a certain increase, or a certain decrease, of y. And the fixed effect betas are always unbiased, and they are always true. So, first fit stuff as fixed effects, then change to using them as a random effect if you think that it might be needed to treat them as a random effect, and if that is the case, then the betas should be very similar. So, the beta of the random effect should be in line with the beta of the fixed effect, because the fixed effect's beta is the most unbiased estimate. Good. So, if there are no questions about linear mixed models, then I assume that everyone understood the PDF that Bodo Winter published online, which I used as the basis of my previous lecture. If there are no questions, then that's good, and then I assume everyone read the PDF, everyone understood it, and it will be part of the exam. So, on the exam, I will ask some questions which come from the PDF. So, do read it, because it's a very, very good introduction, I think, into repeated measurement models, or mixed models. All right. So, today I wanted to extend this a little bit, because we want to talk about what you can do when your phenotype, or your measurement of interest, does not follow a normal distribution, right? Because the top part of this model more or less assumes a normal distribution, or assumes normal residual error.
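The fixed-versus-random sanity check described above can be sketched like this. This is not from the lecture itself: the data is simulated, and it assumes the lme4 package is installed. The point is that the beta for x should come out very similar whether the grouping variable is fitted as a fixed factor or as a random intercept.

```r
# Sketch: compare the slope estimate when the grouping variable is
# a fixed effect versus a random effect. Assumes lme4 is installed.
library(lme4)

set.seed(42)
subject <- factor(rep(1:10, each = 5))              # 10 individuals, 5 measurements each
x       <- rnorm(50)                                # predictor
y       <- 2 * x + rnorm(10)[subject] + rnorm(50)   # true slope = 2, plus a per-subject offset

# Fixed-effect fit: subject as an ordinary factor covariate
fit.fixed  <- lm(y ~ x + subject)

# Random-effect fit: subject as a random intercept
fit.random <- lmer(y ~ x + (1 | subject))

# The beta for x should be nearly identical in both models
coef(summary(fit.fixed))["x", "Estimate"]
fixef(fit.random)["x"]
```

If the two slope estimates disagree badly, that is a sign something is wrong with the random-effects specification, since the fixed-effect beta is the unbiased reference.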
But what about when you have a case control study, right? If you have cases and you have controls, or what happens when you have a binary response variable, right? You're doing, for example, drug research, and you're giving people a drug, and these people respond to the drug or they do not respond to the drug, right? Then you have a 0-1 story, and of course, when your input data is 0 or 1, non-responders versus responders, then your residuals will never follow a normal distribution. Fortunately, there are ways to deal with that, and these ways are called generalized linear models. So a general linear model is a model which assumes normality. A generalized linear model does not assume normality, but it assumes a different type of distribution, like logistic regression, or binomial regression, or log-linear regression. So we still need to know the underlying structure of the data. We still need to be aware of the distribution. But with this distribution, we can then still do linear models; we just have to tell the linear model that we are doing something else, right? And then of course, we have mixtures of these. So you have GLMMs, which are generalized linear mixed models, and you also have mixtures between general linear models and mixed models. Good, so like I told you guys, with generalized linear models, the question that we want to answer here, or the problem that we're solving here, is: what happens if the response variable is not a continuous variable? So what happens if it's binary, or if it's a categorical variable? And then I want to talk to you about common idioms, but I also wanted to explain to you guys what the difference is between a long and a wide format. Because we see this a lot in people that are gathering data for their experiment: they gather it in the long format, and they actually need it in wide format.
And I will talk to you guys about what the difference is, how you can detect which is which, and then in the end show you some examples of how you can use an external package to very quickly go from long-formatted data to wide-formatted data and back. We will also be talking about some idioms. These are basically code snippets that I use a lot, and that I wanted to give to you guys, so that when you run into these things in the future, for example, "I need to convert dates to seasons", you have some code ready that you can just use in R to convert your months into, for example, seasons. And then I wanted to tell you guys how you can use R to build pipelines. Pipelines are more or less large analysis scripts where every part of the pipeline is dependent on the previous parts of the pipeline. And for this, of course, you need to be able to run your own R scripts from the command line. So the script needs to take input from the environment, or needs to take input from the command line itself, but you also need to be able to execute external programs. If you think about whole genome sequencing, then there are a lot of programs and code out there which need to be executed, right? Because the functionality is not available in R itself, but there's an external program. And this also comes into play when you do things like image analysis. We have, for example, big image analysis packages like ImageJ, which are there to do stuff based on images, right? Make them bigger, make them smaller, change the resolution. And these programs are very useful when you build a pipeline, right? You might have a microscope where you took pictures of cells, and then you want to use a program to determine the size of the cells, or the content of the cells. And you want to do this not just for one picture, but for hundreds and hundreds of pictures that you've gathered over the last year.
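A minimal sketch of such a pipeline step in R might look like the following. This is not the lecture's own code: the external program name ("convert", from ImageMagick) and its flags are placeholders for whatever tool your pipeline actually needs, and the script is assumed to be run via Rscript.

```r
# Sketch of a pipeline step: take file names from the command line and
# call an external program on each one. "convert" is a placeholder for
# any command-line tool (ImageMagick's convert is used here as an example).
args <- commandArgs(trailingOnly = TRUE)   # e.g. Rscript resize.R img1.png img2.png

for (infile in args) {
  outfile <- paste0("small_", infile)
  # system2() runs an external command and returns its exit status
  status <- system2("convert", c(infile, "-resize", "50%", outfile))
  if (status != 0) stop("convert failed on ", infile)
}
```

The key pieces are commandArgs() to read the script's own arguments and system2() to run the external program and check that it succeeded, so that a failure early in the pipeline stops the later, dependent steps.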
Then what you normally do is you build a pipeline, and this pipeline will go through all of the steps that need to be taken, and it will do this for each picture, or for each sample, that you have analyzed. And then, if we have any time left, I will very quickly run through how to build an R package. I've done this part a couple of times; the R package part is also in last year's lecture and in the bioinformatics course, where I generally go through it very quickly. But today I've added some additional things, like how to put C code behind R packages. And I just wanted to show you guys how you can wrap up your code, right? So when you have written code, you want to have other people use this code; if you are just the only one using your code, then it's not really useful, right? The idea behind science is that you can collaborate with other people, not just on writing the code: if you make something which is amazing, you want to share it with the rest of the world, and then people need to be able to get your code. And the way to do this in R is via R packages. So I hope that we have enough time to do that as well. Now, about next week: I will be going to a conference in Rotterdam on Sunday, I think, and I will be in Rotterdam the whole week. So next week, I do not know exactly if we will have a lecture. I will try to do it, but we will have to see how it works out. I'm at least going to take my headphones and some of the setup that I have here with me, just in case I get the opportunity. If this is not the case, so if by any chance I get stuck at the conference, or I don't have internet, or these kinds of things, then we will drop the lecture and we will do it the week after. But because we only have like two weeks left, I can do a test run if I have the time. Yeah, we'll have to see how it works out in Rotterdam.
Like, I don't know the conference center, I have no idea if there will be proper internet, how many people there will be, and these kinds of things. So we might do a test run, we might do something else, but at least be aware that next week will be a little bit different. But since we only have next week and the week afterwards for the lectures, and then I think we are already at the end of the semester, I don't want to drop it; my main concern is that I don't want to skimp out and have you guys miss a lecture. So, yeah, I will keep you informed. Misha, no problem. All right, so those are more or less the things that I wanted to talk about. All right, so let's talk about regression when we deal with different response types. Right, so when we do regression, what we want to know is the relationship between a dependent variable and an independent variable, and we are sometimes dealing with different response types. Right, so we can have dichotomous outcomes, like sick versus healthy, or pass versus fail, and then we need to do logistic regression. So instead of doing standard regression, we need to tell the algorithm that the outcome is not a normal distribution and that it's not a continuous distribution. So for this, I found a really nice data set. The data set is about admission into universities, or admission into one certain university, and there is a binary response, because people either get admitted to the university or they don't get admitted, right? So there's a zero-one outcome. And then, to study what actually influences your admission rate to a university, they have certain predictors, like the GRE, which is the Graduate Record Examination score. They also have the GPA in there, which is the grade point average. And then they have something called the rank, which is the prestige of your high school.
Right, you can have a very prestigious high school, and it might be that that increases your chance of being admitted, and the same thing might hold for having a high GPA. Right, so that's the thing that they wanted to investigate using this data set. So this is about UCLA entries. So first things first, we need to load in the data. Fortunately, the data is just available online, so we can just use read.csv to read a comma-separated file with the admission data that has been provided by UCLA. And I store it in a variable called mydata. Very basic, nothing really special to see here. All right, so first things first: if we want to do logistic regression, we have to make sure that none of the groups that we are analyzing are empty. Right, so for example, if we look at the rank, so the rank, or the prestige, of your high school, then that was classified into four different levels. Right, you can have rank one being a very good high school and rank four being a very bad high school, more or less. Right, but the main issue with logistic regression is that you are only allowed to do it when none of the groups are empty. Right, so you first have to check and make sure that none of the admission groups based on the rank are empty. So that's what we do here. We use a two-way contingency table, and we can just use the xtabs function for this. So what I'm doing here is just checking in R to see that none of these groups are empty or have a missing value. Right, so what I'm saying is: make me a cross-tabulation; I provide my data via the data parameter, and then I'm just going to see, well, I want to model the admission by the rank. Right, so the admission is a binary variable, either zero or one, not admitted or admitted. And in the end, we have the rank, which is the prestige of the high school. And already we can start seeing that there seems to be a relationship between the two.
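The loading and checking steps just described look roughly like this. The URL is the UCLA example file commonly used with this data set; if it has moved since the lecture, any local copy of the same CSV works the same way.

```r
# Load the UCLA admission data (admit, gre, gpa, rank).
# URL is the commonly used UCLA example file; adjust if it has moved.
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
head(mydata)

# Two-way contingency table: check that no admit-by-rank cell is empty
xtabs(~ admit + rank, data = mydata)
```

If any cell of the table is zero, the logistic regression results for that factor cannot be trusted.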
Right, for rank one schools, more people get admitted than not admitted, while, for example, a rank four school shows more or less the complete opposite story: out of the 67 people who applied, 12 got admitted and 55 did not get admitted. So directly from the data we get the impression that, yeah, the rank of your high school does have an influence on your admission to UCLA. But let's test this a little bit further; remember, none of these groups are allowed to be empty. Right, so you would do the same thing for GPA and for your GRE, although it only matters when you have a factor. Right, because GRE and GPA are probably continuous variables, and then of course there will be empty groups; that's expected. But if you want to do logistic regression and you have a factor response, right, yes/no admission, while you have a predictor which is also a factor, then it is not allowed for one of the groups to be empty. So this is something that I want to stress: really make sure that, when one of these cells is zero, the whole result is not to be trusted when you do the modeling. So xtabs: a very useful function to make two-way contingency tables. All right, so after checking all the possible contingency tables that we have, we want to perform logistic regression, also called logit regression. So the first thing that I want to make sure of is that the rank of the high school is definitely a factor variable. Right, because we don't want to model this as a numeric: the rank is one, two, three, and four, so we want to tell R explicitly not to treat this as a numeric variable, but to treat it as a factor, a grouping variable. And then we just say, well, we model all of the data using the GLM function, the generalized linear model function; standard, very similar to the LM function that we saw before.
So in a GLM, we can use the same way of writing down our model, so we say that the admission to UCLA is based on your GRE, your GPA, and the rank, the prestige, of your high school. We provide the data using the data parameter, and then we specify the family, and the family in this case is binomial, right, because we have a zero-one response variable; admission is zero or one, which means that it is binomial. And then we just print the summary, just like we did before: we put up the model and then we look at the summary of the model. So the summary of the model looks like this. What we can see is that everything is highly significant, or relatively significant, right? So the GRE and the GPA are significant factors which influence your admission, right? These are numeric values, right? You can see that the GPA has a relatively large positive effect on your admission chance, which is logical, right? Because the higher your GPA, the more likely it is that UCLA will admit you, and that is not uncommon, because, like, in the end, they have to select for something, and, of course, they're going to select based on your grades. So what we see is that all of them are different. But I wanted to zoom in a little bit on the ranks, right, because here we see rank two, rank three, and rank four. So we're missing rank one. But that is, of course, because all of these betas for the ranks are estimated relative to rank one, because it's the lowest number. So you get a beta which is not absolute; it is a beta which is relative to the rank one school, right? So a rank two school gives you a beta of about minus 0.7, relative to a rank one school, for getting admitted. And then we can see that the lower the rank of the school from which you are applying to UCLA, the less likely you are to be admitted. But here, in this case, we get p-values for each of these, right?
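The model fit just described can be written as follows. The first line repeats the data loading from the earlier step so the block runs on its own; the URL is the commonly used UCLA example file.

```r
# Setup as before (repeated so this block is self-contained)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

# Make sure rank is treated as a factor, not a numeric
mydata$rank <- factor(mydata$rank)

# Logistic regression: binomial family for the 0/1 admit outcome
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
# rank2..rank4 betas are reported relative to rank1, the reference level
```

Note that rank1 does not appear in the output: it is the reference level, and the other rank betas are relative to it.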
We generally only want to know how significant the rank of the school is in total, right? Because when we write a paper, then in the paper we're going to write something like "the prestige of your high school influences your admission to UCLA", and then you would want to write a single p-value, not three p-values. You're not interested in the rank two school versus the rank one, or the rank three versus the rank one. You're interested in how rank as a whole influences it, right? So you want a single p-value for the whole rank. So the question remains: how significant is the rank of the school in total? For that, we can use a little trick, because we can install the aod package, which provides the wald.test function. So we are going to use install.packages("aod"), then we have to make it active, and then we can use the wald.test function to compute the significance of the rank variable. But we need to know which of the coefficients we need to combine together, right? Because we have three different rank coefficients, rank two, rank three and rank four, which are all relative to the rank one school. So I'm first just going to look at the coefficients that were estimated, right? The intercept is negative, which is okay, and then we have the GRE, which is slightly positive. The GPA has a very positive effect on your admission, and then we see rank two, rank three and rank four: very good, the exact same numbers that we had before in the summary. So from this we learn that, counting along the coefficient vector, one, two, three, four, we want to group the fourth, the fifth and the sixth coefficient together. We could actually merge all of the coefficients into one, right? We could get a single p-value, but that's not what we're looking at. We want to know the p-value, the probability, that rank is influencing your admission. So four, five and six are the ones that we want to summarize. So how do we summarize that then?
So we can combine the significance of terms four, five and six. So what do we do? We say wald.test; b is the betas, so we are going to give it all of the betas. Then Sigma is the variance-covariance matrix of the model that I have, right? So b is the coefficients of my model, and Sigma is the variance-covariance structure of the model itself. And then I'm going to tell it that I want to summarize terms four through six. So four, five and six. If I do this, I get a single probability, after the Wald test did a chi-square test across these terms four, five and six. And then it turns out that the combined probability that the rank of your high school determines your entry into the university is actually very significant, at 0.0001. It also tells you the number of degrees of freedom that is taken up: because we have three groups, we have three degrees of freedom that we estimate. Good. So that is how you can use logistic regression, and this is how, when you have grouping variables, so factor variables, you can combine different factor levels to get a single p-value for the prestige of your high school. So normally we actually write down the betas, right? When we have linear regression, we are interested in betas. But in this case, we have more or less an admission versus non-admission, right? So it's like a responder versus non-responder. In case control studies, or in these kinds of binomial outcome studies, we generally report odds ratios, right? So instead of reporting the beta, we report the odds: the chance of you being admitted, or the chance of you getting sick, like "smoking increases your chance of getting sick by X percent". So we generally write it down as the odds ratio. So what do we have to do? We have to take the inverse of the natural logarithm, which is the exponent. So what we do is we take the exponent of the coefficients, and the exponents of the coefficients will tell us the odds ratios of being admitted.
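The Wald test step looks like this. The first lines repeat the model setup so the block runs on its own; it assumes the aod package is installed, and the URL is the commonly used UCLA example file.

```r
# install.packages("aod") first if needed
library(aod)

# Setup as before (repeated so this block is self-contained)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")

coef(mylogit)   # terms 4, 5 and 6 are rank2, rank3 and rank4

# One combined p-value for rank as a whole: a chi-square test
# over coefficients 4 through 6, on 3 degrees of freedom
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)
```

The output is a single chi-squared statistic with three degrees of freedom, which is the one p-value for rank that you would report in a paper.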
And if we need the 95% confidence interval, we can get that as well. So we can make a little matrix: we say that our odds ratios are the coefficients, and then I'm just going to use the confint function to get the confidence intervals from my model, right? That will give me two vectors, one for the lower bound and one for the upper bound. In this case, it's the 95% confidence interval, but we're doing a two-sided test, so both sides of the distribution will be at the two and a half percent line. So what we see, when we take the exponent of the matrix that we just made, is for example that your GRE does not affect your admission rate so much, right? Because it's only a 0.2 percent increase in the chance to be admitted. However, we can see that your GPA has a massive influence on your admission rate, because we can see that it's 2.23, which means that having a high GPA increases your chances of getting admitted by 123 percent above baseline. Rank 2 decreases your chance by 50 percent, coming from a rank 3 school decreases your chances of being admitted by around 75 percent, and coming from a rank 4 school decreases your chances of being admitted by around 80 percent. And we see here also the confidence interval, so we know between which values the real value should lie. So that is logistic regression. And the nice thing is, using this GLM function, you can do it for any of the distributions, right? The LM function is just a convenience wrapper around the GLM function, where the family parameter is set to Gaussian. So if you would do a GLM using a model and then set the family to Gaussian, then the results will be exactly identical to a standard LM function call.
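The odds ratio step described above, in code. Again, the first lines repeat the model setup so the block runs on its own (the URL is the commonly used UCLA example file).

```r
# Setup as before (repeated so this block is self-contained)
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")

# Odds ratios: the inverse of the natural logarithm of the betas
exp(coef(mylogit))

# Odds ratios with their 95% confidence intervals:
# bind the coefficients and the confint() bounds, then exponentiate
exp(cbind(OR = coef(mylogit), confint(mylogit)))
```

An odds ratio above 1 (like 2.23 for GPA) increases the odds of admission; below 1 (like the rank terms) decreases them, always relative to the reference level.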
But we can not only specify binomial, right, 0-1; we can also say, no, the data that I have comes from a gamma distribution, or the data that I have comes from a Poisson distribution, or an inverse Gaussian distribution. So there are many different distributions that you can choose from. And of course, which distribution you should choose depends completely on your data. So the better you understand your data, the better you understand more or less how your data was generated, the more easily you can choose the distribution, right? If I'm, for example, counting how many bees I observe on a flower, then this will be a Poisson distribution, right? Normally I will see zero bees very often, one bee sometimes; two bees is already very uncommon, but three or more bees, three, four, five bees, on a single flower will be very, very unlikely. So you get this kind of Poisson distribution. Poisson distributions are count distributions; that's one of the things that you have to take into account, because the value can only be 0, 1, 2, 3 and so on, it can never be 1.5, right? So if you have a 1.5 in your data, it's not going to be a Poisson distribution, because Poissons are count data. So the most commonly used families are the Gaussian, the binomial, the gamma distribution and the Poisson distribution; I actually never use the inverse Gaussian. The gamma distribution is a little bit of a weird distribution, because it's a distribution that you can parameterize, so you can tell it what the mean and these kinds of things are. The gamma distribution is there to model continuous variables which do not follow a standard Gaussian, and then you can use several flavors of the gamma distribution to approximate your distribution. Good, so that was everything that I wanted to tell you guys about regression. I told you guys about general linear models: when you have a Gaussian distribution and all of your samples are independent, you can just use the standard lm function.
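The bee-counting example above can be sketched with simulated data; the variable names here are made up for illustration. The distribution simply goes into the family argument of glm.

```r
# Simulated count data: bees observed per flower, so Poisson is appropriate.
set.seed(1)
flowers <- data.frame(sun = runif(100, 0, 10))            # hours of sun (made-up predictor)
flowers$bees <- rpois(100, lambda = exp(0.2 * flowers$sun - 1))

# Poisson regression: family = "poisson" for count outcomes
fit <- glm(bees ~ sun, data = flowers, family = "poisson")
coef(fit)

# And the lm/glm equivalence mentioned above:
# lm(y ~ x) gives the same results as glm(y ~ x, family = "gaussian")
```

Only the family changes between logistic, Poisson and Gaussian regression; the model formula and the rest of the call stay the same.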
If you have repeated measurements, so you have measured the same individual multiple times, or you have individuals who are related, then you have to account for this relatedness or for these duplicate measurements, and then you need to use repeated measurement models, or mixed models; this also holds for time series data, of course. And then, when your data does not follow a normal distribution, you have to use generalized linear models, and if you have the same problem there, so individuals measured multiple times or relatedness, then you have to switch to using generalized linear mixed models. Good. So, I told you guys that I wanted to quickly show you the long format versus the wide format. And this is very common, because a lot of algorithms that you will use in the future accept only one of the two. Some algorithms need long format data, other algorithms need wide format data. For going from the wide format to the long format, we can use the reshape2 package. So what exactly is a wide format? A wide format means that you have all of your measurements on a single unique identifier; it's kind of like when you have a primary key in your data set. So when, for example, we have Sydney, January, so we have a city and then a time at which we did our measurements, then for each of the variables that we measured we have a column. That means that in every row we have the exact same thing: we have a certain city, a certain time that we measured in the city, and then we have all of the different variables that we measured, each variable in its own column. This is called the wide format.
If you would want to have the same data in a long format, you would have something which looks like this. You would still have the city and the month, but these now would not be a unique identifier anymore, because now we have an additional column telling us what was measured, which variable we measured. So we have, for example, minimum temperature, maximum temperature, maximum wind, and so on. Instead of having a column for each of the variables that we measured, each of the variables now takes up a row, and the value here is just the value of the measurement. So this means that in January the minimum temperature measured in the city was 19 degrees Celsius, and so on. This is what we call the long format. There are advantages and disadvantages between using wide format and using long format, but many people that collect their own data generally collect it in a wide format, because the wide format is just easier: for every city and every month, or for every animal and every date that you did the measurements, you just have all of the measurements in one row, right? Every row also means that every value you see in a column is related to the other values in that row, and that is of course not the case when you look at the long format. When you have long format data, every value in the value column is only interpretable when you know which variable has been measured. All right, so reshape2 is the library that you can use to go from the wide format to the long format and back. So how do we do this? Well, first a question to you guys. Here we have the airquality data set; we used the airquality data set already in the original regression lecture where we talked about regression. So I want to ask you guys: is this long or is this wide? Just throw your answers in the chat, and I can have a sip of coffee while we wait for that, because it is very important that you can
recognize if data is in a long format or in a wide format optimistic card thank you for the compliment like thank you yeah well English is also my second language like I when I grew up I grew up speaking Dutch but fortunately in Holland it's very common to have English as a second language so thanks so much for the compliment do you have an opinion if this is long or if this is wide and there are no well there are wrong answers but like I don't care about it's not that you get points subtracted for giving a wrong answer I might give out plus points for people that give the correct answer so we're just going to wait a little bit until someone says something in chat actually I have like some super nice audio effects for that so we can have crickets chirping in the background until alright so we have our first answer so we can stop the crickets I think it's wide and why do you think it's wide then I say long oops sorry for caps that's perfectly fine alright so the format that we're seeing here is of course a wide format no idea what we're talking about that's fine that's fine but it is a wide format and how can we recognize that well the main recognizing feature about the long format is that it needs to tell you what was measured right so if you have a table and it tells you this was the minimum temperature or it tells you this was the maximum temperature then you know directly oh it is in long format because every row has a different different value right so the value at the first value could be in degree Celsius the second value here max wind could be in Beaufort or whatever scale you are using to measure speed right so it is a very you can directly see if it is long based on the existence of a kind of variable column and since there's no variable column here right there's no column telling us what the value is this is definitely going to be wide yeah so wide is all the data on one row yeah well it does not all of the data right because you can see that five and one 
So on the first of May you did these measurements, and on the second of May you did those measurements, but the big difference is that in each of the value columns, each column which is numeric, the unit is exactly the same, and you don't need an additional column telling you what was measured. So what if we want to go from the airquality data set, which is generally in wide format, to the long format? First a quick overview of wide versus long: the identifiers are the columns that uniquely identify a row; in the slide example it is the sample name. Question from chat: do you think wide is easier to plot? For base R, wide is easier; ggplot requires long format, or at least it works really well with long format data. If you have wide format data, ggplot is really difficult to work with, but for base R it's the opposite: base R generally prefers wide format, because you can just take out one column and everything in that column has the same unit. ggplot is a little different because it likes these grouping variables, the column describing what the values mean in relation to the other ones. So it really depends on which plotting routines you are using. The first thing to see is that we have identifiers: one or more columns that uniquely identify a row. In the airquality data set these are the month and the day; the month alone does not uniquely identify a measurement, but the combination of month and day does. In the long format you need three columns instead of two: the city and the month, plus the variable which was measured, to make a row unique. Then we have the measured variables, the columns that contain the measured values, which in the long format is generally only one column; all the other columns are there to uniquely identify the measurement and which type of measurement was done. There are two terms for these conversions: going from wide to long format is officially called melting, and going from long to wide is called casting. Casting is a bit of a misnomer from a computer science perspective, because in many programming languages casting means changing the type of an object, say going from an integer type to a floating point type, or from a float to a double, but in statistics casting means going from a long formatted data set to a wide formatted one. So how do we melt the airquality data set, which is in wide format, into a long format? We say melt, a function provided by the reshape2 package; it is not available in base R. We melt the airquality data set, we specify which columns uniquely identify each row, in this case the month and the day, and we specify which variables were measured: in the airquality data set there are four measured variables, the ozone, the solar radiation, the wind speed and the temperature. When we look at our molten data, it now looks like this: we have the month, we have the day, these two identify the row, and then we have a new column called variable which says ozone, so this is the ozone measurement on the 1st of May, and it had a value of 41.0. One thing we can observe is that the missing values from the original data set are carried over: if we go back to the original data, we see that on the 5th of May ozone was indeed not measured, so the NA values are here as well. But the nice thing about the long format is that you can actually remove the NAs: we could just scratch out that row, the data set would become smaller, and we don't have to remember the NAs in a long format. Getting rid of them is really easy: when melting, we just add na.rm = TRUE, and then when melt encounters an NA in the wide data set, it will not reserve a row for it in the molten data set. Here we show 10 random rows, using the sample function instead of just looking at the first 10, and indeed we see the wind, temperature and solar radiation variables are all in there, each with its own value. Going back to wide format works much the same way; since the name cast was already taken, the function in reshape2 is called dcast. We say take the molten data set, because I stored my data after melting in a variable I named molten, and then we specify the identifying columns using a formula: month plus day, which says that month plus day uniquely identify each row, and the variable column tells you which variable was measured.
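Put together as runnable code, the melt and dcast round trip on airquality looks roughly like this, assuming the reshape2 package is installed:

```r
library(reshape2)  # install.packages("reshape2") if you don't have it

# Wide -> long: Month and Day identify each row; the remaining four
# columns (Ozone, Solar.R, Wind, Temp) become the measured variables.
# na.rm = TRUE drops rows that would only hold an NA measurement.
molten <- melt(airquality, id.vars = c("Month", "Day"), na.rm = TRUE)
head(molten)
molten[sample(nrow(molten), 10), ]  # 10 random rows instead of the first 10

# Long -> wide: the formula says Month + Day uniquely identify a row,
# and the 'variable' column supplies the new column names.
recast <- dcast(molten, Month + Day ~ variable)
head(recast)
```

Compared to the original airquality data set, the recast one just has Month and Day moved to the front.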
Looking at the recast data set, it is exactly identical to the original, except that the month and the day are now the first columns compared to the original airquality data set. So melting means going from wide to long, casting means going from long to wide, and it is relatively easy to do. You will be doing this a lot, because depending on which algorithm you use, it will tell you: I need the data in long format, or I need the data in wide format, and this lets you go from one format to the other very quickly. Good, so next up are the common idioms, but I think I'm going to do a short break first; we're still a little ahead of time, it's only ten to three. I want to go with you guys through some snippets of code that I use over and over again: things that I have figured out once during my career and have been reusing ever since. These are called idioms; we have a lot of idioms in the English language as well, figures of speech, and the same holds for programming. So with that out of the way, this is the first idiom that I wanted to discuss, but we're not going to yet, because I'm switching you guys to the first slide show, which I think is sea life, with some music, and I will be back in 5 to 10 minutes. If we're lucky we can get through all of the idioms around 3:30, and then we have half an hour for R packages, but that depends a little on our speed, so I don't know if there will be one or two breaks today; I don't want the lecture to run too long, because we have already had a lot of lectures. But the common idioms are things I use a lot, so you will be using them a lot as well. Good, so time for a
little break: the first break is sea life, and the song you're going to listen to is called Afterglow, so I will see you guys in 5 to 10 minutes with the animated GIFs. So, I made it back in time. A question from chat: "I'm going to start studying precision medicine and this will involve R programming; I'm learning the basics on DataCamp. Any advice you could give me?" Well: practice, practice, practice. It's the only way you will learn R, or learn programming in general; you just have to do it. Open up a text editor like Notepad++ and start programming stuff. If you want to program something fun, use the emoji library in R and start doing crazy stuff with emojis. The only way you're going to learn is by typing code, and typing as much code as possible. Another really good way is to go on GitHub, which used to be its own company and is now owned by Microsoft; there is a lot of code on there, so you can read other people's code and learn from how they do things, although this is generally much harder when you only know the basics. If you're interested in learning to program, follow the lectures here; they give you a very good basis to start from: an introduction to R, writing a little bit of script, reading in your own data, learning about distributions, normal, uniform and these kinds of things, and then continuing with basic analysis using linear regression. That is more or less the basis of everything, and from there you can start answering your own questions, and these questions just come up sometimes. I can still remember one of the things I programmed: I wanted to figure out whether there was a relationship between the weather and stock prices. For some companies the logic is obvious; for an ice cream company you would guess that it makes more money, so its stock price would go up, when it is summer. To prove this you need data, so you open up Google and see whether you can get historical stock prices from somewhere, and you end up with something like Google Finance, where you can download 20 years of historical stock price records. Then you think: now I have my stock price data, and I need data on the weather. So you google around for a meteorological station that provides weather data: temperature, humidity, and these kinds of things. And when you have both of these data sources, you just write a little bit of code to connect them together. In the end it is about asking questions, and these questions can be as simple as possible, because you want to start off very, very simple, answering very simple relationships. Another example: a friend of mine once asked me, isn't it weird that all of your best friends seem to have their birthdays in a very short period of the year? Thinking about who my best friends are and when their birthdays fall, it did seem very clustered. The funny thing was that he was born in the winter and I was born in the summer, and our friends clustered on opposite halves of the year: most of my friends had their birthdays at the beginning, between February and April, while his were more at the end of summer, August, September, October. So we thought: how do we figure out whether there really is a relationship between when you were born and when your friends were born, the month of the year? Of course, this is something you could just invest some time in: take a little notepad, walk up to a random stranger on the street, ask him when his birthday is and when the birthdays of his three or five best friends are, and after asking a hundred people you have a little data set together and can start answering this question. Another way you could do this in the past was via Facebook: if people had their birthdays publicly available, and their friends did too, you could use web scraping, reading in public profile pages and putting it all together in a big matrix. I think one of the metrics we used was how often people wrote on each other's Facebook wall, as a proxy for how close the friendship was. Then we did some statistical testing, and it does turn out that when you were born is a pretty good predictor of when your five best friends will have their birthdays. There is a relationship: if you are a summer child, your best friends tend to be born at the end of winter, while if you are a winter child, your friends tend to be born in summer. It is not a universal constant, but it holds. So that is my advice: start coming up with questions that you think are interesting to answer, and start writing little programs for them. And if you are not able to get the data, just start simulating data. This is one of the really, really nice things about R: it has all of these built-in distributions, so you can actually generate a data set from nothing. You say, well, I imagine it would be like this, this and this, you program that in, and you start generating random data based on some of your assumptions, and that is what R is really good for. So that is my advice: the only way to get good at programming is opening up a notepad window and an R window.
Type code in the notepad window and execute it in the R window; that is kind of the whole idea behind programming. Good, so back to some common idioms. One thing that very often goes wrong in R, because people are not aware of it, or they are aware of it but don't check: when converting from a factor variable to a numeric variable, always go via characters. If you don't, you run the risk of turning the number 5 into a different number. I gave a little example here: I create a factor from the values 5, 10, 11 and 5, but these are groups, not numeric values so to speak, which is why I put quotes around them to make sure they are character typed. Now if I convert them to a factor and just say as.numeric of my factor, it gives me back 3, 1, 2 and 3, which is not what people would expect; people would expect to get back 5, 10, 11 and 5. But that's not the case, because R stores factors internally as integers, which have no relationship to the number that was represented in the factor. So if you have a factor variable which is numerically coded and you want the original numbers back, like with the rank of the high school, make sure you always go via as.character: take the factor, transform it into characters, and then transform the characters into numeric values. This of course makes a big difference in an analysis; sometimes you do an analysis and the answers just don't seem right, they seem very wrong, and you have this feeling that this cannot be true. Almost always it is as.numeric on a factor that has mangled your data. So when you have a factor, always go to numeric values via characters.
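Here is that example as code; the codes 3, 1, 2, 3 come out because the levels sort alphabetically as character strings:

```r
f <- factor(c("5", "10", "11", "5"))
levels(f)                    # "10" "11" "5" -- sorted as character strings

as.numeric(f)                # 3 1 2 3 -- the internal integer codes, not what you meant
as.numeric(as.character(f))  # 5 10 11 5 -- always go via as.character
```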
Alright, more common idioms. One way to save yourself a lot of time is by pre-allocating lists. Imagine I have a big file and I'm going to compute 5 or 6 statistics for each line of it; for example a FASTA file where every line holds a genome sequence, and for each sequence I count the number of A's, C's, T's and G's, and I'm also going to compute their percentages. So I have multiple values per line of the file that I want to remember. You could just use the c function to combine the different objects you create: compute all of these things, put them in an object, and concatenate those objects together. But it is generally more efficient to pre-allocate a list if you know how many lines there are in the file. How do you do that? To create an empty list, you say vector of type list with five elements: this creates an empty list which already has space for five things that you can put in, be they matrices, vectors, or other lists. Pre-allocating is a very good idea, because it is generally much, much faster than using cbind or rbind or the c function itself, which may need to reallocate or move things around in your random access memory. By creating the list up front, saying: for every row in my matrix I am going to compute certain things, I know my matrix has 500 rows, so I make an empty list with 500 slots to put things in, R does not need to reallocate the list halfway through, and that saves on thrashing your random access memory. Next idiom: if I want to remove a single column from a matrix, I always use the following structure; I never say minus and then the number or index of the column, I remove columns by their name. I always try to work with matrices that have row names and column names, because I have been trained by experience: people will send me the same data set at least three or four times during a project.
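A minimal sketch of the pre-allocation idiom; the file reading is faked with rnorm here, so only the vector("list", n) part is the actual idiom:

```r
n <- 500                     # e.g. the number of lines in your file
results <- vector("list", n) # empty list with n slots, allocated once

for (i in seq_len(n)) {
  x <- rnorm(10)             # stand-in for parsing one line of the file
  results[[i]] <- c(mean = mean(x), sd = sd(x))  # several statistics per line
}

length(results)  # 500 -- no reallocation happened along the way
```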
And what I have observed is that I always get back a slightly different format: they will squeeze in a new column, because halfway through the experiment they noticed "we forgot to measure X, so let's start measuring X from now on", and then everything shifts. If my script removes the ninth column from the matrix, then in the new data set the ninth column is all of a sudden the eleventh column, and I have to go through my script and change every nine into an eleven. To prevent that, use the row names and the column names of the matrix. You can remove a single column by saying: which of the colnames of the matrix equals this column name, and then put a minus in front of it. This removes the column with that name, no matter whether it is the ninth column, the eleventh column or the two hundredth column; in this case, some column. Removing multiple columns is very similar, but instead of testing equality between the column names and a single name, I use the %in% operator: which column names of the matrix are in the vector of column names that I want to remove, and then the minus removes them. And you can use the same strategy for row names; often you want to throw out some individuals because measurements were not done properly, or there was some failure or another problem. You use the exact same structure, but with rownames instead of colnames: which row names of the matrix are located in the vector I want to remove, for example individual one, individual eleven and individual fifteen, and then the minus removes those rows from your matrix.
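The name-based removal idioms in one place; the matrix and its names are invented for illustration:

```r
m <- matrix(1:20, nrow = 4,
            dimnames = list(paste0("ind", 1:4), paste0("col", 1:5)))

# One column, by name -- survives the column changing position:
m1 <- m[, -which(colnames(m) == "col3")]

# Several columns at once, with %in%:
m2 <- m[, -which(colnames(m) %in% c("col2", "col5"))]

# Same structure for rows, e.g. dropping failed individuals:
m3 <- m[-which(rownames(m) %in% c("ind1", "ind4")), ]
```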
One thing that professors and journals will often complain about: you have this beautiful matrix, but they want the exact same number of digits for each of the values in a column. This especially happens when you are working with things like p-values, or with big numbers. Generally a value like 0.01355 will not be written in scientific notation by R; it will just print 0.01355. If you want to force R to use scientific notation, which is preferred over non-scientific notation, you can use the format function: you give it a number, or a vector of numbers, and say scientific = TRUE and digits = 3, and it will use three significant digits for everything. (I now see that the example on the slide actually only has two digits, but that's because I changed it.) You can also go the other way around: if I have a very small number like 0.01355 times 10 to the minus 5, I can say do not use scientific notation, give me all the zeros, and I get back a character string reading 0.00000014. And this is exactly what professors and journals ask for: a column of a matrix, or a column of a table in your paper, always formatted with the same number of digits behind the decimal point. Here I say digits = 2, which means the 135 automatically gets rounded to 14, because these are significant digits on a number smaller than 1. So use the format function to go from scientific notation to normal notation or the other way around, and you can round at the same time using the digits argument.
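The format calls spelled out; the small-number example matches the 135-to-14 rounding just described:

```r
x <- 0.01355

format(x, scientific = TRUE, digits = 3)   # forced scientific notation, 3 significant digits
format(x, scientific = FALSE, digits = 2)  # "0.014"

# A very small number written out with all its zeros; the 135 rounds to 14:
format(0.01355e-5, scientific = FALSE, digits = 2)  # "0.00000014"
```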
Something I use a lot is strsplit. I am someone who does not like working with grep or regular expressions or these kinds of things; I like splitting strings, I like the power it gives me. strsplit splits a string into individual elements using a splitting value. For example, here we have a vector with a couple of dates in it: "03-01-2020", "01-04-2020" and "01-20-2020". Read day-first these would look like the 3rd of January, but no, this is month-day, the American format, so they are actually the 1st of March, the 4th of January and the 20th of January. You can use strsplit to split this: every time it sees a dash it splits the string, and it returns a list. This list has length 3, because three elements were split, and every list element is a character vector of length 3: the first vector is "03" "01" "2020", the second is "01" "04" "2020", and so on. If I want to select part of this, I can use lapply. The selection operator, the square bracket that selects from a vector or from a matrix, can actually be applied as a function. So to get all of the years I say: lapply over the split values, apply "[", and select the third element; then of course I unlist it so it goes back into a vector (or you could use sapply instead of lapply). You can do the same thing to get the months in the first position: lapply the selection operator over the split values and take the first element of each. This is something I use a lot, and this lapply with the selection operator comes back again and again; that's why I think it's a common idiom that should be mentioned.
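The splitting-and-selecting idiom on those dates, as runnable code:

```r
dates <- c("03-01-2020", "01-04-2020", "01-20-2020")

parts <- strsplit(dates, "-")  # list of 3, each element a character vector of length 3
parts[[1]]                     # "03" "01" "2020"

# "[" is itself a function, so lapply can use it to select by position:
years  <- unlist(lapply(parts, "[", 3))  # "2020" "2020" "2020"
months <- unlist(lapply(parts, "[", 1))  # "03" "01" "01"
```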
We can also do partial string matching. We didn't talk a lot about grep and gsub and these kinds of functions, but I think it's important to realize that they are there. You can use grepl, that's grep logical, for partial string matching. For example, if I want to know every string in the vector created above that contains "2020", grepl returns a logical vector with TRUE for every element that matches, in this case the first and the third. If I want the indexes instead, I can wrap it in which: which of grepl "2020" in the vector, and it tells me 1 and 3. grep itself is actually defined more or less as which(grepl(...)): instead of typing which(grepl("2020", s)) I can just type grep("2020", s). So grep is the same as grepl, but it includes the which on the output. Next, a piece of code I wanted to give you guys, and an interesting one, because people always complain that it's wrong or imprecise, which is kind of true. But often in biology we are measuring animals or humans or plants, and there is always a big effect of when you measure: mice born in the summer tend to be a little lighter than mice born in the winter, and mice born in spring or in autumn have a slightly different average weight again. The problem is that if you use the month to correct for this effect, you are not estimating a single beta for month; month is a grouping factor, so what happens is that you lose 11 degrees of freedom by fitting month as a factor with twelve levels. So often you don't want to fit the month as a covariate; you actually want to fit the season, because that has only four levels, spring, summer, fall or winter, or if you are more or less near the equator only two, the rainy season and the dry season.
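Before we get to the season code, the grepl and grep idioms from a moment ago, spelled out:

```r
s <- c("03-01-2020", "01-04-2019", "01-20-2020")

grepl("2020", s)         # TRUE FALSE TRUE -- logical match per element
which(grepl("2020", s))  # 1 3 -- the matching indexes
grep("2020", s)          # 1 3 -- grep is essentially which(grepl(...))
```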
So instead of using a linear model saying that my phenotype is controlled by the month in which it was measured, you are almost always better off posing the model as: the phenotype I'm interested in is controlled by the season in which I measured it. And here is a very basic piece of code that takes dates, in month-day-year or day-month-year format, and transforms the whole vector into a vector of seasons; even if you had a million dates, it would return a million seasons, each season corresponding to the proper date. It has a pos parameter, which tells it which element of the date contains the month, because Europeans and Americans differ: we Europeans have the much more logical day-month-year, while Americans like month-day-year. So if the data comes from an American, pos is 1, because the month is in the first position, but with data generated by a European the month is generally in the second position, so pos would be 2. The function uses all of the things we just saw: it strsplits, then lapplys the selection operator to pick out the first or the second element, unlists, and converts to numeric; this is called months. Then I make an empty vector by repeating NA once for each of the months I extracted, and fill it in with this little piece of code: if the month is at least 3 and smaller than 6, the return value is spring, and otherwise summer, fall or winter in the same way, and then I return the whole vector. This is relatively efficient at taking a million dates and converting them to the corresponding spring, summer, fall and winter.
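A sketch of that month-to-season function; toSeason is my own name for it, and it uses whole months rather than the meteorological 21st, exactly as discussed:

```r
toSeason <- function(dates, split = "-", pos = 1) {
  # pos = 1 for American month-day-year, pos = 2 for European day-month-year
  months  <- as.numeric(unlist(lapply(strsplit(dates, split), "[", pos)))
  seasons <- rep(NA, length(months))          # pre-allocate the result
  seasons[months >= 3 & months < 6]  <- "spring"
  seasons[months >= 6 & months < 9]  <- "summer"
  seasons[months >= 9 & months < 12] <- "fall"
  seasons[months == 12 | months < 3] <- "winter"
  seasons
}

toSeason(c("03-01-2020", "07-15-2020", "12-24-2020"))  # "spring" "summer" "winter"
```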
The thing people always complain about is that you are just using the month, when you should actually be using day 21, because meteorological summer, meteorological winter and these kinds of things start around the 21st, so winter starts on the 21st of December and not on the 1st of January. But from my experience, for linear modeling this makes no difference: it makes no difference whether you use the accurate meteorological date or just map the month to the season. Here the season is simply: winter is the 12th, the 1st and the 2nd month, even though strictly speaking that is not entirely accurate until the 21st of the 3rd month. I never had any major differences between using this little piece of code and using an official translation, and the official translation is much harder, because then you have to account for leap years and other stuff, and it just becomes a pain to write that code. This is relatively simple code; just copy and paste it in and use it whenever you need to convert from months into seasons. If you really want to do it properly, R actually has a special data type for holding dates and times. You can use the as.Date function to transform a string into a date, and you can specify which format it is specified in. So here date1 and date2 are coded slightly differently: it's the same month-day order, the American format, but one uses slashes and the other uses dashes, so you just specify the format yourself. A capital %Y means the year is written as four digits, like 2020; a small %y would mean a two-digit year, like 20, without the century prefix. Once you have date objects you can actually do computation with them: if you have two date objects in R, you can just subtract one from the other, and it will give you the difference in days between the two. You can also use the seq function with dates, which is generally amazing.
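The date examples as code; the specific dates are my own, but the format strings follow what was just described:

```r
# Both American month-day-year, but with different separators:
date1 <- as.Date("01-04-2020", format = "%m-%d-%Y")  # January 4th
date2 <- as.Date("01/20/2020", format = "%m/%d/%Y")  # January 20th
# %Y matches a four-digit year like 2020; %y would match a two-digit 20

date2 - date1  # Time difference of 16 days

# A sequence of dates: today plus the next nine weeks, one step per week
seq(Sys.Date(), by = "week", length.out = 10)
```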
If you just say Sys.Date(), which gives you the current date, you can take a sequence from today, 10 steps out, stepping by a week each time, and it will give you the exact date for every week, accounting for leap years, and, when you work with date-times, even time zones and other weird things. So it really works well when you use dates as a date type in R; it accounts for a lot of these intricacies in dates and times. Another of my common idioms is gsub. gsub is very powerful: it is the global substitution tool. Say I have a vector with three sentences in it, and in every sentence I want to change the word fox into the word cat; then you can use gsub, global substitution, or grep substitution depending on how you want to see it, and every time it encounters the word fox it will change it into cat. Very, very easy, and substituting and replacing in text vectors is something that is also very powerful, especially when you're dealing with poorly formatted data; the gsub function can really save you. If you want a very good example of this, look at the fishy data lecture from last year, because there I used gsub to very quickly harmonize between different formats: they had measured fish in three different years, and the first year they just used the normal German name, the year after they used the German name plus the official Latin name, and then a combination of the two. To harmonize that data, a gsub saying "change this into that" is very, very good. It is also very good for fixing typos in your data set; it often occurs that people make typos, especially in big data sets, and then gsub can easily replace the word with the typo by the word without the typo, for the whole column at once, just to harmonize your column.
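gsub in action, including the typo-fixing use; the sentences and species names are invented for illustration:

```r
s <- c("the quick brown fox", "a fox in the snow", "no animals here")
gsub("fox", "cat", s)
# "the quick brown cat" "a cat in the snow" "no animals here"

# Fixing a typo in code, leaving the raw data file untouched:
species <- c("Perca fluviatilis", "Perca fluviatillis")  # second entry has a typo
gsub("fluviatillis", "fluviatilis", species)             # harmonized column
```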
Because you don't want to go into your raw data — raw data is read-only. You should never start fiddling with the raw data that you get, because the next time you get the same data set, you have to start fiddling all over again. Fix typos in code: just read in the data, and then start fixing the typos. If we have some time later on, I can show you what the problem was in the fishy data set and how I solved it using `gsub`. Typos occur a lot, so make sure that you fix them in code; don't fix them in the original data set, because then your data set is different from what your collaborators have, and when your collaborators send you an updated version of the data, you have to do it all over again. So be smart: use R to fix typos.

Ordering a matrix by a certain column — or multiple columns, for that matter: use the `order` function. The `order` function is very, very powerful. What I'm doing here is: I have my matrix and I want to order it by some column, so I say: order my matrix by that column. That gives me the ordering, and then I use the ordering as the index vector into my matrix and store the matrix again, but now sorted by this column. `order` has a parameter called `decreasing`; by default it is, I think, `FALSE`, so by default it sorts in an increasing fashion. Sometimes you want to sort the other way around; then you just add `decreasing = TRUE`, to sort from the highest to the lowest.
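The `order` idiom, sketched on a small example matrix (sorting by column 1 is an arbitrary choice):

```r
m <- matrix(c(3, 1, 2,
              30, 10, 20), nrow = 3)   # 3 rows, 2 columns
ordering <- order(m[, 1])              # the ordering by the first column
m <- m[ordering, ]                     # rows now sorted: 1, 2, 3
# and the other way around, from highest to lowest:
m[order(m[, 1], decreasing = TRUE), ]
```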
So, this is what I was talking about with pre-allocating. Pre-allocating is sometimes much more efficient. Here I'm not pre-allocating: I just have an object called results, I go through each of the rows of my matrix, calculate the mean and the standard deviation, and then `rbind` these two values onto the results object. This is relatively inefficient when you're doing it for 100,000 or a million rows. If your matrix has many, many rows, this will become very slow, because every time it does the `rbind`, it sometimes — not always, but very often — needs to reallocate more memory: it needs to move the matrix from one position in memory to another. If you know beforehand how many results you are going to get, it is much, much quicker to write code which pre-allocates.

Perhaps it's good to give a little example of this, because we didn't have any assignments, so I didn't get to program for an hour at the beginning of the lecture. Let's make a little example to time how much difference there is between pre-allocating and going through it in the naive way. I'm just going to open Notepad, make a little example on why you should pre-allocate, and save it as example.R — then I get code highlighting, which is really nice, so always save your file with the proper file extension. I'm going to make a big matrix: this matrix gets uniform random numbers, and it has 100,000 rows and 10 columns. If I go to R and look at the first 10 rows of big matrix, I see it has 10 columns. Imagine now that we want to do the thing we just did on the slide: I say results is initially empty; for x in 1 to the number of rows of big matrix, I calculate the mean of row x — this is my mean — and then I calculate my standard deviation with `sd`.
Then I say: my result is the combination of my mean and my standard deviation, and I `rbind` these onto results. So this is the naive version: we say that initially we don't know any of the means or deviations, and we just go row by row, compute them, and add them to a new matrix. If we are dealing with big data, then it's much better to pre-allocate. In that case I say: results is a matrix, initially filled with NA values. How many rows will it have? The same number of rows as big matrix, and it has two columns: one for the mean, one for the standard deviation. Then I do the same thing again, but now, instead of saying results is `rbind`, I say: results on row x is this combination of my mean and my standard deviation, like before.

So now we can time it. Let me see — `sys.time`, I think that's the function... no, that's not it — ah yes, `system.time`. So we can use `system.time` to time it: we say `system.time` and then the whole expression we want to time — first the first version. I think I'm allowed to just define a block, put this whole thing into the block, and give it an additional tab so that I know what I'm measuring; and I do the same thing for the second solution, and close the block and the `system.time` call. My expectation is that version number two is going to be much faster than version number one, because I pre-allocated the results: I'm only telling R once to allocate memory for me. Alright, let's see what kind of a difference this makes in R. Let me first run the first one. It will take some time, because it needs to compute all of the means and all of the standard deviations, and it goes through row by row by row, so it's a relatively expensive operation — but then we can see what the difference is.
So the first one is just running and timing, and we can wait for it — it's not that big of a deal, we have enough time to go through the whole thing. Alright, I'm just going to let it run; I might have given it a factor of 10 too many rows, because it does take up some time, but that's okay — the lecture is here until 5. Okay, so here we see that the first approach took 38.77 seconds of user time and 28.31 seconds of system time, and in total it took slightly over a minute: 1 minute and 8.59 seconds. Now let's take the second approach and see how long that takes. And you see the difference, right? It's the exact same code, we computed the exact same thing, but by pre-allocating my result matrix and not using `rbind`, the code now ran almost 30 times faster. 30 times! Imagine that you have an analysis which takes a week: doing it 30 times faster means that you can run it a couple of times per day, instead of every time having to wait a week. So this is one of the tips that I want to give you guys: pre-allocate the matrix where you are going to store your results beforehand. It's a massive, massive difference. (That's not the `Sys.time` I need — sorry, I was confusing them: `Sys.time` gives the current time of the system, and I need `system.time` to measure the speed.) But look at that difference: a 30 times speed-up. This would get you a promotion in many, many different jobs. If you are a programmer working for a company and you tell your boss "I made the code 30 times faster", then he will say: here's your promotion — you go from junior developer to senior developer, and here's 50,000 euros extra. That's how this works. So pre-allocate your results; don't start thrashing around in your RAM, because that's what's happening here.
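Reconstructed from the live demo above — a sketch of the two versions side by side. The row count here is reduced to 20,000 so it finishes quickly; exact timings will differ per machine:

```r
big.matrix <- matrix(runif(20000 * 10), nrow = 20000, ncol = 10)

# Naive: grow the result with rbind on every iteration
naive <- function(m) {
  results <- NULL
  for (x in 1:nrow(m)) {
    results <- rbind(results, c(mean(m[x, ]), sd(m[x, ])))
  }
  results
}

# Pre-allocated: create the full result matrix once, then fill it row by row
prealloc <- function(m) {
  results <- matrix(NA, nrow = nrow(m), ncol = 2)
  for (x in 1:nrow(m)) {
    results[x, ] <- c(mean(m[x, ]), sd(m[x, ]))
  }
  results
}

system.time(naive(big.matrix))     # slow: repeated reallocation
system.time(prealloc(big.matrix))  # much faster, same result
```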
Every time that you do an `rbind`, R needs to allocate more space to make the matrix one row bigger, and it does this 100,000 times. And you can see that most of the time is actually spent in system time: here the system took 28.31 seconds to do all of this allocation, while in the pre-allocated version it only took about a tenth of a second. So: 30 times speed-up. Pre-allocating is key; this is the difference between good programmers and naive programmers. Just a tip.

Good, more common idioms: forcing some columns of a matrix to be numeric. This happens a lot: you read your data in from a file, and R starts doing smart stuff, because R tries to be smart — it sees what look like numeric values, but it makes a factor out of them, because a factor uses less space. So how do we force columns of a matrix to be numeric? Here we force the first 5 columns of this matrix to be numeric: I `apply` over the columns of my matrix, columns 1 to 5, a function of x — this is just the basic structure that we always have when we use the `apply` function — and the function is `as.numeric(as.character(x))`. Think about the first common idiom: if you did not have the `as.character` in there, you would run into the issue that, if one of these columns happened to be coded as a factor, then `as.numeric` would eat up your numbers. So when you transform things into numeric values with `as.numeric`, always do `as.character` first: make sure it's a character first, and then transform it to a numeric.
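A sketch of why the `as.character` step matters when a column happens to be a factor:

```r
f <- factor(c("10", "20", "30"))
as.numeric(f)                  # 1 2 3   -- the factor level codes, not your numbers!
as.numeric(as.character(f))    # 10 20 30 -- what you actually wanted

# Forcing the first 5 columns of a matrix-like object m to numeric
# (m is assumed to exist; this mirrors the idiom on the slide):
# m[, 1:5] <- apply(m[, 1:5], 2, function(x) as.numeric(as.character(x)))
```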
If you are writing functions for other people, and you have, for example, a function with different methods — different ways of computing something; think about correlation: there is Spearman correlation, Pearson correlation, Kendall's tau, and other types — then you would say: I have a function on a data set, and the user can specify a method. You are only providing a certain number of methods, so method here is not a free choice, it is a limited choice: there are only five ways of computing correlation, or your function only supplies three of them. So you make a new variable called supported — these are the methods that are supported by your function — and of course you want to check whether the method that the user input is actually in your list of methods. For that you can use the `pmatch` function: partial match. In this case it's a little bit of a weird example, because the methods are not fully written out, but what `pmatch` does is match the thing that the user input against the supported list. If there's a partial match, it gives you back the index, so method is now a numeric value or NA: if you input "a" it returns 1, if you input "b" it returns 2, but if the result is NA, the user did not specify a valid method. This is one of the common ways to do this check — and you can actually attach the supported list to the method argument itself, so you can make it even a little bit better — but this is a way of making sure that people who use your code can specify their method, and also that they don't have to type the full thing.

Let me give you a slightly better example than the one on the slide. We have a function — I'm just going to call it `xcomp` — which takes x, and a method that the user can specify. Then we use the code that we had: supported is — and this needs to be a vector — "pearson" correlation, "spearman" correlation and "kendall" correlation (Kendall, I think it's with double l). And now we can say: method is `pmatch` of the method the user chose against the supported methods.
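Assembled, the live example might look like this (the function name `xcomp` follows the demo; x is unused because no actual computation is done):

```r
xcomp <- function(x, method = "pearson") {
  supported <- c("pearson", "spearman", "kendall")
  method <- pmatch(method, supported)    # index into supported, or NA
  if (is.na(method)) stop("method not supported")
  method                                 # return 1, 2 or 3
}

xcomp(1, "pear")  # 1  (partial match to "pearson")
xcomp(1, "p")     # 1
xcomp(1, "ken")   # 3
```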
Then, if method is NA, we want to error: we `stop` with "method not supported". And if the method is supported, we just return method, to make sure that we return a value from our function. The nice thing is that I can now call `xcomp` on x — which doesn't matter, because I'm not doing any computation anyway — and say "p", and this will match to Pearson. I can also say "pear", and I could say "ken". So it does not force the user to put in the whole method name, because it's a partial match. Let's see if this works as intended: when I input "pear", it knows the first method was selected; when I input "p", the first method was selected; and when I input "ken", the third method was selected — because of the partial match, you can have longer names for the supported methods. You could try to make this still a little bit shorter by leveraging default parameter values — method is c("pearson", "spearman", "kendall") — and then do the `pmatch`... no, I can't — no, no, that's not possible like that. Anyway, this is an approach we can use where the user is not forced to type Kendall or Spearman in full: you can just use "s" for Spearman. Very useful when you are writing code which is used by other people, because you allow the user to specify the method with fewer characters than they would normally need.

Good. We already saw the `invisible` thing, which went horribly wrong when we did recursion a lecture or two ago — I totally messed up — so I made a slide about `invisible`. When writing a function, you can use `invisible` to return results without having them roll down the screen. I made it like this: example is a function of x, and I just repeat this thing one million times. By putting `invisible` around it, I don't get a million ones running across my screen, but I can still get the value: I can still store it in a variable and then show the variable.
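The `invisible` pattern just described, as a minimal sketch:

```r
example <- function(x) {
  invisible(rep(1, 1000000))  # a million ones, but nothing scrolls by
}

example(1)        # called bare: prints nothing at all
v <- example(1)   # the value is still there when you assign it
length(v)         # 1000000
```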
So it's just that when a user of your code accidentally calls a function but forgets to store the result in a variable, they don't get bothered by having to wait five minutes while all of the ones — or whatever numbers you generated — scroll across the screen. The `invisible` keyword is very useful for hiding these user mistakes: generally a user wants to assign the output back into a variable, but if they just forget, you don't want them to have to wait a long time for all of the results to scroll by.

Good. Do we want to do scripting from the command line now, or do we want a break first? I think the pipelines part is a little bit separate, and we still have the R package creation, so let's do a break first — five to ten minutes, and then we'll be back. The second break is GIFs of dogs, I think; I hope we didn't have dogs yet. The song you guys are going to listen to is called Breathe — or Breath, depending on how you pronounce it in English — and I will be back in five to ten minutes, depending on the length of the song. Enjoy the animated GIFs; I cut out my mic too early, so see you soon.

That super sleepy music was my bad. So, we had some discussion in chat about how to accurately measure time, and R can actually measure time pretty accurately — I think with millisecond or even better precision. It's just that there are several time functions that can do this. The one that we used — let me open up Notepad again — was `system.time`, but there are a couple more that you can use.
`proc.time` is the one with the highest accuracy, and it measures CPU time taken — not time in the sense of how many milliseconds something takes, but time for the CPU. Because sometimes you wait for an hour, but the time that R actually uses is less than that; or, if you're doing something multicore, you might wait 15 seconds while the CPU actually spends a minute, because you have four cores running the analysis. In the end you sometimes only want to know how much time the CPU spent, not how long you had to wait, and there's `proc.time` for that. (To the chat question: yes, supported still needs to be a vector, without quotation marks around the call — you figured it out, good.) So normally I measure time using the `system.time` command, but the same thing would be to say: s is `proc.time()` — this is my start time — then you do something which takes some time, like this one, which is relatively fast, and then, if you want the difference, you say `proc.time() - s`: the current time minus the start. That gives you the time the process took; in this case it should be very similar to before, 4.3 seconds — because I didn't paste in the enter; if I paste in the enter, it runs directly. And generally you're only interested in the elapsed time — so again 2.8 seconds. The system time and the user time have to do with whether the CPU is in user mode or in system mode, so whether it is doing operating-system calls.

Then there was the question about a data logger: when the time is, say, 12 o'clock, it should log all sensor data — so you just want to get the current time. I think that's... no, `date()` is for the date; you can use `Sys.time`, with a capital S. This also includes time zones, and it deals with time zones and those kinds of things.
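A sketch of the `proc.time` idiom just discussed (the `Sys.sleep` stands in for "something that takes some time"):

```r
s <- proc.time()          # start time
Sys.sleep(0.5)            # ... do something that takes a while ...
proc.time() - s           # user, system and elapsed time since s

Sys.time()                # the current date-time, time-zone aware
```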
If you want to do something like that: we're not interested in the day, we're not interested in the month; we just want the time, which is the format code `%X`, like this — and then it tells you that it's 4 p.m. and 34 seconds. But this doesn't measure milliseconds precisely. (And for `system.time`: no capital — you'll figure it out.)

Anyway, let's start with scripting from the command line, because we still have about an hour left and I did want to talk about the R package. First: R is a really good language to build pipelines in, because R makes it really easy to build pipelines where you do several steps, and each of the steps depends on the previous one. For this you need two things, because you need to be able to execute R scripts which take something from the command line as input. I have some examples on the slide. You can use external servers and clusters via SSH: if you want to run R on an external server, you don't have an R window like this, you just have a terminal login. So you SSH to the machine where you want to execute R, and then you use the `Rscript` command to execute a script, and you give it some input parameters — for example, the file that it should be working on. So you write a script that encapsulates the whole analysis, and this script reads command-line parameters, for example to change the input or some of the script's settings. But when you are writing scripts that run on a server, or that you're going to submit to a cluster, you have to be aware that every path through the script needs to end in you quitting the R session. If you don't, R will just continue waiting and waiting until someone gives it the quit command. When you run R under nohup on a server, for example, and there is no input left to process, the R script needs to quit.
Normally this is not an issue, because you just close the R window, it asks "do you want to save?", you say no, and everything is fine. But on a server or a cluster you submit a job, and then you don't have any interactivity anymore: the scheduler takes your job and at a certain point starts running your analysis, and when your analysis is done, you don't want the R process to stay there — you want it to exit. So make sure that every path through the script ends in a quit. And never save the workspace — that's one of the things we talked about in one of the first lectures: never, ever save your R workspace, because the next time you start R, it will load the workspace you saved last time, which sometimes is what you want, but 99 out of 100 times is not.

For example, if you want a little script which reads command-line arguments, this is more or less how you do it in R. There is a function called `commandArgs`; with `trailingOnly = TRUE` you won't get the name of the program, and you store the result in a variable yourself. Say you want a script which takes exactly one parameter as input on the command line: if there are fewer than one, stop — and then we need to quit. Always quit the script, otherwise it will just show the error and hang there, because it does not know that it also needs to quit afterwards. Then we take the command-line argument, convert it to a character, print the argument to the screen, and quit at the end. Why are we doing all of this? Well, generally you want to write little scripts: a script takes a file name as input, does something with this file, saves the output, and quits. And then you don't have one file — you might have 10,000 or 100,000 files, and you want to do the same thing for each of them. So you need to be able to specify the name of the file.
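A sketch of such a one-argument script and how to call it. This assumes `Rscript` is on your PATH; the script is written to a file here so the round trip can actually be run from within R:

```r
# Write a tiny one-argument script to a file
script <- c(
  'args <- commandArgs(trailingOnly = TRUE)   # everything after the script name',
  'if (length(args) < 1) {',
  '  cat("usage: Rscript myscript.R <filename>\\n")',
  '  q(save = "no")                           # always quit, never save the workspace',
  '}',
  'cat("working on:", as.character(args[1]), "\\n")',
  'q(save = "no")                             # every path ends in a quit'
)
writeLines(script, "myscript.R")

# From a terminal you would now run:  Rscript myscript.R something
output <- system("Rscript myscript.R something", intern = TRUE)
output   # "working on: something"
```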
And the nice thing is: you can use R to call other R scripts — it's like Inception, where you have scripts calling other scripts, or calling other external programs, waiting until they finish, and then doing something with the combined output. Getting the command-line arguments is done with the `commandArgs` function, and generally you only want the trailing ones: the trailing ones means everything after the name of your script. So you open up a terminal, you move to where you stored your script using the `cd` command, and then you execute your script by saying `Rscript myscript.R` plus an argument. That's what I did with the little script here: I typed it in, I went to where I had saved it — `cd` to my D drive, GitHub — and then I say `Rscript myscript.R something`. Now "something" is the thing that I input to the script via the command line; I could just as well have typed "anything". And `trailingOnly` means that I only get the "something" and not the "myscript.R", because generally the name of the script is the first command-line parameter.

So why is this so useful? Well, you can combine this with executing external programs and build pipelines — pipelines composed of many different programs. Think about the analysis of next-generation sequencing data: there are a lot of steps we need to go through. We get reads which come out of a sequencer. The first step is to cut off the trailing ends of the reads, which have very bad quality — the more base pairs a sequencer reads, the worse the quality becomes — and we want a cut-off saying: when you are this uncertain, we don't want the rest of the letters, because we don't trust them.
After you've trimmed the reads, the next step is to align them to a reference genome. After aligning them to the reference you need an indexing step, then you want to get some statistics, and after that you want to call, for example, SNPs on your sequencing data. All of these steps can be implemented in a single R script which calls out to external programs to do the individual analyses; R is just there to handle the input and the output. And of course you can imagine: I write a script for sequencing which works on a single input file, but if I send a hundred animals for sequencing, I need to do the exact same thing for all hundred animals. So I have one script which takes the name of the file as input, runs all of the steps and writes the output file, and then I just use my command window and say `Rscript myscript file1`, then `Rscript myscript file2`, `Rscript myscript file3` — or I could actually write an R script that generates all hundred commands for me, because I don't want to type the same command a hundred times. There's often a need to analyze many files in a similar way — like doing image processing of all images in a folder. If I have an image which is a photograph of cells, and I want to count how many cells there are, I can use some software to do that; but I generally don't have one photo from a microscope, I have hundreds of photos over time, and I want the counts not for a single image but for all of them. So I write a script which takes as input the photo it needs to analyze, and then I write another script which generates those 100 or 200 commands for me. And this matters because many analyses are not available in R, but only as command-line tools.
Think about BLAST, where you take a sequence and find compatible sequences in a reference; the Variant Effect Predictor, where you take SNPs and annotate their effects; or ImageJ, which is an image manipulation program. There are many, many other tools which are not available in R — but you can drive all of these tools from R. So how do you execute an external program? You write your own execute function, using the `system` function in R. `system` just does a system call: it takes as input a command which you want to execute. Let's do an example: I can do `system` with the `dir` command — Windows understands this command — and it gives me all of the files in my directory. So this is executing the external command and getting the output back. I can also store the output; then I have to say `intern = TRUE`, I think — and now I have a list of all of the files. I could have used the `dir()` function that is built into R as well, but I can also use the command provided by the operating system. And I could run a different command, for example `blastn`. I don't know if I have `blastn` installed on this computer, but if I had, then something like this would call BLAST — and indeed, I have it installed, and it gives me an error saying I did not provide a database or sequences, because I didn't. But the point is: I can just give `system` a string, it hands this string to the operating system, and the operating system executes it. There are a couple of parameters involved, and that's why I generally write my own little execute function, which looks like this.
It takes x, the command that I want to execute, with `intern = FALSE` and `quitOnError = TRUE` as defaults, because I want to quit on an error. Generally what I do is `cat`: I make sure my little execute function always prints four dashes and then the command it's going to execute. It then executes the command using the specified `intern` parameter, and if it is not internal, I `cat` the result. `result[1]` will be the exit code of the program — a program can fail or succeed, and that is specified by its exit code: if the program fails, the exit code is generally larger than zero; if it succeeds, the exit code is zero. So: if it's not an internal command, and `result[1]` is not zero, and `quitOnError` is `TRUE`, then I quit. It's just something that makes sure that every time I call an external program and that program generates an error, my script stops. And then I return the result invisibly — but I could also use the result directly. For example: execute the `dir` command internally, store the result in `aa`, and then, for each file in the directory — `for (x in aa)` — call the little R script that we just made. So I execute a string that I paste together: "Rscript", the name of my script, and then the file name. Now the little script that we wrote gets, every time, the file name of the next file in the directory — and I could actually run this in parallel as well, or call our own script with any other argument that we specify. This is very useful, so let me show you an example from my own sequencing run, if it's still on my hard drive, just so you can see how nicely R can be used to create pipelines — because I think it's something that R is really, really good at, and I don't have the feeling that I'm explaining it very well.
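So, as a sketch: an execute wrapper along the lines just described might look like this. The parameter names follow the slide; the `echo` demo and the commented loop are illustrative, and the loop assumes the one-argument myscript.R from before exists:

```r
execute <- function(x, intern = FALSE, quitOnError = TRUE) {
  cat("----", x, "\n")                   # show the command before running it
  res <- system(x, intern = intern)
  if (!intern) {
    cat("exit code:", res[1], "\n")
    if (res[1] != 0 && quitOnError) q(save = "no")  # a failed tool stops everything
  }
  invisible(res)
}

out <- execute("echo hello", intern = TRUE)   # captured output: "hello"

# Usage sketch: call our one-argument script once per file in the directory
# aa <- dir()
# for (x in aa) execute(paste("Rscript myscript.R", x))
```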
So let me see: then we want to go to — how Berlin became the mouse — then we want to go to sequencing, and then we just want to get the pipeline. Alright — no, no, Notepad++ — that should be it. So here you see my pipeline: this is the pipeline that I wrote for sequencing. You can see that it starts by reading the command-line arguments, and from the command-line arguments it takes the sequence ID and the sample ID — and in this case we're doing paired-end sequencing, so every sequencing run gives two files: the left reads and the right reads. Then I make sure I have an output folder; I have a folder where the reference genome is located, a direct path to the reference file, a log file that will contain all of the steps of my analysis, and an output base. Because I'm going to generate a whole bunch of files along the way, I say: I have my sample ID, and in the output folder I make files with the sample ID as the base, which then get certain extensions depending on where I am in the pipeline. The first thing the script does is `cat` a whole bunch of information to the screen, so that I know everything is properly read in — the sequence ID is good, the sample ID, the reads are fine — and this also goes into the log file. Here I have my little execute function, which is more or less the function I always use — slightly simpler than the one I showed you on the slide. Then the first step is to call Trimmomatic: a little piece of code which is there to remove the adapters and then trim the reads based on quality scores, and this step of the pipeline takes 1 to 2 hours depending on the size of the input file. And what does it do? This whole thing does nothing more than generate a command.
If you have ever worked with sequencing data, then you know that the commands you type on Linux are long — literally hundreds and hundreds of characters — and it's a single command: you execute it, it starts running, and in the end it generates an output file. So, next step: I execute this command, and then I align my reads using BWA, the Burrows-Wheeler Aligner, which is one of many aligners you can choose. I have my command here, and what I'm actually doing is piping several commands into a single command, so that it does all three steps in one go — again, it just generates a very long command for the operating system and executes it. Then the next steps: adding read groups; then moving the file; then indexing the file; then marking the duplicates; then indexing the BAM file; then base recalibration — this is where GATK comes in — some more base recalibration, applying the scores, analyzing the covariates; then indexing the file again; and finally getting some basic statistics. And then the script is done. So this is a single pipeline: it takes input parameters to specify which sample I want to align and where I want to store the output, and it also takes some of the parameters for the sequencing. It's a relatively long script — 145 lines of code — and it calls around 10 external tools in order, only calling the next tool when the previous step finished successfully, because a tool can fail for all kinds of reasons, and then the script just stops executing at that point. The nice thing is that it also checks whether the output files are already there, and skips that step if so. That's how you can use R to build pipelines: by running all of these little external tools in order, in sequence.
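A heavily stripped-down skeleton of such a pipeline script — the tool invocations are placeholders, not the real Trimmomatic/BWA/GATK command lines, and the file names are illustrative:

```r
# pipeline.R -- align one sample; run as: Rscript pipeline.R <seqID> <sampleID>
args     <- commandArgs(trailingOnly = TRUE)
seqID    <- args[1]
sampleID <- args[2]
outbase  <- file.path("output", sampleID)

execute <- function(x) {               # simplified execute, as above
  cat("----", x, "\n")
  if (system(x) != 0) q(save = "no")   # a failed step stops the whole pipeline
}

# Each step only runs if its output is not there yet
if (!file.exists(paste0(outbase, ".trimmed.fastq")))
  execute(paste("trimmomatic ...", sampleID))              # trim adapters / quality
if (!file.exists(paste0(outbase, ".bam")))
  execute(paste("bwa mem ...", sampleID))                  # align to the reference
if (!file.exists(paste0(outbase, ".bam.bai")))
  execute(paste("samtools index", paste0(outbase, ".bam")))
# ... mark duplicates, base recalibration (GATK), statistics ...
q(save = "no")                         # every path ends in a quit
```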
tools in order, in sequence of each other.

Good. So then this is more or less the end of the lecture. I told you guys about generalized linear models: generalized linear models are what we use when the response variable is not a continuous variable, so not a normal distribution, but count data or percentages instead of a continuous variable. I told you about the long versus the wide format. I told you about some common idioms, like how do we split a string. I showed you that you can make your code run 30 times faster by pre-allocating your result matrix: instead of row-binding 100,000 times, it is much better to generate an empty matrix up front and then put the data in at the positions where it needs to be. I also showed you how to execute scripts from the command line using the Rscript command: give it the name of the script, and then you can provide additional input parameters to tell the script where to load its data file, which image it should analyze, or which sequencing file it should start aligning. And I also told you that you can use R to write pipelines which execute external programs like BLAST or VEP or ImageJ, or whatever program you can install on Windows or on Linux; you can execute such a program on the command line via R as well, and you can also get the output from that program back into R by using intern = TRUE versus intern = FALSE.

Good. So I have prepared a whole presentation as well about how to build an R package, and I think that I'm not going to do that today. I've already been talking for almost two and a half hours, which is more than the allocated time, but it is relatively warm, it's relatively humid, I'm sweating, and I think many people have already fallen asleep, which is perfectly fine. But for today I'm done, and the how-to-build-an-R-package part we will probably do next week from Rotterdam, so if I have good internet on the... see, that's what you get when you stream for two and a half hours. So when I
have enough internet next week at the conference, then I will show you guys how to build an R package, and otherwise we will do that the week after.

Unfortunately, I haven't gotten any requests from viewers or students yet to analyze one of their data sets. So again, I want to make a call and say to you guys: if you have a data set and you want me to look at it and make a lecture about it, let me know as soon as possible. Last year we did the fishy data analysis, which was data from a Bachelor's project, and that was really fun. So if you have a data set and you want me to look at it, or you want me to make a little presentation about it, then let me know as soon as possible, because then I can still spend some time on making a really beautiful presentation for you guys. Otherwise I will pick, and if I am going to pick, then it's going to be on the exam. And if you guys want to know very specific things, like, say, you want to know about interwoven loop designs in microarrays, or you are doing mathematics and want to know how to implement mathematical functions in R because you're not good at translating math into R code, or you are from political science and you want to analyze voting behavior, or whatever; I don't care. If you have a really interesting data set or something that interests you, then feel free to send it to me, and then I can make a short lecture out of it. It will probably be like a one-and-a-half-hour lecture, and we can do that after I get back from Rotterdam.

Good, so that was it for today. This is not... this is what we are going to do next week, so I'm just going to keep that, and I'm just going to very quickly scroll to the end, because that's where the question slide is. That was too fast. Alright, so here we are: the question slide. So if there are any questions, please let me know, and feel free to send me an email if you have a data set that you want me to look at and tell you some things about how I would analyze it; that would be awesome. So, for me, for now: thank
you guys for joining, thank you for watching, for the chats, for the compliments, and thank you for the questions. And thank you for the compliment about my spoken English and the way that I deliver information; really, thanks, that just made my day. And yeah, for me, that's it. So, see you guys next time. Next time we will be at the conference, and I will tell you guys how to build an R package. That will take like an hour and a half, so it will be a very short lecture. Then the week after, we will have a data set lecture where I'm analyzing one of your data sets or one of your suggestions, and then we will have the overview, then we will have one or two weeks off, and then it will be the exam. So I hope everyone is able to register for the exam, and if not, then I will have to start calling the Prüfungsbüro anyway; I'm going to put it on my agenda to call the Prüfungsbüro as well and directly ask why some people are not able to register.

Good. But for me, for this time: thank you guys for watching, it's always fun streaming. And please feel free to like this stream, it really helps out a lot, and of course subscribe to the channel if you're not yet subscribed. I will see you guys next time. So thank you for your time, and see you next week, same... well, not same place, but same time. Alright.
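The pipeline idiom described earlier, run each external tool in order, skip a step whose output file already exists, and stop the whole pipeline as soon as one tool fails, can be sketched in a few lines of R. This is a minimal illustration, not the actual 145-line alignment script from the lecture; the helper name `run_step` and the tool invocations in the commented usage are placeholders.

```r
# Minimal sketch of the pipeline idiom: run external tools in sequence,
# skip a step whose output file already exists, and abort the whole
# pipeline as soon as one tool exits with a non-zero status.
run_step <- function(cmd, args, output) {
  if (file.exists(output)) {
    message("Skipping '", cmd, "': output already exists: ", output)
    return(invisible(TRUE))
  }
  status <- system2(cmd, args)  # exit status of the external tool
  if (status != 0) {
    stop("Step '", cmd, "' failed with exit status ", status)
  }
  invisible(TRUE)
}

# Hypothetical usage (placeholder tool names and paths, not the
# lecture's real commands):
# run_step("bwa",      c("mem", "ref.fa", "s_R1.fq", "s_R2.fq"), "s.sam")
# run_step("samtools", c("sort", "-o", "s.bam", "s.sam"),        "s.bam")
# run_step("samtools", c("index", "s.bam"),                      "s.bam.bai")
```

Note that `system2(cmd, args)` returns the tool's exit status, which is exactly what the stop-on-failure check needs; with the older `system()` call, the equivalent is `intern = FALSE` (exit status back), while `intern = TRUE` returns the tool's printed output as a character vector instead.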
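And the pre-allocation idiom from the recap, filling a pre-allocated matrix instead of calling rbind() inside a loop, can be sketched like this with made-up toy data (the roughly 30-fold speedup quoted in the lecture will vary with the number of rows and the machine):

```r
# Slow: rbind() inside the loop copies the whole accumulated matrix
# on every iteration, so total cost grows quadratically with n.
grow_with_rbind <- function(n) {
  res <- NULL
  for (i in 1:n) res <- rbind(res, c(i, i^2))
  res
}

# Fast: allocate the full result once, then write each row by index.
fill_preallocated <- function(n) {
  res <- matrix(NA_real_, nrow = n, ncol = 2)
  for (i in 1:n) res[i, ] <- c(i, i^2)
  res
}

# Both produce the same values; only the running time differs:
all(grow_with_rbind(1000) == fill_preallocated(1000))  # TRUE
# system.time() on a larger n shows the pre-allocated loop is far faster.
```

The point is that the pre-allocated version writes each row in place, while the rbind() version reallocates and copies everything it has built so far on every single iteration.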