I think we should be live, at least I hope we are. A lot of setup again for today, but I think everything should be good to go. I hope people are here, I hope at least someone's here, so if you can hear me, please throw a message in chat, then I know that I am audible. Let me check everything. Okay, stream health seems to be good, although I'm getting the audio bitrate errors again. Or it's not so much an error, YouTube says it's a suggestion, but it doesn't seem to be a suggestion. So if you can hear me, then hello, hello, hello. Very good, my moderator says that she can hear me, so that's perfectly fine. So I am audible, right? Very good. So then let's see, can you see me as well? "Loud and clear, Danny." Thank you, thank you. Cheers.

We still have two to three minutes before we start, so we can just wait a little bit until all of the students arrive and everyone's here. What a busy day, but I'm really looking forward to spending some time with you guys. Very good, and visible as well, that's perfect. How do you guys like the drawing? I spent a lot of time coming up with something that would represent data, so in the end I figured that it would be a truck carrying data. I see people entering the Zoom meeting, which is fine as well, since not everyone has a proper YouTube account.

All right, so four people. That's not the amount of people that I am expecting for the lecture, but of course it's being recorded, so people can always watch it back later. And I think the nice thing about watching it back later is that you can do it at 200% speed, which means that you only spend half of the time, but I sound like a Smurf. That's one of the drawbacks of 200% speed. I never realized that before: because I'm recording the lectures and putting them on YouTube, people can more or less do what they want, right? So 200% speed is a possibility. Anyway, are we excited about data? Is there anything that you guys really want to learn
about data? Because I've prepared a whole bunch of slides, but I'm always open for questions and suggestions. Questions and suggestions are highly, highly appreciated, because I'm not a novice, right? I've been programming all my life, since I was four years old, and it's for you guys to learn. So sometimes there are really tiny, really obvious things that I take for granted which you guys are struggling with, and then let me know, because one of the things that is really hard for me to judge is what is hard for you guys. So I hope that people become a little bit more vocal. I sent around an email this morning to all of the students saying that, well, I haven't had any questions, so I just assumed that everyone was able to do the assignments. And then I got a couple of mails back from people that said, "Oh no, I got stuck here on the assignment," which is perfectly fine. But don't get stuck for a whole week. Get stuck for 30 minutes and then just jump in your email and shoot me a question. I generally answer back very quickly. And don't feel like you're bothering me or something like that, that's not the case. I'm here to help you guys, that's the reason why we're doing this, and in the end it's for you guys to learn how to program. So it's better to ask questions now than to not ask questions.

All right, so it's two, at least on my clock. Eight out of 37 students. That's one of the reasons why I want to do it in person, because that means that I can just check attendance and see who's there. It's one of the advantages of doing it in person.

All right, let's do the first slide: assignments from last week. Let me know what you thought about the assignments. Were there too many, were there too few, were they too hard, were they too easy? Because for me they're all really easy, because like I said, I've been programming for a long time, so I have a hard time judging what you guys find difficult. But
let's just look at my answers for lecture number two. And for this I actually wanted to ask you guys a question: how do you guys want to do it? Because last time I just showed you my answers and we went through them one by one, but we can do it the other way around as well. I can just do them live, so I can close my answers and then go through the questions one by one.

"A little too many for my taste." Okay, that's good feedback, and I know that there are too many in lecture number two, but it's okay, I think, because you don't have to do them all. The most important ones are on the top, and the other ones are more to have you guys apply what you learned. But I can drop some and move some to lecture number three. Let me actually look at the assignments for today, because today we also have a bunch of assignments, although I think they're a little bit fewer than in lecture number two.

Good. "I would love to do them live, but I understand if it's lame for everyone else." Well, I don't know, we can do them live. Okay, so on Zoom we got a question: "If it is possible, I would ask to spend more time on assignments four, seven and eight, actually more on four and eight. The seventh I actually could solve, but I tried to flip the coin n times and I could not get the loop really going."

All right, so I think that it would be good that I just get an empty Notepad and then we do them one by one, because we have some time, and I think I only have 40-ish slides for the lecture today. And the lecture today is going to be really about programming. The first two lectures, there's a lot of theory in there, and I know that, but I do the theory up front so that when we need it you guys already know it and I don't have to spend the whole time explaining the theory.

Good, so let's switch to Notepad. So, assignments. First things first, I'm just going to give it a header, and I'm going to save it so that I get answers
live 3 dot R, right? If I give it the .R extension, then code highlighting kicks in. So I can say "answers to assignment three", and then it is copyrighted by Danny Arends, well, which is not really true, because it's actually the HU Berlin, since they are paying me for it. Generally I first have a set working directory, but I don't think that we need it yet, because the assignments from the first two lectures don't use any data, right? We're just generating random numbers and writing some for loops and while loops, so it should be fine to do without setting your working directory. But I'm going to put it in just because we can, and I'm going to comment it out, so it's there in case we need it.

All right, so let me get the assignments then. The first assignments were about control structures. So the first question, 1a, is: generate a random number between zero and one using the runif function and store this value in a variable with the name unknown. All right, so we need to define a variable called unknown, so we're going to do that, and then we're going to use the runif function to generate one random number between zero and one. So it's: how many I want, minimum value, maximum value. And of course in R we can also check the function. If you don't know exactly how the runif function works, just do question mark runif and look at the help file, because the help file will give you an example and it will explain which parameters are there. So, variable unknown, between zero and one: question 1a solved. Let me add hashtag 1a, and that's it.

Of course we can do it in another way as well. We can just type runif(1), which will do the same thing, because the default values are to generate a number between zero and one, and then we can store it in unknown like this. So we can use the forward arrow to assign to the right side, and we can use the backward arrow to assign to the left side. All right, so
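A minimal sketch of what question 1a comes down to, written out for readers following along. The variable name unknown is the one from the assignment; the exact lines typed on screen are not in the transcript, so this is an assumed reconstruction rather than the lecturer's file.

```r
# 1a: one uniform random number between 0 and 1, stored in 'unknown'
# Arguments: how many numbers, minimum value, maximum value
unknown <- runif(1, min = 0, max = 1)

# Same result type, relying on the defaults (min = 0, max = 1)
# and assigning to the right with the forward arrow
runif(1) -> unknown

# The help page lists the parameters and gives examples:
# ?runif
```

Both forms leave a single number in unknown; which arrow to use is a matter of taste.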
next question, question 1b. Use an if statement and then use the cat or print function to print out either "lower" or "higher", if the variable is smaller than 0.5 or bigger than 0.5. All right, so let's do that. I'm just going to say: if unknown, because that's what I want to check, is smaller than 0.5, what do I want to do? I want to cat, so I want to write something to the screen, and I'm going to say "is lower than 0.5", backslash n for the new line. Because I'm using the cat function I have to include the new line character, otherwise R just continues on the same line. And I generally like knowing what the value was, so I'm going to say paste, and then paste the unknown value that we have and combine that with "is lower than 0.5". I'm going to check my brackets using the highlighting: this is closing the paste, this is closing the cat. And then of course in the if statement I want to use curly brackets to denote where it starts and ends. Then I want to do an else branch, and there I want to print the other way around, so "higher than 0.5", and close the bracket.

So let's save it, quickly go to R and make sure that everything works. I'm just going to copy-paste my whole code in, and indeed it says 0.81 is higher than 0.5, which is true. So it seems to work, but of course I need to test both branches, because testing one branch is not going to be enough: I don't know for sure that the lower-than statement works. So I'm just going to run the same code a couple of times to see. Yeah, so we're already there: the lower branch also works, because 0.17 is indeed lower.

All right, so 1c: generate a uniform value between minus 10 and 30, round this to zero digits behind the comma, and using an if statement check if the value you generated is between 0 and 10 inclusively. If the value is not in this range, throw a stop error. All right, so let's go back to Notepad. Again we want to use the uniform function, so I'm going
to say hashtag 1c. What I'm going to do is still use the unknown variable, because why not, we already have it defined. So I'm going to say runif, one number, between minus 10 and 30, and I want to round it, so I'm just going to use the round function and say comma 0, so no digits behind the comma. Let's go to R and check if it works, because we do want to make sure that it works. So now the value of unknown is 28, and now the value of unknown is minus 5, so it seems to work. It seems to be within the range that I want.

All right, so now we have to check if it is between 0 and 10, so let's go back to Notepad and start coding. I'm now going to say if, and then unknown is between 0 and 10, so it's larger than or equal to... my keyboard layout switched again to the stupid German layout. English layout. Unknown larger than or equal to 0, and then "and", because both of them need to be true at the same time, and unknown is smaller than or equal to 10. Then I want to do something. Well, only if it's not the case do I want to throw a stop error, so I could just do nothing, but I'm going to say cat "between 0 and 10", and don't forget the new line. And else I need to throw a stop error, so I'm just going to say stop, "unknown is not in the range we want it to be", backslash n. And I'm going to paste, because I still want to know what value I drew, so I'm going to say "unknown" and then unknown, so I'm just going to print the number. And then of course we need to test it a couple of times to make sure that we hit both the if branch as well as the else branch, to make sure that both of these work.

All right, so let's go to R, and in R we're just going to generate it. So: error, unknown 23 is not in the range that we want it to be, which is true. 15 is also not, minus one is also not, so we're getting relatively unlucky with the numbers that we draw. And in this case we see "between 0 and 10". I forgot to actually print
back the number itself, so let's see what we actually drew. We drew a one, so it seems to work, right: between 0 and 10.

All right, next question, hashtag 2a. Question 2a: use a for loop and sum up all the numbers from one to a thousand inclusively. You can check if your answer is correct by comparing the result to the result of sum(1:1000). So that's what we want; let's go back to Notepad. I'm going to say I have a forsum. This will be my total, this will store the result of all of the summation. Initially I haven't added up any numbers, so that means the number will be 0. And then I'm going to say for x in one... that's a bracket too much... for x in one to a thousand, what am I going to do? Well, I'm going to say forsum is forsum plus x. And then after we're done we're going to check forsum itself, and we're also going to check the sum of one to a thousand, because these two should be equal. So very basically, I'm defining the variable which will store the total, so the answer, and then I'm going to loop a thousand times. Every time, x will get the next value in this range, so first one, then two, then three, and I'm just going to add it, and then I'm going to print it to the screen. So let's go to R and see if this works. Seems to work: if I add up all of the numbers from one to a thousand, the answer is 500,500, and indeed that matches the expected result.

All right, next one, 2b. I think 2b is doing the same, but now with a while loop. I think that this is one of the hardest questions, because while loops are difficult to reason about: you have to keep track of how many times you're iterating, and if you write the condition that you're checking wrong, your while statement will continue on forever and ever. So I'm going to define a variable called whilesum, same as the forsum, but this is going to be the result from the while loop. And I need to have my number
so, because I need to keep track of how many times I go through the loop, I need to define this myself. I'm just going to say this is x, and x initially is one. So I can say while x is smaller than or equal to one thousand, what do I want to do? I want to do the same as what I did before, so whilesum is whilesum plus x. And now I have to remember to increase x, right, because we now added x.

"I think that happened to me, because my R console froze." Yeah, that's one of the biggest issues: if you write your while loop wrong, it will continue indefinitely. If I would do nothing and just run it like this, then this would run forever and ever, because x will always be one and nothing will change, and since one is always smaller than or equal to a thousand it will just continue forever, and then you have to quit it yourself. But let's not do that. So let's just say x equals x plus one, and let's use the proper equals sign because we already defined x. And then of course we want to check whilesum, and now we know that it needs to be 500,500, so I'm not going to add the additional sum like we did before. So let's see what's going to happen, if it will run, if it will stop. Go to R, and indeed whilesum also ends up at 500,500.

And of course you have the little stop button. If you by chance happen to write the while loop wrong and it takes more than 30 seconds, just click the stop button to stop the current computation. That's what it's there for.

All right, so that was question two. Question three: create a for loop that does the following a hundred times, and then there are three steps. Generate a random number; use if, else if, else to check if the variable is lower than, higher than, or equal to 42; use cat
to print one of the three statements, replace x by the random number, and make sure you add a new line. And then it says: "x is lower than 42", "x is higher than 42", or "42 is the answer to life, the universe and everything". Of course that's a reference to The Hitchhiker's Guide to the Galaxy. So let's just do that: question three. I want to do something a hundred times, so I'm going to say for x in one to a hundred, so everything within the curly brackets will be done a hundred times. I'm not going to use x, because x is just a variable that I'm defining here, but I need to define a variable to have the for loop work. I can't write "for one to a hundred", I have to say "for x in one to a hundred".

So what are we going to do? We're going to generate a random number, so I'm going to do runif, I want one number from zero to a hundred, and I want to round it, of course, with zero digits behind the comma. I'm going to store this in a variable called life, or whatever, you can choose your own variable names, but in this case, since it's a Hitchhiker's Guide to the Galaxy reference, we just call it life, why not.

Okay, so then I want to say if life is smaller than 42, then what do I want to do? I want to say cat paste, because I want to print back the number, so I paste life, comma, "is smaller than 42", backslash n, because we go to the next line. Else if life is larger than 42, I can say cat paste life "is larger than 42". And else. Now I can use a plain else because I checked exclusively: if a number is not lower than 42 and not higher than 42, it has to be equal to 42. So I don't have to do an else if life is 42, because I know this implicitly. The first statement checks if it's lower, the second one checks if it's higher, so if it's none of those it needs to be the same. That's basic numerology. So let's just print the statement, and then in this case we
can just say "is the answer to life, the universe and everything", and of course I'm not going to type 42, I'm going to use the variable itself. All right, so this is more or less how it looks: it's just a big for loop. We don't use the variable that we define in the for loop, we just use it to do something a hundred times: drawing a random number, checking it, checking it again, and then of course if it's not higher and not lower then it has to be the same. So let's see if we made no errors with any brackets or something like that. Let's go to R and see what happens when we run this a couple of times. Let's start at the beginning: we drew the number 28, which is smaller; we drew the number 92, which is larger, so we checked the if and the else if. And then we're lucky, because we drew 42 one time out of a hundred, which is pretty unlucky; generally you draw it once or twice, but at least we drew it once. So, 42, good. Back to Notepad.

Question number four. I think four was the one that someone wanted to have more time with. So, question number four, one of my favorites, because I like triangles. Don't know why, it's one of my favorite shapes. If you like triangles, let me know, because I think a lot of people like triangles. Triangles are a good shape, and in computers everything that is being rendered is generally rendered in triangles, so it's good to know about triangles and to like triangles. Some people like hexagons. I don't know why people would like hexagons, it's a silly shape, but I think it's because of CGP Grey, who did a whole "hexagons are the bestagons" video. I think triangles are the best, the best angles, whatever you want to call it.

All right, so making the triangle: use a while or a for loop and the cat function to print out a triangle of hashtags having 12 lines. Each line should have one more hashtag than the previous line. So we start off with line number one having one
hashtag, line number two having two hashtags, line number three having three hashtags. So the first thing that we have to realize is that the number of hashtags that we're printing is the same as the line that we're on.

"Circles." Yeah, circles are perfect as well, because they are literally perfect, right? They don't have any angles, but they are infinite shapes. "Hexagons are cool because of bees." Yeah, bees are cool. Hexagons are, I don't know, they're good, because it's about the circumference relative to the content of the shape. Circles of course are the best, because they have the least perimeter surrounding the most area inside, but hexagons are relatively stable. But video cards work with triangles: you can only draw a triangle on your video card and that's it. So I like triangles a lot.

But back to the question. The first thing that you have to realize: if I'm on line number eight, I need to draw eight of these hashtags. And we need to do something an x amount of times, in this case 12, because we want to have 12 lines. So the easiest, because we know how far we need to go, is to just use a for loop. So I'm going to say for, and then the line that I'm currently on, in 1 to 12. And now I need to have as many hashtags as the line that I'm on. Let's call it line.n. So, for line number line.n: we learned a function which can repeat something an x amount of times, which is the rep function. So we can do rep, and what do we want to repeat? Well, we want to repeat the hashtag symbol. How often do we want to repeat it? We want to repeat it line.n times, because if I'm on line one I want to have one hashtag, and if I'm on line seven I want to have seven. These are then called hashes, or whatever you want to call them. And now the only thing that I have to do is print them to the screen and then put a new line behind them. So
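As a small aside for readers, this is roughly what rep gives back for one line of the triangle. The names line.n and hashes follow the ones used in the lecture; fixing line.n to 3 is just an assumption to have something to look at outside the loop.

```r
# What rep() returns for a single line of the triangle
line.n <- 3                   # pretend we are on line 3 of the loop
hashes <- rep("#", line.n)    # repeat the hashtag symbol line.n times

print(hashes)                 # a character vector, one element per hashtag
length(hashes)                # equals the line number
```

Note that hashes is a vector of separate one-character strings, not one string, which matters for how it gets printed.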
I'm just going to say cat, and I'm going to say paste, and what do I want to paste together? Well, I want to paste the hashes together, and then I'm going to add a new line at the end. So let's run this and see what happens. Let's go to R and just copy the code in. You see now that something annoying happens: we see that there are spaces in between, and in the example there were no spaces. But it already looks like a triangle, so let's remove the spaces.

The spaces are there because I'm using the paste function. The paste function has a parameter which is called the separator, and the default separator is a space. So I just want to overwrite the default separator and say that my separator is nothing: don't put a space, put nothing in between. So now if I go to R and paste this in, it should fix the spaces. Which it doesn't. So that's strange: where do the spaces come from?

The spaces come from the fact that the paste function is a very complex function. It can paste two things together, but you can also give it a vector, like we saw in one of the first assignments where we made the row names. We did paste("individual", 1:10), and then you see that it actually makes ten individuals, and this is a vector. So if I have a vector and I try to paste the elements of that vector together, then that is actually not called separating them, that is called collapsing them. What you have to do in this case is just read the help file of the paste function. Let me see if I can get that one in the correct window. Probably can. So when we go to Firefox, the function description says "concatenate vectors after converting to character". And then you see it has a separator: "a character string to separate the terms". And then you have collapse: "an
optional character string to separate the results". So you have individual terms that you paste together, and those are separated by the separator, but if you have a vector, then the vector is collapsed together using the collapse parameter. So in this case we also need to set the collapse parameter to a value. We have to go back to the hashes and say: hashes is the rep, then cat, paste, separator is this, collapse is also this. So if you get a vector, collapse the vector on itself, make a single string out of it, and if you do that, collapse them using nothing, to get rid of all of the spaces. All right, so let's see how this works in R. Go to R and print it out, and now it works: we have no spaces, we have a triangle, and it has a length of 12.

Good, so I hope this was not too hard. The idea behind not giving you a hint about this is that you guys should be learning to read the documentation. If something doesn't work as you expected, read the documentation. In this case, for the collapse, there's probably an example. Right, so here you see paste0 and collapse: "to collapse the output into a single string, pass the collapse argument". So you could just run all of the examples and then you could figure out that, oh yeah, they have an example for when you want to paste a vector together. Just run the examples, and the example shows you that indeed you can use this additional, hidden parameter to collapse things and get rid of the additional spaces.

Good, so read the documentation, because I can't tell you everything. In the end programming is discovery: you have to figure out where you want to go and then you go there, and along the way you have to solve all kinds of little problems. These little problems you can solve either by going online and asking a question, or just by looking at
the documentation, 99% of the time.

Good, next question. The next question is question number five already, and number five is string escaping. This is a little bit nonsensical to do, but let me just do it for you guys, because it's often the case that you want to print stuff to a file and then you have to use a double quote, or a single quote, or a backslash, or something else. So this is question number five, and let me show you guys Notepad so that you can keep track of what I'm doing. I'm going to copy-paste the text directly, just to make sure. So I'm going to say cat, and I'm going to put in what I want to cat, and I'm going to make sure that I do it on one line. And now you can see that Notepad++ is really helpful: it already shows me that this part in gray is within the string, because everything inside of a string or character variable is colored gray, and here this is black, meaning that this is outside of the string. Why? Because here I have this character, which is not escaped. So I have to escape this one: just put a backslash in front of it.

All right, then we go on: "Escaping stuff is great, but I think \ and", so the backslash always needs to be escaped; the backslash itself has to be escaped, so put another backslash in front. The forward slash doesn't need escaping. And then "might be a nuisance". The sentence doesn't end here, because this is part of what we want to cat, and then we continue: "You are correct, but I think that \t", so I have to put an extra backslash in front of the backslash to escape it, "\b", extra backslash to escape it, "create more of a problem than a basic hashtag". And actually this is a smart quote, so just replace it by a backslash quote. All right, so this should then be it, and we can close the string here. So let's go to R, see if it works
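The escaping rules from this question can be tried on a few made-up strings. The sentences below are hypothetical, not the assignment text, which is not reproduced in the transcript.

```r
# A double quote inside a double-quoted string needs a backslash in front
cat("She said \"hello\"\n")

# A backslash is written as two backslashes; a forward slash needs nothing
cat("backslash: \\ and forward slash: /\n")

# To print the two characters \t literally (instead of a tab),
# escape the backslash
cat("literal \\t and \\b\n")

# nchar() shows the difference between the escaped and unescaped form
two.chars <- nchar("\\t")   # a backslash followed by the letter t
one.char  <- nchar("\t")    # a single tab character
```

Checking with nchar is a quick way to see whether R read an escape sequence or the literal characters.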
and then let's compare it. It says, colon, "Escaping stuff", so that's good, "is great, but \", that's also fine, "nuisance", then there's a new line, which is what we wanted, "You are correct, but I think the \t and \b create more of a problem than a basic" thing. Perfect, it looks exactly like the example text.

All right, so one of the things that I sneakily did is that I didn't add in my own new lines, and this is one of these little things that you can do in R. Let me switch you guys to Notepad. If you do it like this, so the quote here is closed here, but I'm pressing enter inside of the string, then R sees a multi-line string, and because it's a multi-line string, cat will automatically include the new line for me. So a little bit of a trick, but you don't really have to worry about that; often you just want to type the new line yourself, just to be sure.

All right, question number six. 6a, random variables: set your random number generator seed to a value of your choice. All right, that's easy: we can just say set.seed, and the value of my choice is 42. You can pick any number that you want; I always pick 42, that's my seed and I'm sticking with it. Use runif to generate a vector containing 15 random numbers between 0 and 10 and store it in a variable called random1. All right, let's do this: random1 is runif, to generate 15 random numbers between 0 and 10. And that's question 6b already, so 6a and 6b.

Let's just make sure that this works. We go to Notepad, we generate, we type random1, and then we set our seed back, and then we generate random1 again and type it again, and then it should be exactly the same. So yes, every number from the first time matches the second time, so we have repeatable randomness. All right, so next question: use the round function to
round your random numbers in random1. So let's go here and round them. Let's just do it like this, 6b and c, and I'm just going to put the round function around it, saying no digits behind the comma. Let's rerun the code in R to see if we didn't break anything. Of course we now expect random1 to start with nine, and then the next one would be nine as well. And random1: so 9, 9, 3, and so on.

"What does set.seed do? I don't get it." Okay, so if I do runif, I generate a random number. If I just generate one, I get 0.94, then 0.97, then 0.11. Now, I want to have repeatable randomness, because often I want to be able to repeat my analysis and get the exact same result. So if I'm using random numbers somewhere in the analysis, then at some point I have to fix my random number generator, and set.seed is what does that for me. If I do a set.seed with the number 42 and then afterwards generate a random number, and I'm going to do this on one line so we can repeat it a couple of times, then it will always generate the exact same number, no matter how often I do it, because I fixed my seed. So the numbers that I draw after setting my seed will always be the same. If I generate five numbers, they will always be the same, and funnily enough, the first number will still be the same as before. And setting my seed again will generate the exact same five numbers. So in case I want to write a test case for an algorithm that I've designed, and this algorithm uses random numbers, I need to know what the output is, because otherwise I can't test it. If the output is more or less dependent on the random numbers that I use, then I need a way to generate repeatable randomness, and that is what set.seed does: repeatable randomness.

All right, so let's go back to Notepad. We have our random1 vector, which is rounded, and now: reset your seed to your
number and generate a single random number using rnorm. All right, so we're going to do 6d, hashtag 6d.

"So the number used for set.seed can be literally any number? Or how does the choice of number influence the randomness that is repeated?" It doesn't, it just gives you multiple starting points. Setting your seed to 42 will lead to the random numbers being this; setting your seed to 500 will generate a different vector of random numbers from then on. But you have to have multiple seeds, because otherwise you would always get the same sequence. So in this case your number is up to your choice.

6d: reset your seed to your number and generate a single random number using rnorm. Okay, so we're going to set the seed to 42, and now we're going to say rnorm, and we're going to generate one random number. Now repeat steps 6b and 6c and store the results in a variable random2. All right, so I'm just going to copy-paste 6b and 6c, and I'm going to store this in random2. And now the question is: why is the content of random1 and random2 not equal to each other, and what do you observe when looking at the sequence of numbers generated?

All right, so this is a little bit tricky, and it's a hard question, because it requires some understanding of why we do this and what's happening. So let's first look at random1. Random1 has the numbers that we expect: 9, 9, 3, 8, 6, 5, 7, 1, 7, 7, because that's the first one. And now we have random2, and random2 is actually 3, 8, 6, 5, 7, 1, 7, 7. So it is different, but it's not that different, because it's just shifted: this part here is exactly the same as this part here. What happened is that when we called the rnorm function, it took two random numbers from the random number generator. So when I then do the same thing that I did before, it did not generate the 9, 9, but it did
generate the other numbers and then continued, because I asked for 10 numbers. So what is happening internally is that all of these random functions, like runif, rnorm, rpois, all of these random number generators in R, share the same source of randomness. They share the same seed, which means that if I draw a random number using the rnorm function, then I influence the random numbers being drawn afterwards with the uniform function. Drawing a uniform number takes only one draw from the random number generator, but with R's default settings generating a normal number takes two. So the idea here is that you can have repeatable randomness, but this repeatable randomness is only guaranteed to be exactly the same as long as you don't draw any other random numbers in between. If you change your code and add another draw of a random number before the random numbers that you want to be repeatable, then this influences it and it just shifts. The shift in this case is always going to be two, and why is it two? That's because generating a random uniform number takes one draw, while a random normally distributed number takes two draws from the random number generator's source. So it's a little bit difficult, also for the interpretation, but I think it's good to realize that you can have repeatable randomness. If you ever write an algorithm, for example a machine learning algorithm, which uses random numbers, then you can still write a test case saying: if this is the input and this is my seed, then this should be the output, because the numbers are randomly generated but they are repeatably random. Good, so that's 6f. Let's go back to Notepad++. 6f is more or less an interpretation question, so I'm not going to add that. All right, so
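As a minimal sketch of the repeatable-randomness exercise, assuming the seed 42 used in the lecture and R's default random number generators: the variable names and the use of runif(10, 0, 10) to get whole numbers between 0 and 10 are assumptions for illustration, not the lecturer's exact code.

```r
# Same seed, same numbers: set.seed() fixes the starting point of the generator.
set.seed(42)
first <- runif(5)
set.seed(42)
second <- runif(5)
identical(first, second)              # TRUE

# runif() and rnorm() draw from one shared stream, so an extra draw
# in between shifts everything that comes after it.
set.seed(42)
random_one <- round(runif(10, 0, 10))
set.seed(42)
extra <- rnorm(1)                     # one extra normal draw before the uniforms
random_two <- round(runif(10, 0, 10))
identical(random_one, random_two)     # FALSE

# With the default settings one rnorm() draw uses up two uniform draws,
# so random_two lines up with random_one from its third element onwards.
identical(random_one[3:10], random_two[1:8])
```

The last comparison is the shift by two seen in the lecture. It depends on the default generator settings, so a different normal.kind in RNGkind could change the size of the shift.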
functions. Okay, so good, we've more or less done the practicing with the if statements, the for loops and the while loops, and now we're at the favorite part, which is creating functions. So question number seven. I think this is one of the questions that, yep, we're going to spend a little bit more time on. Okay, so create a function that returns the result of a coin flip. A coin can only land on its head or on its tail, so we're just going to have a function which is called flip coin. So we're going to say hashtag seven, flip coin. Flip coin is a function, and it takes no input parameters. Then I'm just going to say runif, one number between zero and one, and I'm going to round it one digit behind the comma. Yeah, I can do it better, but let's keep with the runif thing. So I'm just going to round the number. Let's check in R if this really generates what we want. No, zero digits behind the comma, sorry, zero digits. So this seems to more or less do what we want: sometimes it generates zero, sometimes it generates one. In R, because arrays start from one, I'd rather have a number which is one or two, because then I can just create an array and select the first or the second one. So let's do this, and instead of generating a number from zero to one, let's generate a random number from one to two and then round it. So this will give me one or this will give me two. All right, so let's copy this into the function. I'm going to say this is my index. Then I'm going to say, well, I have two options, I either have heads or I have tails, and now I use the index that I just threw to take one of these two out and return this to the user. And then this is going to be my function. So I'm going to draw a number which is going to be either one or two, then I'm going to make a little vector containing two elements, and then I'm going to say: from this little vector take
either the first or the second one. Which one we're going to take is based on the variable called index, and index will either contain one or two. All right, so let's flip the coin a couple of times. We're going to paste in the function, and now when we want to call the function we need to do flip coin and then round brackets. So flip coin, flip coin, flip coin. So yeah, it either generates heads or it generates tails. Can we actually prove that it's reliable? Because a coin will fall on tails half of the time and on heads half of the time, so of course we want to check that a little bit, just to make sure. So I'm just going to say for x in 1 to, let's draw it ten thousand times, or a hundred thousand, no, ten thousand times. We're just going to say flip coin, and then I'm going to say I have a vector, which is my flips, which initially is empty. So to my vector called flips, add flip coin. So just very basic, and you will see this all over again in R: defining a new variable which is initially empty, having a for loop, and then using the c function, where you start with the empty array, or everything that we did so far, and then we add just a single new value to it and assign it back to what we had. Pretty memory inefficient, but it works really well. All right, so let's flip the coin ten thousand times and then make sure that it works. So now we have our flips, and what I can do is just make a table out of this to make sure that it's around 50/50, which is true. There's going to be some variation, because there are random numbers in there. In this case 5,006 heads and 4,994 tails, so pretty random, pretty balanced in a way. Heads came up only six times more than the five thousand you would expect, but that's something that we might expect. All right, so that's our function, and
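The coin flip and its 10,000-flip check can be sketched roughly as follows. The identifiers flip_coin, sides and flips are assumed spellings of the names spoken in the lecture, and the body follows the described steps rather than the exact code on screen.

```r
# Coin flip as described: draw an index of 1 or 2, then pick from a vector.
flip_coin <- function() {
  index <- round(runif(1, 1, 2))   # rounding a number between 1 and 2 gives 1 or 2
  sides <- c("heads", "tails")
  return(sides[index])
}

# Flip 10,000 times, growing the result vector with c() as in the lecture.
flips <- c()
for (x in 1:10000) {
  flips <- c(flips, flip_coin())
}

# Both counts should come out close to 5,000.
table(flips)
```

Growing a vector with c() inside a loop copies it on every iteration, which is the memory inefficiency mentioned above. For 10,000 flips that cost is not noticeable.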
directly we have a little test for it. So let's go to question number eight, hashtag eight. All right, question number eight: reuse the code you created in assignment four, but now make a function... uh, we have a question on the Zoom. A question: "What if we need, like, seven flips, or n flips, n put in by the user?" Okay, sure. Then of course the function needs to have a parameter, n, and then I'm just going to do this, because I already made my test which does the coin flip a couple of times. So I'm just going to take this part out and call it one flip, and now I'm going to create a new variable called multi flip, which initially is empty. And I'm just going to say for x in 1 to n. What are we going to do? Well, we're going to create an index, we're going to flip the coin once, and now we're going to remember it. So we're going to say multi flip is a combination of the multi flip that we had and the one flip that we just did, and then of course we're now going to return multi flip. So it's kind of a mixture between the function and the test of the function: just write a for loop around it. If you want to do something n times, just make a for loop which does something n times. Of course you then have to create a vector, so you have to make sure that you store the result. So again the same strategy: initially create a variable which is empty, do what you want to do, then add the one thing that you want to remember to the empty variable, and store it back into the thing that remembers everything. Now of course our flip coin function will look a little bit different, so let's delete the test, because we now need a new test. We now need a test which calls flip coin with one, flip coin with 10, and then flip coin with 50, just to test. So let's copy-paste this into R. When we go to R, we have one flip, 10 flips and 50 flips. So again the same thing: if you need more stuff, just put a for loop around it. For is your friend, especially if you know how often you
want to do something. With a while loop it's harder, because then you have to figure out when you need to stop, but for something like this it's not too hard. And of course you could have made a new function out of this as well. Instead of what I did, butchering my original function, I could have just kept the function called flip coin and then made a second function called flip multi coin, or multi flip. "Would you please upload both the solution of the original question eight as well as the special solution?" Sure, sure. Did I not already upload them? I think if you look on Moodle, the original solution might be a little bit different, because you can do this in, like, tens of ways. Let me actually check Moodle for you guys. I think I already uploaded the answers to the assignments, so that should already be there, and of course I can add the multi flip as well. Just to check that it's there: yeah, the answers to more introduction should be there. Flip coin, flip coin... oh, yeah, so I did it completely differently in the original answers. This is what I had in the answers online. Here I just drew my random number, and then I checked if it's smaller or if it's larger, and then I actually had a special case for when it's on its side, which might be interesting, because of course a coin can fall on its side as well. So, probably as a joke, I made it so that it can be heads, it can be tails, or it can be on its side. But yeah, I will update both of them and add some more to it. In this case you can be really creative. There's actually a special function for what we're doing here, because we're generating an index and then using that, and we don't have to do that: in R there's a function called sample. So I can just say: sample from heads or tails, one element, and this will do the exact same thing as the whole function that we just wrote. So it generates the index and
then takes one of the two out, but the sample function is just a little bit easier, because it's just a single line where you can sample. And the nice thing is that now you can also sample 10, but then you have to say replace is TRUE, because of course when you have only two elements you can't take 10 of them out. If you say that replace is TRUE, then you can actually have it multi-flip in one line of code. But of course the idea here is that you guys work with the things that you have learned. "You don't need an else for on its side?" No, because, again like in the question number 42, because in the function... let me go back to Notepad. So, the original answer, let me just copy it from Moodle again. Like this. Because of the return here, it bails on the function. So if it is smaller than 0.5, it will directly return heads and stop the function. If it is not, then it will check this, and if this is true, then it will directly return, so it will not even try to execute this line. But of course if this one is not true and this one is not true, then the number we drew was exactly 0.5, and then, and only then, will it come to here. So in this case, because we are returning, and returning means throw the box out of the factory and the whole factory shuts down directly, it doesn't even look at this line of code. If rnd is smaller than 0.5, then these two lines of code are completely ignored because of the return. So return is a very powerful control structure, because it bails on the current function: you give back the box, and everything shuts down and is cleaned up, so there's no reason to clean that up yourself. "You don't need an else?" No, we don't need an else, and that's also why we don't need any brackets, because we directly return. You could do something like this, but that's not necessary. All right, enough flipping coins. We've seen four different ways of doing it now, and there's
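The two alternatives discussed above can be sketched like this: sample() for one or many flips, and an early-return version in the style of the answer on Moodle. The name flip_coin_side and the exact strings are assumptions, since the original answer is only described in words.

```r
# sample() picks elements from a vector directly.
sides <- c("heads", "tails")
one_flip <- sample(sides, 1)
# replace = TRUE is needed to draw more elements than the vector contains.
ten_flips <- sample(sides, 10, replace = TRUE)

# Early-return version: each return() leaves the function immediately,
# so no else branches and no curly brackets around the ifs are needed.
flip_coin_side <- function() {
  rnd <- runif(1)
  if (rnd < 0.5) return("heads")
  if (rnd > 0.5) return("tails")
  return("on its side")            # only reached when rnd is exactly 0.5
}
flip_coin_side()
```

In practice the last line of flip_coin_side is almost never reached, because drawing exactly 0.5 is extremely unlikely.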
many different ways. These are not the only ways that you can flip a coin, there are even more ways of doing this. But sample is very elegant: sampling is really useful, also when you want to take a subset of individuals, and the replace argument also makes it really nice, so that you can do it in one line of code. Ah, question number eight. So: reuse the code you created in assignment four, but now make it a function called triangle that prints a triangle of which size of which the size can be specified by the user. Oh my god, that's such a crooked sentence. Why did nobody mail me that? Like, mail me: your assignment is written like someone who just learned English for the first time. There's like a "prince" and there's even a double n. So yeah, guys, please, if you do the assignments, spell check me, because you're helping not just yourself but also the people afterwards. Let me directly fix that, since I have the original ones here: "that prints a triangle of which the size can be specified by the user". This is like obtuse language plus plus. "The function signature will look something like..." Here size is the parameter that lets the user specify the number of rows in the triangle. All right, so, bearing with the very, very poor spelling of the question, this is the example code that you got, so this was the code which was in the assignment. We want to reuse the triangle code, and the first thing about code reuse is that it's copy-paste. So we're just going to go to the triangle and copy-paste it. We're going to go into the function and we're just going to do, blam, there's our code. So the code for the triangle was: make a for loop. We have to realize that the number of triangles that we need to print on every line is the same as the line that we're on. So I now have my function called triangle, I copy-pasted my
code in, and I know that this code works for a triangle of size 12. So of course the first thing that I'm going to do is just say: well, instead of going to 12, go to the size that the user specified, so just say size. All right, so let's try this. Let's just be naive and say make a triangle of size five, and then make a triangle of size 15, and then make a triangle of size 20, or 29, that's perfectly fine. All right, so let's go to R. Code reuse is the best reuse, because we just copy-paste it. So we go to R and then we see the first triangle, 1, 2, 3, 4, 5, and then this is definitely going to be 15, and this is definitely going to be 29. So it looks pretty good. Minimal code change: the only thing that we changed is the end of the for loop, and it seems to work pretty well. Of course this function still has some things that we might want to fix, because no one tells you that the size can't be negative. With negative 12, do we want to start off with 12 and then go all the way down? So that's kind of the idea. "It's like half of a pine." Yeah, yeah. Originally there was actually another question, which came back in, like, lecture number five or six, where you had to reuse the triangle code again, so the function that you made, to make a diamond: to make four of these triangles, flip them, and then make them into a diamond shape, which is also possible. And also here, if the size is minus 12, should we start off very wide and then go all the way back to zero? That would be fun. So there are many different ways that you can interpret this, but you can also just throw an error with stop. All right, perfect, so that's question number eight. There were a lot of questions, I do agree. I almost spent an hour on this, and I'm thinking that, yeah, we might want to drop one or two. All right, so question number nine: create a function that calculates the factorial, so five factorial is five times
four times three times two times one, of a given number. The function signature should look like this. Here x is the function parameter that holds the input by the user. All right, so more or less the same as we had before. Let me copy-paste this in. First I'm going to say hashtag nine, and this is the example code. So let's compute a factorial. We can do this in two ways: we can use the more or less default way, which I like, or we could be really smart and start using recursion. I told you guys about Ada Lovelace and Charles Babbage, and that she's the inventor of recursion, and recursion is very, very powerful. So I'm going to do this one twice: once just using a for loop, and once using recursion. First let's use the for loop. Again I'm just going to have something which will store the result, like the while sum or the for sum. My result is going to be one and not zero. And why is it going to be one? Because anything multiplied by zero will be zero. If I say res times one, res needs to be one. If I would set res to zero, then it wouldn't work, because then zero times one would be zero, and it would continue to be zero. So I'm just going to say my result is going to be one, because no matter what I do, I'm multiplying numbers, and these numbers are going to be higher than zero, because I cannot take the factorial of zero. Well, in theory you could, but zero is not in the list of numbers of a factorial. So I'm just going to say res equals one. Then I'm going to say for i in 1 to x, because x is the number that the user gives me, and then i is going to be my current number. Or I can call it cn, for current number, so let's call it cn. So what do I want? I want to say res is res multiplied by the current number, and then I'm going to return res in the end. All right, so let's check this. So
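The for-loop version described here can be sketched as below. The spelling my_factorial and the variable names res and cn are assumptions based on the spoken names, and the sketch assumes x is a whole number of at least 1.

```r
# Iterative factorial: multiply a running result by 1, 2, ..., x.
my_factorial <- function(x) {
  res <- 1                 # start at 1; starting at 0 would keep the product at 0
  for (cn in 1:x) {        # cn is the current number
    res <- res * cn
  }
  return(res)
}

my_factorial(1)   # 1
my_factorial(3)   # 6
my_factorial(4)   # 24
```

Note that 1:x counts downwards when x is 0, so this sketch would return 0 for my_factorial(0) instead of the mathematically defined 1. A check on the input would be needed to handle that case.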
let's take a couple of easy examples. My factorial of one is going to be one, of two is going to be two, and of three should be six. And then four should be six times four, it's already going to be hard, 12, 24. So let's see if we get the answers that we expect when we go to R. All right, so basic factorial function, version number one. That seems to be correct: one times one is one, one times two is two, three times two times one is six, four times three is 12, times two is 24, and so on. So this seems to work. This is an iterative version: we are using an iterator, so we're going through the numbers one by one. Of course I can take a larger number as well. I can take, like, 900, and then it will say this is infinite, because it very quickly explodes into a very, very big number. I think a hundred is still possible, but then you already see that it's a number times 10 to the power of 157, which is insanely big. So let's use recursion, because I love recursion and I just want to show you guys how nice it is. So, hashtag recursive factorial. It's very close to the mathematical definition, and that's what I like about recursion. Again we have a function of x. Now I'm going to do something slightly different, because I'm going to say: if x is one, I know what the answer is going to be. If x is one, then I return the value one, because the factorial of one is one. So if x is one, return one, and now I'm in my else branch, because I directly return. So if I get here, at line 125, it means that x is different from one. It should be higher than one, because that's one of these things in recursion, so we have to make sure that we check that as well. But if x is not one, what do I want to return? Well, I want to return x times the factorial of x minus
one, because that's how it's defined. Five factorial is actually five times four factorial, and four factorial is actually four times three factorial, and three factorial is actually three times two factorial. And this is a really elegant solution: it's two lines of code, you have a base case, and then you're basically calling the exact same function. Oh, my factorial, sorry, my factorial. So we call the same function, but now with the parameter one less. And of course if I take any positive whole number, then doing minus one will always end up hitting the base case of one. All right, so let's try this as well. Make this new function, and then we're just going to call it four times with different values to see if it really does what we expect it to do. All right, so recursive, very recursive factorial, and then we see it works really well. And the nice thing is that it is much more compact this way. Calculating my factorial of 100 will of course give you the exact same answer, but the code doesn't have to have a for loop or an internal number or whatever. No, it just says: well, I'm calling myself again, this time with a number that is one smaller. Of course this is really hard to realize. Recursive functions are very close to the mathematical definition of the function itself, but you have to work for getting them, and they don't come easy. So it's not wrong to write it like this. "In Python recursion is very slow, is this different in R?" In principle recursion should not have to be slow, even in Python. So if there's something that's slowing it down, then it could be because you're running out of memory, but a recursive call should not have to be much slower than a for loop, and if that is the case, then there's something wrong in the
programming language, because that should never be the case. "Why don't we simply use the factorial function itself?" We could. It's kind of cheating, but of course this would be a perfectly valid answer as well. So you could say my factorial of x, and what do we want to do? We want to return factorial of x. "But just when using it for Fibonacci..." It shouldn't be. Then there's definitely something wrong there, because with recursion you're using stack memory and not heap memory, so it should be quick. But I know this from using things like C and C++. I don't know exactly how it is in Python, because Python and R are both interpreted languages, so they have some additional overhead, and it could be that it's not being optimized by the interpreter because of that overhead. But I know that, for example, when you're using C or C++, it works. And yes, you can use the factorial function as well, but then you don't learn anything. Well, it's not wrong, it's just not the answer that we're looking for, because we want to do the stuff ourselves and not use the factorial function that someone else wrote. Good. Do we have another question, or do we have break time now? Oh, we have some additional assignments. Okay, so do we want to do the additional assignments? Because generally I don't do them, because they are additional, they are for you guys. That's also why you have the little piece of text saying: "Your candle fades as you walk into darkness. Suddenly you realize you are on your own." The idea is that these things are just additional for you guys to practice, and I think that the answers are on Moodle. And yes, they are on Moodle. But if you want, we can do them now. If not, then we'll switch to the first break and continue with the lecture afterwards, because I don't think that they're that important. It's something that
you guys can work on. And especially extra two is really, really good. I love extra two, because it's something that was always asked in elementary school to figure out how much math skill children had, so it's a very common test for how smart a child is. All right, so if there are no opinions one way or the other, and there's no one screaming "please do the additional assignments as well", then I think we're just going to take a break, and I will be back after the break. In the meantime we're going to have animated GIFs, of course, like we have every week. I just forgot which animal I chose. I really don't know anymore: I did it this morning and then I had a meeting, so I completely forgot which animal I chose. I think it's going to be goats, something inside of me yells goats. So again one of these very, very tasty animals. I will see you guys in around seven to 10 minutes, and in the meantime take some coffee, take a little break, and then we will just do the lecture. Like I told you guys, the lecture will probably be relatively short, because it's only like 40-something slides, unless you guys have a lot of questions, which of course is perfectly fine. "Poor people's cows." Yes, yes, poor people's cows, that's a kind of running joke in our department. We have one colleague who works on goats, and she always calls goats the poor people's cow. We also have people working on cows in our group, and I always want them to say that cows are the rich people's goat, but they never do that. I should convince Paula to do a presentation about cows and say cows are the rich people's goat. But anyway, running joke in the department. So let's do the break. Let me switch on the sound for you guys, then we go to music, then we start our music. Let's do Reactor. Oh, that's a very short one, but that's good. And then we go to the break. So I will see you guys in five to ten minutes, so enjoy the break. I think it's goats,
so enjoy the goats. Very short break, very short break, five minutes guys, that was a rush. "Eat me, I'm a sheep." Well, sheep are cool as well. I don't think I have sheep. Let me actually check my streaming folder and then my break GIFs. Do I have sheep? No, I don't have sheep, so that's a good animal to put on the list for a future one. All right, let's do the lecture. If there are no additional questions about the assignments, then I hope everyone was able to get through them. There is a lot, I agree it's a lot, but it's something that you have to practice, because practice makes perfect. Especially in programming: you can't learn programming without having a laptop, having an empty Notepad++ window or an empty editor, and just starting to type. Good, so what are we going to do today? On Moodle you will find some free books, and definitely download them. Even if you're not going to use them now, you're going to need them in the future. They were free due to the SARS-CoV-2 pandemic, but I don't know if they're still free. If they are, then people who are not signed into Moodle can get them from the following links. And of course I will put the presentation on my website, so there's a link down in the description where you can find links to the PowerPoints and to the assignments, and then you can also find a link to the books by clicking on them. I think they're also clickable in the PDF version. Good, so for today we are going to read data. Everything today is going to be about data. So how are we going to read things like tables, text files, binary files, and massive text files? For example, DNA sequencing data is a big text file with nothing but 75 to 150 base pair reads in it, and millions and millions and millions of them. "Which one is the best book?" It depends a little bit on your level. If you are really a novice, then the first one is the best one, but if you know statistics
and you know a little bit of R, then the middle one is the best one. If you don't know statistics but you know a little bit of R, then the last one is the best one. But all three are pretty good, and all three are relatively relevant. So for today: reading data and compressed files, because one of the advantages of R is that you can directly load in data which is compressed, for example using a zip file or using .tar.gz compression. That means that you don't have to have massive uncompressed files on your hard drive, but you can compress files and then directly use them in R. And then I want to say a little bit about what you are going to do with your data once you have it in R. So I want to introduce three new functions, more or less: the which function, the %in% operator and the subset function, to manage your data and make subsets. I think we saw the which function already, which turns a logical vector into an index vector, but I want to tell a little bit more about them because they really fit the topic. And then of course when we're reading data, we also want to write out data, so the second part of the lecture will probably be about managing your data and saving your data. I also put in a couple of slides about biomaRt, because biomaRt is one of these packages in R which is really useful for people that study biology, and since we are at a biology department, I think biomaRt is something that people will enjoy. It really helps you to get data from the Ensembl database, which is more or less the de facto standard database for biological and bioinformatics research, so I do think that it makes sense to also talk about how to get data from biomaRt. All right, so a quick recap of the lectures that we already had. In lecture one we used R as a calculator, we discussed the different types of data and how to index them,
and we also discussed things like creating a sequence, repeating something, and going from one to 20. Then in lecture two we talked about variables and control structures and functions and brackets and how to escape stuff, and randomness, so drawing random numbers and setting your seed. I also had a few words about clean and reusable code, and fortunately the assignments that I got from the students, because some people mailed me their answers when they got stuck somewhere, already start looking a lot cleaner. So that's really good: people are listening and saying, yes, we should definitely code like a professional and not just throw everything in R and then copy-paste it out of R. So, clean and reusable code: write it in a text editor, make sure you use proper indentation, and these kinds of things. That's kind of important because it will help you: nicely structured code is easy to read, easy to understand, and allows you to reason about your own code. So, R and data. There are many different sources from which we can get data. Random numbers, we already discussed that: random numbers are a source of data and are really useful if you just want to have a little example data set and have not collected your own data yet. One other source of data is R packages. Many, many R packages come with example data, and some of them actually come with very valuable data sets, so there is a lot of data available in R. I want to show you guys how you can use that data, which is freely available. And then of course there are files on your hard drive, or on a USB stick, or wherever you store your files, and these files are also an input source of data. These can be comma-separated files, but also Microsoft Excel files or tab-separated files, and even images, because images are also nothing more than a certain representation of data. So all of these things we will go through, and I will show you some examples of how you can use R to interpret
this data and load it in and do things with it. Furthermore, of course, you can get data online. For example, you can scrape web pages or you can download data from FTP servers, and you can do this directly in R. So you don't have to go to the website, click on the FTP site and then download the file that you want: R can do that for you. This has the added advantage that if you are interested in, for example, modeling stock prices or modeling Bitcoin or whatever, things that change very frequently, on a daily basis, then it doesn't make sense to download a data set which has the current prices. No, you can directly go to, for example, Google Finance and download all of the financial data, which is up to date. That is a really useful feature, because it means that if you run the same script tomorrow, it will have the latest, up-to-date numbers without you having to go to the website, click on it and download it again. All right, so random numbers, a very nice source of data. We discussed this in lecture two: there's a uniform distribution, a Gaussian distribution and a Poisson distribution, and here we see a nice Dilbert cartoon about random number generators, just like we had the xkcd last time. So random numbers are a good source of data, very good if you have not collected your own data but just want to get some input data which has a structure similar to the data that you're going to work with. I told you guys that data is also available in R packages. A lot of the R packages that are out there contain example data, and you can load it with the data function. The nice thing is that if you just start up R and throw in this command, then it will show you all of the data sets. That is really useful, because there are literally hundreds and hundreds of free data sets in R that you can use and that have been collected by other people. So normally what I do is I load in a data set using data, so for
example the USArrests data, which is data on the United States of America where they collected murder statistics, assault statistics, the urban population, and also statistics on rape. It's just some historical data which you can use to figure out if there's a correlation between the number of murders and the number of rapes in certain states in the US. Loading it is just data and then the name of the data set. If you want to see the top few lines to get an idea of which columns are in there, you can use the head function. The head function shows you the first lines of a data set (six by default). So the head function is really useful. There's also a tail function if you want to see the bottom rows, which is sometimes really useful when you load in a data set and want to see if it really loaded in the whole thing. But the head function is useful: just the first lines, and it will try to fit them in the R window so that it doesn't start wrapping around and this kind of nonsense, because if you just type USArrests it prints all 50 states to your R console and you have to scroll up to see the column names and the first row names.

All right, reading from files is relatively easy, especially when they are structured files. If you have a structured, tabulated file like a CSV (a comma-separated file) or a tab-separated file, you can use the read.table and read.csv functions. As a personal preference I always use the read.csv function, because there are some issues with the read.table function: it sometimes stops halfway through, and that's really hard to detect. The read.csv function also seems to be a lot quicker in loading in data. They do the same thing and take the same parameters, only with different defaults, so if you have a choice I would advise you to use the read.csv function. It loads in tabular data using a certain separator, and you can interpret missing values using na.strings.
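As a minimal sketch of the steps just described, the snippet below loads the USArrests data set that ships with R, peeks at it with head, tail, and dim, and then does a round trip through a CSV file to check that everything was read back. The temporary file and the variable names csvfile and arrests are made up for this illustration.

```r
# Load the bundled data set into the workspace.
data(USArrests)

# Peek at the data: head shows the first 6 rows by default,
# tail with n = 5 shows the last 5 rows.
head(USArrests)
tail(USArrests, n = 5)

# Number of rows and columns.
dim(USArrests)

# Round trip: write the data to a temporary CSV file and read it back.
# The first line holds the column names, the first column holds the row names.
csvfile <- tempfile(fileext = ".csv")
write.csv(USArrests, csvfile)
arrests <- read.csv(csvfile, header = TRUE, row.names = 1)

# Check that the whole file was read: same dimensions as the original.
dim(arrests)

# Clean up the temporary file.
unlink(csvfile)
```

Comparing dim before and after is the same check the lecture recommends for any file you load: the number of rows read should match the number of lines in the file.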
For example, if you get data sets from other people, other collaborators, often different people use different values for missing values. Sometimes they use nothing, sometimes it's a dot, sometimes it's an x, sometimes they use a minus sign, and some people use three minus signs for missing data. So there are a lot of different ways that people encode missing values, and you can set all of these using na.strings. This is just one of the parameters of the read.table and read.csv functions, which allows you to say: these values should be interpreted as missing.

Of course it supports headers and row names. You can specify that there is a header by saying header = TRUE. If the first line of the file is not data but the column names, then header = TRUE will treat the first line of the file as the column names. row.names is a little bit special, because for row names you can just specify a number. By specifying row.names = 1 you're saying that the names I want to use for the rows of the table I'm reading in are in the first column. You can also say row.names = 2, and then it will use the second column as the row names.

These functions don't return a matrix; they return a data frame. This is because in the data that you load in, different columns in general have different types. As we remember from lecture one, a matrix is something in which all columns have the same type, like a numeric matrix or a character matrix, while a data frame allows every column to have its own data type. So the read.table and read.csv functions give you back a data frame, and they have to guess which column needs to be which type. If you look at the documentation of the read.table function you see that it has many, many parameters that you can fill in, and I don't want to
discuss all of them. You do need to know all of them if you want to load in all of your data exactly properly, but the main ones are these: header, sep, quote, row.names, col.names, na.strings, colClasses, check.names, and stringsAsFactors. Good, so let's go through all of these.

When we're reading data, I already told you guys, header defines whether we have column names, so this is a value which can be TRUE or FALSE. We then have the quote character. Very often CSV or tab-separated files use a certain quote character. Um, NA is not... no, NA is missing. So indeed "not available", but in R we call this a missing value, which is different from NaN, which is "not a number". So yes, NA is "not available", but more or less missing, because that's how R interprets it.

All right, so the quote character is very often used when people save tab-separated files. For example, if you save a tab-separated file from Excel, it can take every entry in the column and put quotes around it. Sometimes this is a double quote, sometimes it's a single quote, and there can be different quote characters. If a quote character is used in the file, you can tell R that stuff is quoted, so don't read in the quotes, because you don't want the quotes to be part of the data; they're there to signify the beginning and the end of an entry in the file.

Then you have a parameter called stringsAsFactors, and this is a very dangerous setting, because it tells R: if you encounter a character column, turn it into a factor. The default in R (before version 4.0.0) is TRUE, and in my opinion the default should be FALSE, because factors have a very specific meaning. A factor is a statistical thing: it is male or female, or it is young, medium, or old; they are groups. Generally the data that you load in are not factors, and by setting stringsAsFactors to
TRUE, every character column that it encounters it will make a factor out of. It does this to save memory, but it leads to issues when you start doing statistics and you assume that this is a character vector. In R versions before 4.0.0 the default value is TRUE, so be aware that if you read in your character data using read.table it might automatically transform it to factors. So generally when I'm loading in files I always set this to FALSE, because I am the programmer, so I will decide when character data is going to be turned into factors. Not only that, I also always want to decide myself what the base case is. Do I want to compare females to males, or males to females? By saying stringsAsFactors = TRUE, R makes this decision for you, because the first level that it comes up with for this column (by default the first one in alphabetical order) will be the base case, and that might not be the base case that you want. This is especially interesting when you start talking about DNA data, where you have a reference genome and an alternative allele, because you always want the reference allele to be the base case and you want to report, for example, the effect size relative to this base case. So stringsAsFactors: the old default is TRUE, in my opinion the default should be FALSE, but just be aware that sometimes R tries to be smart and tries to save memory, but in the meantime it gobbles up your data and transforms it to factors while you actually want to have it as characters.

All right, so sep is the separator character used in the file. R tries to be smart when you use the read.table function: by default it splits on any white space, so spaces and tabs. But this might not always be the best choice, so you can set the separator to be, for example, a comma, or a tab, or a space. row.names we also discussed.
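The parameters discussed so far can be tried out together in a small sketch. The file content, the column names, and the variable names csvfile and mdata are made up for this illustration; the read.csv arguments themselves (header, sep, quote, row.names, na.strings, stringsAsFactors) are the ones named in the lecture.

```r
# Write a small comma-separated example file (made-up content) to a temp location.
# Missing values are encoded in three different ways: "---", "x" and ".".
csvfile <- tempfile(fileext = ".csv")
writeLines(c(
  "id,sex,weight,age",
  "m1,male,23.5,10",
  "m2,female,---,12",
  "m3,female,21.0,x",
  "m4,male,.,11"
), csvfile)

# Read it: the first line holds the column names, the first column holds the
# row names, several encodings of "missing" become NA, and character columns
# stay character instead of being turned into factors.
mdata <- read.csv(csvfile,
                  header = TRUE,
                  sep = ",",
                  quote = "\"",
                  row.names = 1,
                  na.strings = c("", ".", "x", "-", "---"),
                  stringsAsFactors = FALSE)

head(mdata)           # look at the first rows
dim(mdata)            # rows and columns that were actually read
sapply(mdata, class)  # the class R picked for each column

# Turn a column into a factor yourself and choose the base case explicitly:
# here "male" is the reference level, regardless of alphabetical order.
mdata$sex <- factor(mdata$sex, levels = c("male", "female"))
levels(mdata$sex)

# Clean up the temporary file.
unlink(csvfile)
```

Setting the levels by hand is one way to decide the base case yourself rather than leaving it to R.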
So row.names can be a single number, which tells R the column of the table that contains the row names, but you can also give a vector. If you give row.names a vector, then the vector that you supply will be used as the row names. For example, if you have a matrix stored on the hard drive and this matrix does not have row names, you can supply your own row names as a vector while you are loading in the table.

col.names is a vector giving the column names. Of course, when you provide column names, header should be set to FALSE. If header is TRUE, it will treat the first row in the file as the column names; if you set header to FALSE, you can give your own column names the same way that you give the row names. Just give it a vector, and this vector will be used.

Then we have the check.names parameter. check.names is there because when you're loading in a table, R assumes that every column name is a proper variable name. That means that a column name in R is actually not supposed to start with a number. For example, if you have a column called 10days, then 10days is not a proper variable name in R. We will get back to why check.names exists: it is because of the with function, which takes a data frame and automatically turns every column into a variable. So if you load in your data and the column names are being changed by R, then this is because of check.names. It recognizes that some names are not proper variable names in R, and then it will put an X in front of them or replace invalid characters by dots. So a really annoying parameter, but it is there because R assumes that column names can be turned into variables, just like it assumes that row names are always unique, and they
should always be unique. So that's the reason why we have check.names.

Then we have the skip parameter. Very often when you download data from a source, it starts off with some header information: this data was collected by Danny, Danny did this and this and this, and then there's all kinds of information about the data that's going to follow. Of course you need to skip this, because when you're loading in a table R assumes that the first row that it reads contains the column names. If there's stuff on top, you can skip that by saying skip = 5. Then it will ignore the first five lines of the file, the sixth line of the file will be the column names, and the seventh line will contain the first data that we are going to load in.

colClasses: this one is really, really important. colClasses defines what class we expect in the different columns. Imagine that we have a data set that looks like this. I show it in Excel because it's nicely aligned, but imagine that this is a text file. Then we need to specify for each column how we want to interpret it. So let's go through it together. The first one is called Ensembl gene ID, so this of course is going to be the row names, and row names are generally characters. These start with ENSMUSG, so these are character values which we can load in and manipulate afterwards if we wanted to, but in this case we have to tell R that these are characters. The next column here is chromosome. Chromosomes are biological objects, so they are a kind of grouping factor; they are real factors. In this case, if I load it in and set this column to be loaded as a factor, then it will look at how many unique elements it finds in this column, and then each unique element will be recoded
relative to the first level. That means that chromosome 3 in this case will be the default case, and every effect that it gives you when you do statistics will be chromosome X relative to chromosome 3, or chromosome 16 relative to chromosome 3. People normally think about chromosomes as numerical values, but that is not entirely true, because they are grouping factors, and of course we also have non-numerical chromosome names like chromosome X, Z, Y, or W; those are all valid chromosome names. So chromosome names generally are factors. Start and end positions are of course numerical values. The strand is again a factor, because you can only be on the forward strand or on the reverse strand of the DNA. We have the MGI ID, which is a character, the MGI symbol, also a character, and the MGI description, which is also a character. So in this case we would specify colClasses as character for the first column, factor for the second column, numeric for the third column, and so on. We explicitly tell R how to interpret each of the columns of the table. This is annoying to do, but it is important, because it saves you a lot of headache from R interpreting your end positions as characters. All of a sudden you want to do mathematics: you want to calculate the middle, so start plus end divided by two. If you don't set both columns to be numeric, then R might try to be smart and say: no, this is a character. And then all of a sudden you can't calculate means anymore, because you can't calculate a mean when one of the values is a character.

So reading in data is very often a trial-and-error process. The way that I do it is I open up my file in a text editor to see the content, and then it takes me multiple tries to load in the data correctly. I often use the head function to look at my data, or I do
something like this: I try reading the table, saying the separator is a comma, this is the file that I want to read, store it in mdata, and then show me the first 10 lines. The dim function is there to ask for the dimensions. Always ask for the dimensions, and check that the number of lines in the file is the same as the number of lines that are read in, because it sometimes happens that it quits halfway through because of a special character in your file or something else which R doesn't understand. When it doesn't understand, it just stops reading and you only get half of the data set loaded, and that is a very big problem. So it generally takes multiple tries to do these kinds of things, and it's just a process that you have to go through every time you get a new data set. One of the things to figure out is what people use as missing values.

All right, so when we talk about just unstructured text files, for example when you download the Bible in txt format, then you can use the readLines function to load in the text file. It loads in text files on a line-by-line basis, so it's very suitable for, for example, txt or csv files, or FASTA files, or VCF files. You can use it to load in any type of text file when there is no matrix structure in there. For example, if I just want to read the first 10 lines of a file, I can say readLines("mytext.txt", n = 10). If I want to read in everything until the end, I can say readLines with the file name that I want to read in, and then n = -1, which means: read everything to the end. When you give this function a string as the first parameter, it will interpret this as a file name, so it will open the file, read the number of lines that you want, and then close it afterwards. Normally in programming languages you have to open up a file, read from it, and then close it, but this is done
inside of the readLines function when you give it a string parameter, so when you give it a character as the first argument.

You can of course use R to read a text file line by line, because often you want to go through a file, do something with a line, and then move on to the next line, and you don't want to remember what you did with the previous line. This is very useful when you want to process one line at a time, for example FASTA files, where you have a description, then the DNA code, then again a description, more DNA sequence, more description, more DNA sequence. These files can be huge, gigabytes at a time. If you just say readLines of my big FASTA file with n = -1, it will load in data for like 15 minutes and then it will stop and say: I am out of memory. To prevent this, it helps that big amounts of data are often stored in a row-wise fashion: things like sequencing reads, or the sequences of genes, or single nucleotide polymorphisms are all stored in massive lists which contain hundreds of thousands of entries. So readLines can be used.

Question: "How to check the data size outside R, if you want to know whether R somehow chops off some data?" In many operating systems you can just ask how many lines there are in a file with the wc command, so word count. Of course you can also open up your file in a text editor and look there how many lines there are. This sometimes is not possible, because many text editors also try to load in the whole file, which is a little bit silly because you only have to load in what you show to the user. So to check the data size, you need to make sure that the number of columns that R reads in is the same as the number of columns in your text file, and furthermore that the number of lines it reads in, so the number of rows, is the same as the number of lines in your text file. So in Linux you
can use something like wc -l and then your file name, and this will tell you how many lines there are in the file. This also works for very, very big files. I think it's the same in Windows; I think Windows also has a wc, word count. In this case wc -l means: don't count words but count lines, and then the name of the file.

In R you can actually read in big files line by line to make sure that you don't run out of memory. You might have a gigabyte file on your hard drive and want to process every line one by one, because every line is a sequencing read that you, for example, want to align to a genome. So how can we do that? By using something called a connection. This is relatively advanced. I'm giving you this code now, in lecture number three, because you are going to need it in the future. In the current assignments you're not going to need connections, because the files that we are going to load in are relatively small, but imagine that some of these files can be gigabytes big.

When you make a connection, you have to tell R how you're going to use the connection: are you going to read from the connection ("r"), write to the connection ("w"), or append ("a"), so add to the bottom of the file? "r+" means open a file for reading, but allow me to write to it as well. "w+" means open a file for writing, but allow me to read from it as well, and "a+" means open a file for appending, but also allow me to read from it. These three all allow you to both read from and write to a file, although "w+" empties the file first; it's just that R will optimize reading, writing, or appending for you. So of course, if you open up a connection to a file in "r+" mode and then the only thing that you're going to do is writing to it, then you should have used "w+"
because that would have been much faster. So specifying "r+", "w+", or "a+" is just an optimization; it's a hint to R, so that R knows what you want to do with the file.

So if you want to read through a big text file: the first thing is that I want to track the line number that I'm currently on, so line.n = 1 means that we're starting at the first line. Then I open up my connection, and opening up a connection is done using the file function. The file function takes the name of the file that you want to open, and the second parameter is how you want to open it. Here I'm saying: open this file in read-only mode. It gives you back a connection object, which we store in Tfile. You can give it any variable name that you want, but in this case we call it Tfile, for text file.

Then there's this magic incantation: while the length of line, which is readLines(Tfile, n = 1), is greater than zero. What this does is call the readLines function on the connection, reading a single line and storing it in the variable line. Then we ask if the length of this line is greater than zero, because if it is, we have read a line of the file. The only time that the length will be zero is when we are at the end of the file. So this just reads a line into the line variable and checks that we're not at the end of the file. Inside the loop we just cat this line to the screen, and then we do line.n = line.n + 1. So this prints the whole file to the screen and in the meantime keeps track of how many lines it has read. And of course, because we're now
using connections, after we're done reading from the file we have to close the file. This is very important, because otherwise R will issue a warning saying that you forgot to close your file. It's not the end of the world to forget to close your file, but remember that only a limited number of files can be open at a single time in R. If you keep opening files without closing them, R will all of a sudden give you an error saying: I cannot open this file because I'm out of file descriptors, out of connections more or less. So this is just reading through a file line by line. There will be an example, but just copy-paste the code and edit what you want to do, because here you now have access to the line, so you can do anything that you want. For example, you can split the line, or interpret the line in a certain way, or take the first 20 characters and forget about the rest. That's something that you do within the while loop, and this magic incantation just means: read a single line, put it in the line variable, and check that we're not at the end of the file. If so, do this, and otherwise close the file.

Good, so I told you that R can also read archived data. We can directly use a .tar.gz file, so a gzipped tar file, or we can use a zip file itself as a connection. That allows us to read directly from the archive, so we don't have to go into Explorer, right-click, extract here, and use a lot of disk space. This is really useful because it saves a lot of disk space: if you have a very big text file and you want to use it in R, just zip it first, because it will probably save you something like 80 percent of your disk space. So how do we do this? Reading a gz file is the same as reading a text file, but now we have to use the gzfile function. If we have a file which is gzipped, which is a compression format, then we can just make a new connection saying gzfile, the name of the file, open it for reading, and call it Tfile.
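The line-by-line loop and its gzfile variant can be sketched as follows. This is a hedged illustration, not the code from the slides: the example file content, the helper countLines, and the variable names are made up here, and the counter starts at zero so that it ends up holding the number of lines read. The calls file, gzfile, readLines, cat, and close are the base R functions named in the lecture.

```r
# Create a small example text file and a gzipped copy (made-up FASTA-like content).
txtfile <- tempfile(fileext = ".txt")
writeLines(c(">seq1 description", "ACGTACGT", ">seq2 description", "TTGACA"), txtfile)
gzname <- paste0(txtfile, ".gz")
out <- gzfile(gzname, "w")          # open a gzip connection for writing
writeLines(readLines(txtfile), out)
close(out)

# Walk through a connection one line at a time and return the number of lines seen.
# Only a single line is held in memory at any moment.
countLines <- function(con) {
  line.n <- 0
  # Read one line; a result of length zero means we reached the end of the file.
  while (length(line <- readLines(con, n = 1)) > 0) {
    line.n <- line.n + 1
    # Do something with the line; here: print the first 20 characters.
    cat(line.n, ":", substr(line, 1, 20), "\n")
  }
  close(con)  # always close the connection when done
  line.n
}

n.plain <- countLines(file(txtfile, "r"))   # plain text file, read-only mode
n.gz    <- countLines(gzfile(gzname, "r"))  # gzipped file, exactly the same loop

# Clean up the temporary files.
unlink(c(txtfile, gzname))
```

Wrapping the loop in a function that takes a connection makes the point of this part of the lecture visible: the loop itself does not change, only the function that creates the connection does.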
We have the same magic incantation, just going through every line one by one, and then we close it at the end. So exactly the same as before; the only thing that is different is that instead of using file we now use gzfile.

We can do the same thing with the web. If we want to go to, for example, google.com and read what the Google web server sends back to us, then we can read directly from websites using the url function. Again, url gives a connection, so we say url and where we want to go. In this case we cannot specify write or append, because you can only read from a URL. Then we can just say readLines(myurl, n = 1), and this will read the first line. You can also use n = -1, and that will read all of the text on the website and put it in line, or in your variable. Of course, here it is very important to close the URL. Also, here you have to be careful not to get yourself banned from Google or Yahoo or Facebook. Read the terms of service of the website that you are trying to connect to, and make sure that it is allowed to connect to their web server automatically, with a program. The people at Facebook are not going to be happy if you make thousands of connections to their website, and it will result in a ban. So read the terms of service of the website that you are connecting to. It's not considered hacking, but it's considered very bad practice to use a for loop to read like 5,000 Facebook profiles at once. Even if the profiles are public, you're not allowed to do that, and they will ban your IP address and won't let you back in. So don't do this. Actually, don't connect to google.com directly; they have an API for that, so you have to connect to https://api.google.com if you want to do that. So don't execute this. Well, you can execute the code once or twice; they won't ban you for one or two connections, but if you start making hundreds of
connections, Google will be mad at you and they will not allow you to search on their web server for at least a week.

So if you have your data in Excel, then you're kind of screwed, because Excel is a very, very bad program to store data, unless it's financial data. Excel likes to eat your data, and eating up real biological data, or data that you collected during an experiment, is bad; apparently when you are working in the financial sector no one really cares about that. The big problem with Excel files is that there is no native support for reading Excel files in R, and this is because the Excel format is copyrighted by Microsoft, so you are not allowed to have an open-source program like R which has an Excel reader and writer.

"We have 90% of our data stored in Excel." Yeah, don't do that. Excel is really, really bad. "What do I do?" Save it as text files or put it in a database. There are many, many Nature publications where they claim that a gene called October 19th is the causal gene for their phenotype. There is actually a gene called Oct9 or Oct19, and that's the gene name, but as soon as Excel sees Oct19 it directly changes it to the 19th of October, because it interprets it as a date, and that's just bad. There are many gene names which look like dates; there's also a gene, I think, called Jan2. "It's common practice in social and political sciences, unfortunately." Yeah, yeah, because if you want to lie with your data you put it in Excel, right? And in theory social and political sciences are not the hardest sciences that there are. But no, Excel files are not bad per se; it's just very, very annoying in biology, because there are a lot of gene names which look like dates and Excel will automatically eat them. There are YouTube videos and even scientific papers written about why you should not use Excel for scientific data. But for some data it's perfectly fine. I use
Excel to keep track of who registered for the course, because there it doesn't really matter: none of your names are interpretable as dates, and it can't really do anything weird with your student ID or your Gmail account or these kinds of things.

"So let's say I get an Excel data set. How do I transform it before entering it into R?" Well, you don't have to, because there are two libraries available for R that you can use to load your Excel file directly into R. But generally when I get an Excel file, the first thing that I do... I don't have that open at the moment. Let me add a window capture. Oh, let me make sure that I close your email addresses so that they're not being transmitted on stream. So let me add a window capture for Excel. Here we have an empty Excel file, and let me make it a little bit smaller. Oh, that's the wrong window that I'm dragging; you should never do a live demo. So I take my Excel file and make it a little bit smaller so it doesn't cover my face. Let's open up an Excel file, for example one that one of my colleagues made. That one was on an external drive that's not connected anymore. Ah, something like this. No, don't save this. So here we have a file which contains data in Excel. Generally, the first thing that I do when I get a file like this is: I click on a value, I do Ctrl+A, I do Ctrl+C, then I open Notepad++, open up a new file, disable my window capture, go to my file, press Ctrl+V, and then I just save it as mydata.txt. Problem solved: the data is out of Excel. Excel can't eat my data anymore at this point; it can't be smart, it can't do anything. It's now saved in a text file. I do have to
check that it didn't change any of the gene names. I have to really make sure of that. In this case there's no Oct gene name here, nothing that would be interpretable as a date, but let me see if I can actually force this. If I go to the Excel file again and say File, New... it switches to a new workbook, and now it's starting to crash, trying to do too many smart things at the same time. But now, say I just have a list of gene names, so I'm going to copy some gene names out of the file that we just saw and put them in Excel. Now I add my gene called Oct9 and press Enter. You see what it does? I typed Oct9, and all of a sudden it ate my data. It would be the same if I typed Jan6 ("jan zes"), which would be a perfect gene name; there might be a gene called Jan6. And people have published in Nature about 9-Oct, and that's not a gene that exists. In Excel you can try to fix that: right-click, Format Cells, force it to be a text cell, and press OK. And now you see it ate my data again, because what I typed in was Oct9, but all of a sudden Excel says: no, Oct9 is 44843. So it transforms it to a timestamp, and you don't want that. You don't want saving your file to change your data. So long-term data storage should not happen in Excel. Excel tries to be smart, it tries to help you, but 99 percent of the time it just screws you over, and it shouldn't do that. So don't use Excel for long-term data storage; use a text file like this, and if it's very, very big,
zip it, make it a zip file. Good, so that was just an example of why Excel is really dangerous. There are many publications out there, very good publications, where at some point in the process they used Excel and Excel ate up their gene names. Excel is famous for eating up your data.

If you do want to read Excel files directly into R, you can use one of the two packages, xlsx or openxlsx. Both of these packages are slow and relatively unreliable. Like I say, I always export Excel files as CSV: just open them up, Ctrl+A, Ctrl+C, go to a text editor, paste it in. If you really want to, you can use these packages to read data directly from Excel.

"I guess you can simply set in Excel not to convert the data." Yeah, that would work for me, but then when I send the Excel file to someone else who does not do that and then sends it back to me, how am I supposed to know that at line 10,000 it all of a sudden ate up a gene name? You can do everything right, but still someone else can screw you over. You can open up the file on your home computer where you just forgot to set it. You can get a new computer from the MediaMarkt, open up your file, and all of a sudden your data changed. That should never happen; science is about reproducible research. So please, please stay far away from Excel. It won't screw you over right away, but it will at some point, and you don't want to have your name on a publication where you claim that this gene 446189 is the gene, and then everyone says: you're a jackass, that gene's called Oct9, it's not called 44683 or something.

Anyway, you can use the read.xlsx function, for example from the xlsx library, to read in an Excel file. Of course you then have to specify which sheet you want to load in, because read.xlsx gives you back a single table. There's also a write.xlsx function, so if you really want to live dangerously you can try to write an Excel file directly from R, which then, when you open it up
in Excel, definitely, Excel will start eating data again, because that's just what Excel does. And that's why Excel is perfect for financial data, because all of a sudden having a date like Oct 9 transformed into 40,000 just means that you're rich. That's just the way it is.
Anyway, let's do a quick break, and then we're going to do Obama and we're going to read binary files. We're going to use R to load in BMP files, show BMP files as a plot, and extract BMP files as a matrix, so that we can manipulate the different color components and do some fun stuff with R. This is very advanced and you're not going to use it; it's just for having fun. That's what we're here for: we're here to learn how to program, but we're also here to learn that programming is fun and creative, and that you can have Obama morph into a Flappy Bird, for example, or have Obama morph into Trump and then back, just by taking images and doing manipulations on them.
So binary files are next. I'm going to take a short break. Goats for break number one, which means that the next one is going to be, and now I have to think really, really hard... I have no idea what the next break is going to be. I really need some coffee. So I will see you guys in around 10 minutes. Enjoy the animated GIFs, enjoy the music, and I'll be back in like 10 minutes with, I hope, a little bit of coffee so that I can focus a little bit more. Thank you guys for still being here. We'll be back in 10 minutes and then we will continue with the fun stuff: reading binary files and working with BMP images in R. Let me start some music. I'm going to do barn music and then we're going to switch to the second break. See you guys on the flip side.
Back in time. This one was a little bit longer, so I hope everyone had a good break and got at least a little bit of coffee, a little bit of sugar. So, last part of the lecture. So, um,
like I told you guys, binary files. There are a lot of different binary files, like executable files and DLLs, but for the example we're just going to use a basic little BMP file. It's to show you guys what's possible; you're not going to use it a lot, but sometimes you just want to, and it's just fun.
So you can use the readBin function. readBin loads binary files like images, and you can for example say readBin on my BMP with n = 1, and then it will load the first byte of the BMP file. If you want to read the whole file you need the size of the file, which you can get with the file.info function. You give file.info the name of the file that you want to load in, and from the result you select the size. The file info also contains things like when the file was last modified and when it was created, but in this case we want the size, and then we can just say n equals this size and it will load in the whole thing.
You have to tell R how you want to load in the binary file. This has to do with R's type system: there are many different types in R, many more than we discussed during the first two lectures. readBin has a parameter called what, and this controls how the file is loaded. You can say what is numeric, double, integer, int, logical, complex, character or raw. In many cases we are just going to use raw, because all of the other ones interpret your binary data in a certain way and we don't want that. We don't want R to act like Excel and do conversions for us; we just want to know the raw bytes that are in the file.
So, about BMP images. BMP images are a two-dimensional array of pixels. Here, for example, we see a little image. This image has 12 pixels; the first one is greenish, the sixth one is blue. It is a two-dimensional array because this BMP image is six pixels by two pixels, so it's six pixels wide, two pixels high. Of
course, on your hard drive this file is a linear sequence of bytes. It's not a two-dimensional sequence or anything magical; it's nothing more than the start of the file and then all of the bytes that follow it. Every BMP file comes with a header; for the files we use here this header is 54 bytes long. It doesn't so much tell you the name of the file, but it tells the operating system that this is a BMP file, and it tells it the dimensions of the image. After that the BMP file continues with three bytes for each pixel: the first byte is how blue the pixel is, the second one is how green it is, and the third one is how red it is. So instead of an RGB color scheme, BMP images use a BGR color scheme.
This is more or less how it looks on the hard drive. The first 54 bytes are the header of the file, and after that we have the color code of the first pixel at positions 55, 56 and 57: at 55 we find the blue component of the first pixel, at 56 the green component, at 57 the red component, and so on. That's how BMP files are stored on your hard drive, and this is more or less how you should view them.
So, for example, during the assignments we will be loading in an image. We just take the name of the file and store it in image.file, so that I can reuse the name; the name is pretty long and the variable name is shorter. I'm first going to ask for the file info so that I can get the size, and I'm going to store that in my image.info. The next step is to read in the whole file. I'm going to say readBin, the image file (so the name of the file), then n is the size of the file, which I give as a numerical value; that is the number of bytes that I want to load in. And then what: how am I going to load it in? I'm going to say give me the raw bytes, so give me the raw numbers in the file. If I then store this in
something called myimage.data, and I just type myimage.data and let it run all the way down the screen in R, then you can see that it looks like this. At position 97451 we have e4, which is a hexadecimal code. That means that instead of counting from 1 to 10, the computer internally counts in zeros and ones, which are then summarized into hexadecimals, where you count in a base-16 system. It doesn't matter too much, and we will get back to how that works. The data is nothing more than one big vector, and every position in the vector contains a number stored as a hexadecimal value.
This is the image that we're going to use during the assignments: a 200 pixel by 200 pixel image of Obama. The first thing that we need to do before we can do anything with this image is to remove the header, because the BMP header does not contain any data on colors or on pixels. How are we going to do that? We take myimage.data, and since it is a vector we can select from it. We just say minus c(1:54), so throw away the first 54 entries in the vector, and I store the result as myimage.colordata. So I have a vector from one all the way to the end, and I'm just saying: throw away the first 54 bytes. Now the blue component of the first pixel is stored at entry number one, entry number two is the green component of the first pixel, and entry number three is the red component of the first pixel. It's just removing part of this vector.
After we've done that, we are for example interested in selecting one of the color components out of the figure. If we want to generate this figure in R, we can say: I'm interested in the blue component, so I want to extract the amount of blue at each pixel. So how can I do this? Well, we
already saw this; it's relatively easy. We create a sequence from 1 to the length of myimage.colordata, so from one to the end of the image, and I'm just going to step by 3. So now I create a sequence which contains 1, 4, 7, 10, 13 and so on. I make an index vector which every time points at the blue component of a pixel. Then I take myimage.colordata and index it with the sequence that I just generated, so I select the blue component for each of the pixels. Then I say as.numeric, so instead of the hexadecimal values, convert them to numerical values, and I put them in a matrix which is 200 by 200, the size of the original image. I store this in a variable called blue, because it's the blue color component.
Now if I want to recreate this image in R, the only thing that I have to do is use the image function. I give it the matrix that I just created, and it will create a plot which looks like this. Here you can see that the red color here at the side of Obama actually has no real blue elements in there; that's why it's yellow, and yellow means that the value is very low. The blue elements are colored in red, and red means intense. The color scale that I'm using here goes from kind of whitish all the way to red, red being the highest and white being the lowest. So you can see that there's almost no blue in this area, not a lot of blue in this area, and a lot of blue here, but also in the blue jacket.
So that's how you use the image function in R to more or less get images into R and select the color component that you want. We can also do this, for example, for the red color component, which would mean that we create a sequence from 3 to the length of the image, stepping by 3.
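The steps described above (read the raw bytes, drop the 54-byte header, pick every third byte) can be sketched as follows. This is only a minimal sketch under stated assumptions: instead of the Obama image it builds a tiny hypothetical 4 by 2 pixel stand-in file with a placeholder header of 54 zero bytes, so the variable names follow the lecture but the file and its pixel values are made up for illustration. It also ignores how rows are ordered inside a real BMP, so a plotted result may appear flipped or rotated.

```r
# Build a tiny stand-in "BMP": 54 placeholder header bytes + 8 pixels in BGR order.
# (Hypothetical test file; a real BMP header holds size and dimension fields.)
image.file <- tempfile(fileext = ".bmp")
header <- as.raw(rep(0, 54))
pixels <- as.raw(c(
  255, 0, 0,    0, 255, 0,    0, 0, 255,   10, 20, 30,
   40, 50, 60,  70, 80, 90,   1, 2, 3,      4, 5, 6
))
writeBin(c(header, pixels), image.file)

# Ask for the file size, then read the whole file as raw bytes
image.size <- file.info(image.file)$size
myimage.data <- readBin(image.file, what = "raw", n = image.size)

# Throw away the first 54 bytes (the header)
myimage.colordata <- myimage.data[-c(1:54)]

# Every third byte starting at 1 is blue; starting at 3 is red
blue.idx <- seq(1, length(myimage.colordata), 3)
red.idx  <- seq(3, length(myimage.colordata), 3)
blue <- matrix(as.numeric(myimage.colordata[blue.idx]), 4, 2)
red  <- matrix(as.numeric(myimage.colordata[red.idx]), 4, 2)

# image(blue)  # uncomment to show the blue component as a plot
```

For the 200 by 200 image from the lecture, the same code would apply with the real file name and matrix dimensions of 200 and 200.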
Same system. Good. So this was normally the break. Here we've seen several functions to load data into R: we can use the data function to get data from R packages, we can use read.table and read.csv, we can use the readLines function to read in text files, either wholly in one go or line by line, and we can use the readBin function to read binary files. But of course then the question becomes: what do we do with the data afterwards? We can color it and we can do something like this where we just extract one of the components, but generally when we load in data as a matrix we then want to do manipulations on it. So how do we do these manipulations?
One of the things that I use a lot in R is the %in% operator. It allows you to filter; it allows you to ask which elements of a are also in b. Imagine that I have two matrices loaded into R, one called a and the other called b, and both matrices have a column called id. Then I can match these two together: I take the id column of a, %in%, the id column of b, so which ones in a are also in b. And then I can make a subset of a, a smaller matrix where I only take the rows of a which have an id that is also located in b. I can do it the other way around as well: I can ask which elements of b are in a, and then subset b using the vector that I just created.
So %in% just gives you back a TRUE/FALSE vector. For every entry in a it will tell you if this entry is also found in b. This is very efficient: you can do this on millions and millions of entries and it will run in a very reasonable amount of time. It's much better than using a for loop, taking the first entry of a, checking all the entries of b, and then setting it to true.
So which, we already saw it: it transforms a logical vector into a numeric one, so it tells you
the indexes which are true. It doesn't do the same thing as %in%: %in% gives you back this TRUE/FALSE vector, but you generally combine it with which to get the rows of the matrix a which are also in b. So if I have a vector which is TRUE, FALSE, TRUE, TRUE, FALSE, and I ask which on this vector, it will tell me that one, three and four are true. So generally what you can do is say which(a %in% b), these are the indexes that I want, and then a[indexes] will do the same thing as using a %in% b directly. It's just more clear that you are using the indexes when you use which. So the which function is really useful, because it transforms a logical vector into a numeric vector and shows you which elements were true in the original vector.
You can also make subsets in this way: you can subset a matrix or a vector by logical vectors. For example, say you want to take all columns containing a value higher than six. Imagine that you have a big numeric matrix and you're interested in which columns have a value higher than six. Then I first make a selection vector myself; I call it selection, and initially it is all FALSE. So for every column in a I'm just going to say FALSE, FALSE, FALSE, FALSE, because initially I don't know which of the columns contains a value higher than six. Then I do a very basic for loop saying for x in 1 to the number of columns of a, and I check if any of the values in this column is higher than six. If this is the case, I set my selection at this position to TRUE. When this is done running I have my selection vector, and then I can say: from matrix a, only take the columns which were true. Or I can use which(selection) to select them by index, doing the exact same thing; it's just that the which function is a little bit clearer. So you can do it like
this. So that's just a little example of how you can build your own logical vectors to select from your matrix and make a sub-matrix. Of course you don't have to go through the columns; you can also go through the rows and ask questions about the rows, but generally you want to keep columns. This is very useful if you want to remove columns which have missing values, or if you say: keep all of the columns where there are fewer than 10 missing data points. These kinds of questions you can answer with a very basic for loop. Again we use a very common idiom in R, where we first make an empty vector, or a vector which contains only FALSE, then select the corresponding elements by setting them to TRUE, after which we can use the selection vector to make a subset of a bigger matrix.
You can also use the subset function. The subset function is there to create subsets of matrices and data frames. I don't use it a lot, but I know a lot of people that do. For example, here we have the airquality data set, one of these famous data sets in R which a lot of tutorials use. It has six columns: Ozone, Solar.R, Wind, Temp, Month and Day. So over a few months they've measured the ozone concentration, the solar radiation, the wind speed and the temperature, and these are the different columns of this airquality data frame. We can say data(airquality) to load it, and then we can say subset airquality, select the entries where the temperature is above 80 degrees Fahrenheit, and I only want the columns Ozone and Temp. You can also use it to say: give me all of the rows where Day is equal to one, and select everything but Temp, so minus Temp. That throws away the temperature column and gives me all the other columns, like Day, Ozone and Wind. So it
allows you to also do a negative, so an inverse selection, by saying don't give me this column, and then it will give you all the other ones. You don't have to have a row selection either. If you just want to select multiple columns, you can leave the middle part out, so you don't need a condition on a certain value in a certain column. You can just say subset airquality, select is Ozone through Wind, and now it will take all of the columns starting with the Ozone column and ending at the Wind column. But then you have to name the select argument.
Good. So those are some ways of getting big data sets into R and making subsets. We will practice this during the assignment; I wanted to show you %in% and which.
So now we can load our data from a file, we can manipulate it, we can make a subset, and then of course we want to write it out. For writing our data into a text file or into a comma-separated file there is write.table. If you have your matrix or your data frame and you want to save it to a file, you can use write.table. It again has a lot of options. The options I always use are these: write.table, give it the file name, the separator is set to tab, row.names is FALSE, quote is FALSE. This allows me to just take the file and drag it into Excel, and it will directly load the file in a proper way. Of course it will start eating up some of my data, because if there's Oct9 as a gene name in one of the columns it will transform it. But a lot of the people that I work with don't really work with text files; they do want to see their data in Excel. So then you have to make such files, and this is just the easiest way to do it, because I can open up an empty Excel document, drag the file into Excel, and Excel will understand the structure of the file and load it in properly. So: don't use
row names, don't use quoting, set the separator to tab, and then just write a certain matrix into your file.
There are multiple ways of saving data. One of the things that we already saw was the cat function. The cat function is there to save data into a log file, but I also use it a lot when I'm going through all of the columns or all of the rows of a big matrix. I often have a little progress report in R saying that I have done one out of a hundred, two out of a hundred, so that I can estimate how long the whole computation is going to take. Sometimes you write code that runs for a long time, and then you want to know: can I sit here behind my computer and wait until it finishes, or is it time to go home and come back tomorrow?
I always use the following system. I write a for loop where I say for x in 1 to the number of columns of the data, I do my computation here, and then at the end of the for loop I have this one line saying cat("Done", x, "/", ncol(bigdata), "\n"). So every time it finishes processing one column, it will cat this to the screen. Of course I can add a file argument and append = TRUE, and then it will write it to a file. I use this a lot for log files: you can write these progress reports to a log file by saying append is TRUE and file is log.txt. So you just say cat the message, send it to this file, and append it to the bottom of the file. If I want to empty the log file, I can cat nothing, so an empty string, into this file, and this will clear the whole file.
This is also really useful when you want to build a computation which you can continue later on. You make an empty result file and then you just start computing. Every time you have one row of the computation done, you save it to the file. And then if the power goes out, or your computer crashes, or a tornado
happens and destroys your computer, well, of course then it's really hard to continue unless you're saving to cloud storage or something. But in theory you can use this to continue computations later on. So the cat function is really versatile: you can print to the screen, you can print to files, and by saying append is TRUE it will just add to the file after the lines that are already there.
So how do you do a continued analysis? You store computations as you go. Here we have a little example. I first empty a file: I have a temp file which I empty. I only do this once, or when I want to reset the computation. Then I generate a big data matrix; in this case I have 10,000 rows and a thousand columns, so some big data. I create an empty result matrix called tmp, a matrix containing NAs with no rows and no columns. Then, if the temp file exists, I'm going to read it from the hard drive, because in temp.txt I'm going to save every row after I did my computation. If the file exists I load from this file; if the file doesn't exist it shouldn't load from the file, because I didn't do anything yet. So I say: if the file exists, read this file using the separator and put it in the tmp variable.
So how does this work? Now we have either loaded the file or not, and then we can say for x in the maximum of 1 and nrow(tmp) + 1, so continue from where we left off. If I've already got a thousand rows in my file, I know I need to continue at row 1001, so nrow(tmp) + 1, or 1 when 1 is higher than this, because the number of rows of tmp can also be zero; the file could be empty. So I'm just putting my starting point where we left off, and I continue until the number of rows of the big data. Then here I have my analysis code, which might run for like half an
hour for each of the rows, and every time I have my result I write the result to this temp.txt file. I paste x to the results, because x is the row that I'm currently looking at, take my results, separate them by tab, put a newline at the end, write this to the file and append it at the bottom. And of course I also print my progress, saying that I'm done with row number x. This will just fill up the file one row at a time, and if something happens, like the power goes out at line a thousand, I can start again later on at line a thousand and one.
Good. The last section is biomaRt. If I need my data in R, I can manually search and create an Excel file, which is a lot of manual slave labor and is very error prone, because Excel can mess up my data and I can make a copy-paste error. So that's generally not the way you want to collect large amounts of data. Another way to get your data: in most cases, if you think about biological databases or big databases with financial data, they also provide a bulk download, so they have an FTP site. Of course there's less chance of errors then, but the problem is that if I'm reading data from, for example, Google Finance and from Yahoo Finance or Microsoft, then all of these bulk data sets have different formats. So I need to harmonize the formats before I can do anything with this data, and the same holds for a lot of biological databases.
If I want to get data from most biological databases, I can use biomaRt. biomaRt is the preferred way to retrieve data directly into R. So if I'm interested in, for example, a certain gene in, say, C. elegans, or in a certain gene in Drosophila or mouse or humans, I can directly download all of the data that is known about this gene from BioMart in a structured way, without having to go to the
website and look it up, or go to the FTP site and download all of the genes in the mouse genome. BioMart is a community-driven project, driven by the biological community and paid for by different universities. It promotes unified access to distributed research data to facilitate the scientific discovery process, and it connects most, if not all, biologically relevant databases. That means that things like KEGG, Ensembl, UCSC and the big dbSNP, which contains variation data, all have a BioMart API. They have a BioMart plugin which can be queried, and you get data back in a certain structure, and you can define what kind of structure you want. There are also different APIs: you can use it from R, but also from Perl and Python; you can even use it through SOAP, REST and XML if you wanted to. We're just going to show you some examples in R, but be aware that if you program in Python or in Perl you can also use BioMart to directly download data from biological databases.
There are three things that you have to know about biomaRt. It functions by having something which is called a mart. A mart is a link to a database, for example the SNP database in mouse, or the gene database in humans, or the variation database in Drosophila. So there are different marts that you can choose from. If you just want to see what's available, you can type listMarts. Of course you have to install the biomaRt package first, and I'm going to show you that later. By listing the marts you get an overview of which databases you can connect to and which data sets are available.
If we then want to query things, we need to know what we can ask of the database. These are called attributes; attributes are things that we can retrieve from the database. So if I connected to a mart, if I made a connection, then afterwards I
can use my mart to list all of the attributes. These are things that I can query, for example gene names, gene identifiers, start location, stop location, chromosome, orthology.
A question from chat: "Do you also have other suggestions like BioMart, but for social and political sciences?" That is a question I have to take some time for. There probably is something like that, but I don't know if it would be that common, because in biology there's this big concept that genes are shared between different individuals, and I don't know if there's something that big in social and political sciences. BioMart is literally a project where hundreds of universities are involved to make sure that everything can connect together and talk to each other. But I will google around. There might be APIs for social and political sciences to download, for example, voting results across the whole of Europe. I would imagine that within the European Union the different elections in the different countries will all have an API which allows you to query data. But BioMart is very optimized for biological data, because there are like 20,000 genes in a human genome, and all of these 20,000 genes are also in mouse, but at different positions in the genome, with different lengths and different variants available. I will google and get back to you about that question, because there probably is something like it for political sciences as well.
So we have a mart, which is a database connection; we have attributes, which are things that we can retrieve; and then we have to tell the database what we are going to query by. I can send the database a list of gene names, but I can also send it a list of locations, or a list of descriptions, and then it will use the description to find the corresponding gene. So I need
to tell the database what my value means, and what my value means is called a filter. So there are three things that I need to specify: which database do I want to connect to, what do I want to retrieve, and what do the values that I am providing mean.
So how do we query? As a little example, we first have to install the package. The problem with the biomaRt package is that it's not available on CRAN, the standard R repository; it's in Bioconductor. Bioconductor is a package repository for R aimed at bioinformatics. I can connect to it by using this magical incantation: if the Bioconductor installer is not installed, install it, and once it is installed, use it to install the biomaRt package.
Once I've installed biomaRt I can load its functions, and here for example I can say: connect to the SNP database from Ensembl, and connect to the mouse SNPs, so to variants in the mouse genome. This is called my mart. So I say useMart: use this mart, use this data set from this database, and then this is my database connection. Then I can do the query. I say getBM, get from BioMart, and these are the attributes that I want to retrieve: the reference SNP ID; the allele, so whether it's an A/C SNP or a G/T variant or a T/C variant; which chromosome it is located on; and the start position, which is also the end position, because SNPs are single nucleotide polymorphisms. Those are positions in the DNA where, in the population, some individuals have a G and other individuals have a T, for example. The filter that I'm going to use is the SNP filter, which just means I'm going to specify SNPs by their official SNP ID. Then here I provide the value, so this is what I'm going to query for: the ID of a SNP that I might be interested in. And then of course I have to specify which mart
I want to use; in this case, query it from the SNP database.
Good. So that's a very, very short introduction to biomaRt. There are a lot of different databases and you can list them all. You can list all of the attributes; generally these databases have thousands of attributes, and they have hundreds of different filters that you can use to get the data that you want. But for today I think this is all. There was one question about BioMart, which of course is not that useful if you're doing political or social sciences, but it's just to show you how to use an API. Like I said, many different fields have different APIs. I bet that there is an API for election results within the European Union, so that you can compare voting behavior in France with Austria and these kinds of things.
Good. So we actually made very good time. It's 4:47, so it's like a quarter to five. Even with doing all of the assignments with you guys one by one, me typing them in and explaining why I'm doing what I'm doing, it went really well. We still have 13 people watching, which is really good; it means that we only lost like one third of the people that signed up for the course, which is not too bad. I know that the beginning is hard, and it's something that I can't really change. Programming is hard; you have to spend the time to get familiar with it.
"Could you explain again what an API is?" Okay. API stands for application programming interface. An application programming interface is more or less a special website which is not supposed to be visited by humans; it's supposed to be visited by computers. You have, for example, the Google API. The Google API allows you to write a program which queries data from, for example, YouTube, like the comments that you are making on the stream. Using this API the computer can retrieve this data in a computer-readable format and then, for example, save it to my hard drive or display it on
my window here. I am using OBS, so I can use overlays to show information on the window if I wanted to. So an API is very broadly defined as a website, an endpoint, where you provide specific queries, and these queries result in data being collected from the database, which is then presented in a way that a computer can read. Generally APIs provide JSON data.
Let me see if I can find an example. There is the Google Finance API, which I used a lot in the past, but I don't know if it's still online. Oh, it actually went offline. Let me see if there's a nice example of an API. Ah, for example weather data. A lot of weather forecasters and weather companies have an API which provides data on the weather. You provide the location, and they provide a very short computer-readable description: it's sunny, this is the temperature in degrees. YouTube has an API which you can use to embed videos in your website using JavaScript. So it's just a way of connecting to something which is available online automatically, without a human having to go to a website. There are many different APIs out there. Google Maps has one as well, which you can use to load a map with custom markers: you just provide the markers and where you want to look. So yeah, there are many different APIs.
All right, and this is going to remove... very good, very good. So does that answer your question, Leonardo, or do you want to have a real API example? I could make an example for you next week, because I hadn't prepared an API example. It just stands for application programming interface, which means that you can write an application which automatically queries data from a database that you don't control. Perfect. All right, so there are no further questions. "Would be cool." Okay, then I will look for an API on
political sciences, and then we're going to do an example on that if I find one. Political and social sciences; let me write that down so that I don't forget. That's not the one that I wanted to use, I want to use... ah, I can't reach my piece of paper. All right, there we are again. So: API example. And if I don't find one, then I will just use a Google Maps example, because I did use the Google Maps API on my website.

All right, if that's everything, then guys, thank you so much for being here. I'm still in discussions to get a room, so we can do the assignments in person. I don't know exactly what's going on with the Verwaltung and why they are ignoring my emails; it's weird. But the big issue is that I had to submit all of the paperwork for the course in like November of last year, and then we didn't have the option to do it in person, so I never got assigned a room. And now apparently everyone already got a room except for us, and we're at the back of the queue. And since in theory we are going to be as many as 35 people, we need a big room as well, which is a little bit of a difficulty, because I think most of the big rooms are taken. But I would really love to be able to see you guys in person and help you guys directly by watching over you while you program, and being able to directly help you and touch your keyboard when necessary.

All right, if there are no further questions, no further remarks, then thank you guys so much for being here. I actually made a finish screen, but I think it will mute my microphone and audio when I go to the finish screen, so I first want to thank you guys so much for being here, and thank you for all the questions. The more questions we have, the more fun it is. I hope to see you all next week, I hope in person, and I hope that I get a response before the end of the week so that I can mail you guys. But it might be that I mail you next week on Wednesday
evening saying that we can do it in person. If that's the case, then that's the case. If you don't hear anything, then assume that it's going to be online, but I do hope that we can go and have a good time in person. So see you guys next week. Thank you guys for being here, thank you for liking the stream and all of the things, and then, yeah, see you on the flip side. All right, bye.
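
As a rough illustration of the API explanation given in answer to the question above (this was not shown in the lecture), here is a minimal Python sketch of the weather example: a program builds a query for a location and reads back a computer-readable JSON answer. The endpoint `api.example.com`, the parameter names, and the response fields are all hypothetical and only stand in for whatever a real weather service would document. No network request is made; the response is a made-up string.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration; not a real weather service.
BASE_URL = "https://api.example.com/weather"


def build_query_url(location: str, units: str = "metric") -> str:
    """Encode the query parameters into the URL a program would request."""
    return BASE_URL + "?" + urlencode({"location": location, "units": units})


def parse_weather(response_text: str) -> str:
    """Turn the computer-readable JSON answer into a short description."""
    data = json.loads(response_text)
    return f"{data['location']}: {data['description']}, {data['temperature']} degrees"


# A made-up response in the shape such an API might return (assumed fields).
sample_response = '{"location": "Berlin", "description": "sunny", "temperature": 21}'

url = build_query_url("Berlin")
summary = parse_weather(sample_response)
print(url)
print(summary)
```

In a real program the sample string would be replaced by the text returned from requesting the URL, and the field names would follow the documentation of the service being used.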