the recording, so welcome back everyone, also when you're watching it on Moodle. We just talked about how you can load in a BMP image and then display it in the R plot window just using the image function like this, so I hope that will be useful. We will get back to it at a certain point in time, and you guys can practice a little bit during the assignments as well, but trust me, it's not going to be on the exam.

All right, so here's where the break should have been. I already told you about several functions to load data into R: data for loading data from an R package, read.table, read.csv, readLines (yeah, I'm recording, I'm recording; I definitely need a red light somewhere so that people know that I'm recording), and we can read binary files. But then the question becomes what to do after. I've loaded my data into R and now I want to do something with it. Generally what you want to do with your data is something like filtering. Generally you don't load just one data set; you have different data sets. For example, I have a data set which contains some measurements on individuals, and then I did sequencing and I have another file with sequences for certain individuals. So if we want to combine data or filter data, we can use the %in% operator for that.

Imagine that I have two matrices, matrix A and matrix B, and both matrices have an ID column. So matrix A has a column called ID, matrix B has a column called ID, and now I want to take all of the elements of matrix A which have a corresponding ID in matrix B. What I can then ask is: take the ID column of A, which of these are in the ID column of matrix B? So I can match two matrices together, and then I can say: I want to take the subset of matrix A where all the elements in A also occur in matrix B. I can then use this logical vector, right, this will just say TRUE, FALSE, TRUE, TRUE, TRUE for all the
elements in A, so I can create a subset of matrix A using this TRUE/FALSE vector. It's very similar to what we saw before, saying that a measurement is larger than three; that's the same thing that we're doing here, but now we're taking two matrices and asking which elements of A are in B. Of course we can ask the opposite question as well: which elements of B are in A, and then we can use that to subset matrix B.

The which function allows you to transform this logical vector into a numeric one, by index. Imagine that we do A in B and it says TRUE, FALSE, TRUE, TRUE, FALSE; if I then say which, it will tell me one, three and four. So if I'm writing a for loop, I can just use which on the vector that we just created to get the indexes, and then I can use the indexes to make a subset. But of course I can also say for index in indexes, and then go through matrix A and take only the rows which are also in matrix B and do some computation with them. So which transforms a logical vector into a numeric one, and it uses the index number for that. If I have a vector like this, it will say one, three and four, because one is TRUE, three is TRUE and four is TRUE. I can use this to do the subset, or I can use it in a for loop to go through the matrix and do something with the rows in A which also occur in B.

Of course you can also subset a matrix, a vector or a data frame using a logical vector that you build yourself, for example to take all the columns with a value higher than six. So I have something which I call selection, because beforehand I don't know which columns in a will have a value higher than six. So I'm just going to say: repeat the value FALSE for all the columns in a. Now I have a TRUE/FALSE vector, and this is just FALSE, FALSE, FALSE, FALSE, FALSE, so it has the length of the number of
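As a minimal sketch of the %in% and which matching described here: the matrices A and B, their ID values and the extra columns (weight, reads) are made up for illustration and are not the data from the slides.

```r
# Two small example matrices that share an "ID" column (made-up values).
A <- cbind(ID = c(1, 2, 3, 4, 5), weight = c(20.1, 22.5, 19.8, 25.0, 21.3))
B <- cbind(ID = c(1, 3, 4, 9), reads = c(1000, 1500, 1200, 800))

# Logical vector: for each row of A, does its ID also occur in B?
inB <- A[, "ID"] %in% B[, "ID"]

# Subset A directly with the logical vector ...
A_in_B <- A[inB, , drop = FALSE]

# ... or turn the logical vector into row indexes first and loop over them.
indexes <- which(inB)
for (index in indexes) {
  cat("Row", index, "of A has ID", A[index, "ID"], "which also occurs in B\n")
}

# The opposite question: which rows of B have an ID that occurs in A?
B_in_A <- B[B[, "ID"] %in% A[, "ID"], , drop = FALSE]
```

Subsetting with the logical vector and subsetting with the indexes from which give the same rows; the indexes are mostly convenient when you want to loop.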
columns of a. Then I can go through each of the columns in a: I select a column x from a, I test if it's larger than six, and if any of the numbers are larger than six then I say, well, this one I need to select because it matches my criterion, so some value in this column of a needs to be higher than six. And then of course I can use this again to make a subset of my matrix: take the columns directly using selection, or I can say which selection. The first just uses the TRUE/FALSE vector; the second first transforms the TRUE/FALSE vector to numerical values and then selects the columns by number. So again, it's just making your data matrix smaller after you've loaded it in, based on some kind of selection criterion. This is something that is very common and which I run into a lot: I load in a matrix and now I want to know, for example, which columns contain a certain value, and only take the columns that have that value, or a value which is higher or lower.

All right, so you can create subsets of matrices like that. You can also use the subset function. I don't use this a lot, but I know a lot of people who do; they don't like to use this which and %in% structure, or to do the matching themselves by using a for loop, so they use the subset function. Here we are looking at the airquality data set, which comes built into R and has six columns: Ozone, Solar.R, Wind, Temp, Month and Day. The first thing that I do in R is load the data set, and then I can use the subset function. The subset function takes three parameters, or, well, it actually takes two at the minimum. The first parameter is the data set that you want to take a subset of. The second specifies the condition that you want to check, so take the column Temp higher than 80, right, so take all of the rows for which the
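The selection loop over the columns can be sketched as follows. The matrix a and its values are invented for illustration; only the pattern (start with all FALSE, flip to TRUE when any value in the column exceeds six) comes from the lecture.

```r
# Example matrix with made-up values: 4 rows, 5 columns (filled column-wise).
a <- matrix(c(1, 2, 3, 4,
              5, 9, 2, 1,
              0, 0, 6, 6,
              7, 1, 1, 1,
              3, 3, 3, 3), nrow = 4, ncol = 5)

# Start with FALSE for every column, since we do not know yet which qualify.
selection <- rep(FALSE, ncol(a))

for (x in 1:ncol(a)) {
  # Select column x if any of its values is larger than 6.
  if (any(a[, x] > 6)) selection[x] <- TRUE
}

by_logical <- a[, selection, drop = FALSE]         # subset with the TRUE/FALSE vector
by_number  <- a[, which(selection), drop = FALSE]  # subset with column numbers
```

Both subsets contain the same columns; drop = FALSE is only there so that the result stays a matrix even when a single column is selected.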
temperature column is above 80. And then what do I want to select? Well, I only want to select the Ozone column and the Temp column. So this takes your airquality data, which has six columns and a bunch of rows, and gives you back the subset for which the temperature is higher than 80, and it will only give you back the Ozone and the Temp columns. Because it also has the Day column, you can also say: subset the airquality data set where Day is one, so the first day of each month, and then select minus Temp, so select all of the columns except for the temperature column. This is the same as when you throw away things from a vector; this is throwing away a single column by the name of the column. You can also subset the airquality data set without a filter parameter; you can leave the filter parameter empty. So you can say subset airquality and then select, for example, all the columns between Ozone and Wind. The columns are, in order, Ozone, Solar.R, Wind, Temp, Month and Day, so Ozone to Wind will take the first three columns, but if you say Wind to Day then it will take Wind, Temp, Month and Day. So it's a different way of subsetting your matrix, and a lot of people like this. I'm not the biggest fan of it, but I want to show you guys that you can use the subset function to do the same thing as what you do with selection or with this which and %in% structure, where you're just directly specifying your selection criteria. We will get back to the airquality data set when we get to plots; when we start doing plots we will use it as an example of how to create plots showing the relationship between days and wind and ozone and temperature.

All right, so then once we have our matrix, we have done our subset, so we have selected the things that we
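The three subset calls described for the airquality data can be sketched like this. The variable names hot, day1 and cols are just illustrative; airquality itself ships with R.

```r
data(airquality)  # built-in data set: Ozone, Solar.R, Wind, Temp, Month, Day

# Rows where Temp is above 80; keep only the Ozone and Temp columns.
hot <- subset(airquality, Temp > 80, select = c(Ozone, Temp))

# Rows for the first day of each month; keep every column except Temp.
day1 <- subset(airquality, Day == 1, select = -Temp)

# No row filter at all; keep the range of columns from Ozone to Wind.
cols <- subset(airquality, select = Ozone:Wind)
```

Inside subset the column names are written without quotes, which is the main difference from indexing with square brackets.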
need, then for example we want to write it out to a file. My advice is always: when you make an R script, the script should begin with loading in data, then do manipulations on your data, and then write out a new matrix. Then you have a next script which takes the matrix you just created and does some other manipulations with it, or does some statistics on it, and writes out the statistics in the end.

So, the write.table function. Here you see the description of the function call. It has a lot of parameters, but these are the ones that I always use, and I use these options because I can then drag the text file to Excel. When I open up Excel and drag in a text file created with these options, it will automatically show the matrix in a proper way, because Excel recognizes the tab separator. So how do I write out a matrix? I say write.table, then you give the name of the matrix, in this case a, the matrix that we have been using. Then I say file equals a.txt, or give it a better file name. The separator is a tab, row.names is FALSE and quote is also FALSE. You have to remember that this cuts off the row names. If you want to keep the row names, you have to add a line above saying that you want to cbind the row names to the matrix, because the row names themselves are not a column of the matrix; they are the identifiers of each row. But if you use this and you say row.names is FALSE, then you can just drag it directly into Excel, and Excel will recognize it as a 2D matrix and load it properly. So these are the options that I always use.

Yeah, of course, there's a redeem from Risk Fun: 'next slide in German'. Okay, that's going to be fun; I actually have no idea what the next slide is. So these are the options that I use, but there are a lot of different options. The one that you actually want to sometimes set
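A minimal sketch of the write.table call with these options. The matrix a here is a small made-up example, and the files are written to R's temporary directory purely so that the sketch does not touch your working directory; in a real script you would use a proper file name.

```r
# Small example matrix with row names (made-up values).
a <- matrix(1:6, nrow = 2,
            dimnames = list(c("ind1", "ind2"), c("x", "y", "z")))

outfile     <- file.path(tempdir(), "a.txt")           # illustrative location
outfile_ids <- file.path(tempdir(), "a_with_ids.txt")  # illustrative location

# Tab-separated, no row names, no quotes: opens cleanly in a spreadsheet.
write.table(a, file = outfile, sep = "\t", row.names = FALSE, quote = FALSE)

# To keep the row names, bind them to the matrix as a regular column first.
with_ids <- cbind(ID = rownames(a), a)
write.table(with_ids, file = outfile_ids, sep = "\t",
            row.names = FALSE, quote = FALSE)

written <- readLines(outfile_ids)
```

Note that cbind with a character column turns the whole matrix into text, which is fine for writing it out but not for further calculations.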
yourself is na equals, because sometimes you want to say: for the missing values I want to use a dash or an x or something like that, and you can set that using na equals "x". The eol is the end of line; the standard is \n. If you're on Windows you might sometimes want to use \r\n, but that's a Windows-specific thing. It might actually be that when you install R on Windows the default end of line is \r\n, but that's something I have to check. Normally I only use these parameters and this works perfectly fine: it writes out the thing and then I just drag it into Excel so that I can look at it and scroll through it more quickly. All right, let me get a sip of coffee.

All right, next slide, in German. [In German:] One of them is to show you the progress in the R window. That means that, for example, I say for x in 1 until the number of columns in the big data matrix that I have, so bigdata here is a matrix. [In English:] Question: 'If you don't specify it, will it return NA as well?' If you don't specify it, it will just say NA in capital letters; that's the default, so it will just use the default value if you don't specify it. [In German:] So what we're doing here is going through a very big matrix, through the columns of this matrix, and at the end of the for loop I have something like cat, done, x, slash, ncol of bigdata. So if it takes quite a lot of time to go through a very big matrix, then every time I have finished a column of the matrix it shows me a message with where I am. You can also use it very well if you want to keep a log file. So if you want to have a file that you can just look into to see how far we are, then I often do cat with a message or something that I want in the file, and then I say: save this
to a file, and the file name is log.txt, and after that I say that I want to append to this file, so I say append is TRUE, and that means that the message I want to save is added at the end of this file. I can also use the cat function to clear a file. That means that if I have a file, for example log.txt, the first statement that I make in my script is: cat nothing, so just an empty character string, to the file log.txt, and that clears the whole log because I don't use an append statement.

[In English:] All right, I will do it quickly in English as well. If I want to have a progress report in the R window, I can do something like this: for x in one to the number of columns of bigdata. What do I want to do? Well, I normally have this line last, so I have all the manipulations first, and it might be that I have a function that runs for 50 minutes per column, so then after 50 minutes it will print done, column one of the number of columns. So I get a progress report which just runs in R. And of course if my data is big, or I have a lot of manipulations so that it takes a long time to do a single column, then I have a nice progress report which updates me, well, not on how long it will take, but on where I am. You can also use it for a log file: just print a message to the file called log.txt, and you have to set append to TRUE. I use this a lot in my scripts, so my scripts will... okay, 'next slide in Dutch'. I might want to limit this to three times; a couple of times is fun, but if I have to switch language too much we are going to take five minutes per slide. But, so, I can use this for a log file, and I use this in a lot of my scripts, where I just do all kinds of statements and at the end of each statement I write something to a log file. And you have to
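A minimal sketch of the progress report and the log file using cat. The matrix bigdata is a tiny stand-in, mean is only a placeholder for a long computation, and the log is written to R's temporary directory; all of these are assumptions for illustration.

```r
bigdata <- matrix(runif(20), nrow = 4, ncol = 5)  # small stand-in for a big matrix
logfile <- file.path(tempdir(), "log.txt")        # illustrative log location

# cat() of an empty string WITHOUT append = TRUE clears (or creates) the file.
cat("", file = logfile)

for (x in 1:ncol(bigdata)) {
  result <- mean(bigdata[, x])  # placeholder for a long-running computation

  # Progress report in the R console, as the last line of the loop.
  cat("Done", x, "/", ncol(bigdata), "\n")

  # Same kind of message appended to the end of the log file.
  cat("Finished column", x, "\n", file = logfile, append = TRUE)
}

log_lines <- readLines(logfile)
```

Without append = TRUE inside the loop, every iteration would overwrite the file and only the last message would survive.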
make sure that you append to this file, otherwise it will just empty out the file. So saying cat nothing to the file log will just clear the whole file and remove everything that's in there. So cat nothing with file is some file is a very dangerous statement, and you don't want to do this on your input file, because it will throw away your entire input file.

Okay, this is actually a pretty important slide, so I'll do it in Dutch and then we do it in English. [In Dutch:] This is about calculations. This is not the only slide, there are two slides, and this is the first. So for a gigantic matrix you do the calculations one by one, and then at the moment your mother comes in and your computer has to be switched off, or the power goes out, you can just say 'oh my god, I have to go', the computer goes off, and you can continue later. What happens is: first I make a file, the temp file tmp.txt, in which the calculations we are going to do will be kept. So before I start calculating, I empty the file once: reset, just make an empty file. In this case I filled a large matrix with random numbers between 0 and 1, and this matrix has 10,000 rows and 1,000 columns, which means that if I want to calculate something like correlations, that really takes a couple of hours. So what I do then is first check whether I already have calculations. I make an empty matrix, and if the file already exists, I read the file in and then I continue with the calculations. The next slide continues this; I will go through the slide in English as well.

[In English:] All right, so storing computations as you go is really useful when you have something which takes a long time and halfway through the power might cut out, and then you would lose all of the time
that you've already invested, and you don't want that. So what you do is store the results that you have so far, here in a file called temp.txt. First things first, I'm going to empty the file. Then I'm going to make a matrix which is huge: it has 10,000 rows and a thousand columns, so doing a correlation on this single matrix, bigdata, will take two to three hours. Of course it might be that halfway through you have to leave or the power cuts out, and then you lose everything. The structure for how to do it is this slide plus the next slide, so this is the first part. The first part is seeing if there's already some analysis that we need to load in. Here, let me make a mark, because this is wrong: minus analysis. Right, so what we are doing is creating an empty matrix, so it has nothing in there, no rows, no columns, and if this file exists then I'm going to load this file into our variable temp.

And then we do the next steps, which might be doing the correlation of one column against the next. Now, instead of doing the for loop from one to the number of rows or one to the number of columns, we're going to say for x in the maximum of one and the number of rows of temp plus one. Because temp might be empty, it starts from one: the number of rows of temp will be zero, plus one is one, and it takes the maximum of one and one, which is one. However, it might be that the file temp.txt already contains a thousand lines. If so, what happens at this point is that the number of rows of temp is a thousand, so it should now start at the maximum of one and a thousand and one. So it will start at a thousand and one and then continue on through all of the rows that it has in the matrix. So we do our analysis, and then what we do is we
just write the result to the temp file, and we use the cat function to do that, so we just append a single line. Every time we get a correlation coefficient, or a list of correlation coefficients, stored in results, we paste together the line on which we currently are and the correlation results, separate them by tabs, and add a newline at the end to make sure that we go to the next line. We store this in the file temp.txt, we append to the file, and we show a progress indicator. Doing this allows you to continue your analysis halfway through, because at this point I can just press quit in R, or I can just pull the plug from the computer. It will stop, R will kind of crash, and at that point we still have the calculations that we did stored in this temp.txt file. By using this structure we load it in if it exists, and if it doesn't exist we just use an empty matrix and row-bind to it. All right, so two very advanced concepts. We're probably not going to use them directly, although there are two assignments on them, and this is just for you guys to see if you can get it working and see if you can find the idea behind it.

All right, so external data, last part of the lecture: biomaRt. If I have a big biological database like Ensembl and I need my data in R, then I can manually search and create an Excel file, but the problem is that there's a lot of manual slave labor involved, because I have to do a search in Ensembl, see what Ensembl gives me, and then copy-paste this to my Excel file to build up my data set. A lot of online data sources offer things like a bulk download; Ensembl has an FTP site where you can just download, for example, all genes in the mouse genome or all genes in the human genome. Of course there's now less chance of errors, because I'm not copy-pasting stuff to make my own data set, but the problem here is that when I download these data
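The store-as-you-go structure (load earlier results if the file exists, start the loop at max(1, nrow(temp) + 1), append one line per result with cat) can be sketched as below. This is not the code from the slides: the small matrix, the helper name run_analysis, its stop_after argument (used only to simulate an interruption) and the row-against-all-rows correlation are assumptions for illustration.

```r
set.seed(1)
bigdata <- matrix(runif(6 * 4), nrow = 6, ncol = 4)  # small stand-in matrix
tmpfile <- file.path(tempdir(), "temp.txt")          # illustrative checkpoint file

run_analysis <- function(bigdata, tmpfile, stop_after = nrow(bigdata)) {
  # Load earlier results if the file exists and has content, else start empty.
  temp <- matrix(nrow = 0, ncol = 0)
  if (file.exists(tmpfile) && file.size(tmpfile) > 0) {
    temp <- as.matrix(read.table(tmpfile, sep = "\t"))
  }
  start <- max(1, nrow(temp) + 1)       # first row that still has to be done
  end   <- min(stop_after, nrow(bigdata))
  if (start > end) return(invisible(NULL))  # nothing left to do
  for (x in start:end) {
    # Placeholder analysis: correlate row x with every row of the matrix.
    result <- cor(bigdata[x, ], t(bigdata))
    # Append one tab-separated line: the row number, then the results.
    cat(paste(c(x, round(result, 4)), collapse = "\t"), "\n",
        sep = "", file = tmpfile, append = TRUE)
    cat("Done", x, "/", nrow(bigdata), "\n")  # progress indicator
  }
}

cat("", file = tmpfile)                          # reset once, before the first run
run_analysis(bigdata, tmpfile, stop_after = 2)   # simulate an interrupted run
run_analysis(bigdata, tmpfile)                   # resumes at row 3
saved <- read.table(tmpfile, sep = "\t")
```

The reset with cat belongs only at the very start of a fresh analysis; when you restart after a crash you skip that line, otherwise the stored results are thrown away.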
sets, they have very different structures. If I download my data from Ensembl it will have eight columns; however, if I download a very similar data set from UCSC or from another data source, the data might look completely different and have 12 columns. So to solve this, the fact that every biological database is slightly different and that I don't want to manually download data and create an Excel file, I can use biomaRt. biomaRt is an R package which allows you to retrieve data from the main biological databases, like Ensembl and UCSC, directly in R. BioMart itself is a community-driven project that provides unified access to distributed research data to facilitate the scientific discovery process, and it provides access to most if not all biologically relevant databases. It's not something which is specific to R; you can also use BioMart from Perl or from Python, or with XML, or using a REST API, but we'll be using R since this is an R lecture.

There are three important concepts when we talk about biomaRt. The first is the concept of a mart, and a mart is like a shopping mall kind of thing, like Walmart; Walmart is a mart, it's a shopping center. The mart in biomaRt is the link to the database that you want to connect to, so for example the Ensembl SNP database for mouse, or the Ensembl gene database for humans, or the Ensembl transcriptome for fish. Those are all different data providers, different marts. If we want to show all possible marts that we can connect to, we can use the listMarts function, and this will give us back a list of all available marts. I think most of them will be Ensembl, but there will be some other databases in there as well. What we want to retrieve from this shopping mart is called an
attribute. Once I've connected to a mart, I can use listAttributes on the mart, or on the variable that I get back, to show which things we can retrieve from this database. The filter is there to specify what we want to retrieve. We provide something to the mart, for example gene identifiers or a chromosomal region, and to specify to the data provider what we are going to query for, we use filters. So a filter is something that tells the data provider what our values mean, because we have to give them values; we can't just say 'give me everything'. For example, I want to have chromosome one from one to a thousand base pairs, but then I have to say that this 'chromosome one from one to a thousand base pairs' is a genomic region, because otherwise it would not know what I want to retrieve. So those are the three different concepts which we can use and which make biomaRt very flexible.

All right, so the first thing that we have to do is install biomaRt into R. If you have an older version of R, you have to source the Bioconductor biocLite script, and then you can install it by using biocLite biomaRt. If you have a newer version of R, you can just google 'biomaRt install', and you will get to a web page that shows you the commands that you need to install biomaRt in R. The problem with biomaRt is that it's not available on CRAN, so you can't just install it from the main R repository; it's provided by Bioconductor. Bioconductor is just a different repository with more packages for R. So we install the package, we load the package so that we can use the functions, and then we want to, for example, use the SNP database for mouse. So I'm going to say useMart snp, data set is mmusculus_snp. This changes every time, so use the listMarts function to see how the databases are
currently called, because I think it's not called snp anymore; I think now the mart is called ensembl underscore snp. And of course you don't have to connect to Ensembl, you can also connect to some other database. But we use the mart, so we specify which data provider we want to use, and we specify which data set we want to have. Of course Ensembl has not just got SNPs on mouse, it has also got SNPs on different types of fish and cattle and goats and these kinds of things. And then we get this snp.db. This returns more or less an object from which we can read; it's very similar to a file object.

Then we can use the getBM function, and to getBM we have to specify the attributes that we want to retrieve. In this case we want to retrieve the refsnp_id, so the ID for the SNP; we want to retrieve the allele, so we want to see if it's an A-to-G SNP or a C-to-T SNP; we want the chromosome name; and we want the start position. And of course, since SNPs are single nucleotide polymorphisms, the end position will be the same as the start position, so just retrieving the start position is enough. In this case we are going to use the SNP filter, and that means that I have to specify SNP names. I could have used a different filter, such as one called chromosomal region, but in this case I'm just going to specify the single nucleotide polymorphism by its name. The values are the things that I want to query for, so in this case I'm just going to query for a single SNP called rs37395614. And then I have to specify the mart, so where do I want to retrieve it from? Well, the mart that I want to use is the one that I just connected to, so snp.db.

That's it for today, so a very short overview of biomaRt. We will have an assignment about biomaRt, and there we are going to read a whole bunch of
SNPs from a file and retrieve them from biomaRt, so we're going to retrieve where they are located, just for you guys to practice and see what you can do with biomaRt. But biomaRt is very, very flexible, so it's definitely worth reading up on a little bit. The nice thing is that it has just two main functions: you use useMart to connect to a data provider, and then you use getBM, which allows you to get stuff from this data provider, and it will directly give back a data frame with the data that you need. All right, so that's it from me for today.

'Why do I need to use the c?' Well, the c is just there because you could have specified a list. In this case you don't have to use the c. It actually becomes more complex, because you can do two filters at the same time. So you could have a SNP filter and then a chromosomal region filter, and then it would get only these SNPs when they are in a certain region. It's more logical when you use genes: I could say get all of the genes on chromosome three from a certain start to a certain end position, and then I only want genes which are protein coding or which are microRNAs. Then the values that I provide are a list, where the first element of the list belongs to the first filter and the second element belongs to the second filter. But I'm using the c here because normally it is not a single value that you want to retrieve; you use it for bulk data retrieval. The c specifies that this is a vector. If you only want to retrieve one thing you don't have to use the c, but in general you would want to retrieve a vector or a list of things.

Any more questions? Questions can be about any part of the lecture, so if you have a question about something else, then please ask; that's what we're here for. Let me check Moodle to see
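A hedged sketch of the biomaRt query walked through in the lecture. It assumes the Bioconductor package biomaRt is installed and that Ensembl is reachable; the helper name fetch_snp_positions is made up, and the mart name ENSEMBL_MART_SNP is an assumption about current naming, so check listMarts() first, as the lecture advises. The part that needs network access is wrapped in if (FALSE) so that it does not run by accident.

```r
# Assumes biomaRt is installed from Bioconductor, for example with
#   install.packages("BiocManager"); BiocManager::install("biomaRt")

# Hypothetical helper: look up the position of SNPs by their rs identifiers.
fetch_snp_positions <- function(snp_ids, mart) {
  biomaRt::getBM(
    attributes = c("refsnp_id", "allele", "chr_name", "chrom_start"),
    filters    = "snp_filter",  # tells the mart that the values are SNP names
    values     = snp_ids,       # a vector, so many SNPs can be queried at once
    mart       = mart
  )
}

if (FALSE) {  # not run here: needs an internet connection to Ensembl
  library(biomaRt)
  listMarts()                   # check how the marts are currently named
  snp.db <- useMart("ENSEMBL_MART_SNP", dataset = "mmusculus_snp")
  listAttributes(snp.db)        # what can be retrieved from this mart
  listFilters(snp.db)           # how the query values can be specified
  fetch_snp_positions(c("rs37395614"), snp.db)
}
```

Passing the identifiers as a vector built with c is what makes this usable for bulk retrieval: the same call works for one SNP or for thousands read from a file.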
if I have everything uploaded for today. Let me also directly save the PDF, quickly go back to the beginning where we had the books, check if they are okay... yes, they look fine. And then I will go to Moodle and see if I have everything on there. All right, so I put the lecture online already. The recordings will come tomorrow. The assignments for lecture three are already there, I also updated the answers for lecture two, and we also have the data file that we need for assignment three, because we're going to load in data. 'Thanks, bye.' Yeah, bye Alexander, see you next week, or see you on Tuesday if you have any questions. So let me turn editing on and put the three books that I advised on Moodle, and I will move them all the way up. So then the first books are there: Understanding Statistics Using R and A Beginner's Guide to R. Yes, 3 p.m. every Tuesday. It's also in the Moodle, by the way: the Zoom link is on the Moodle, all the way at the top. You have the course information, and the second item is the Zoom link, so you can just click the link at 3 p.m. and you should automatically join. Of course, if you have any issues joining the Zoom meeting, let me know and we can fix that. But I think for today, yeah: we have the lecture, the assignment, the data which belongs to the assignment, and we have the three books online now. So for anyone interested in reading the free Springer books, they can be downloaded from Moodle.

That's it for today. I actually wanted to try something with you guys who are still here, because there is a feature in Twitch which I never used, and that is called raiding. Let me see if I can find it; I didn't plan this, so I'll just... raid a channel, that's it. All right, so now we need to find someone on Twitch that we can raid. It's a pretty interesting concept, raiding, so
everyone who's currently watching me, I can take you to another streamer, so we can all just all of a sudden... Danny: 'raid'... yeah, no, well, we can't raid me, right, because you're already in my channel. But, like, one of my favorite streamers, and probably... oh, that's for mature audiences.

'Do you think any of these three books are particularly good?' Yeah, they're all good. The first one, A Beginner's Guide to R, is something which would be very interesting to you guys. Introductory Statistics with R is a little bit harder, and it really works when you have statistics knowledge already. If you don't have a lot of statistical knowledge, so you don't know exactly what an ANOVA is or something like that, then the third book is the best, because Understanding Statistics Using R assumes that you have some programming experience with R but very little statistical experience. So it explains things like what is a t-test, what is an ANOVA, using R to introduce you to these concepts. If you already know a lot about statistics, for example you have done an SAS course and you know exactly what an ANOVA is, but you want to know how to do it in R, then the middle book, the second book, is the best. But yeah, the first book, A Beginner's Guide to R, that's a very basic book to learn or to get a feeling for how to do R.

All right, let me see if there's any science and technology guy that might be interested in... oh, there's me. I'm just scrolling Twitch to see if we can find someone to raid. So let me see: physics course, nah. Giraffes, live at Toronto Zoo; who loves giraffes? Wait, there are no giraffes currently in there. Actually, let me do the desktop audio. 'All right, see you later.' Yeah, see you, and see you next week, Roberto. Good. Welcome to Zoo Live TV. I just wanted to try the raid feature to see if we can just... I'm not a big Twitch user or something
like that. So what do we think about hatchlings in a nesting box? Does anyone have any ideas on who we should raid? Does anyone have a favorite streamer besides me of whom they think 'oh, that streamer could use a couple of viewers', and then we all just move over to that one channel? Ooh, gorillas live, that's interesting. Just chickens... are there any chickens in view? Well, there's a chicken, that's interesting. 'Thank you for the lecture.' Yeah, bye. All right, so I probably just have to decide on one of them. New PC building, egg watching... oh, this is cool, this is cool. All right, I'm just going to say we're going to go to this channel. So thanks, Danny, bye. Yep, bye, regularity, see you next week.

Let me see, actually, can I make you a VIP? Because that's something that I wanted to check out as well. So let me see: add a new VIP, the, uh, arousal, VIP, there you go. And then let's make Roberto a VIP as well: add role, you will be a VIP as well. Then we will add general gulag 93, yeah, that's you, you will be a VIP as well. And we will add regal as well, you will be a VIP as well, there you go. Anyone else want to be VIP? Then you can. All right, I'm going to add gris, that's such a... all pay p die three, that's you, all right, you will be a VIP as well. So, save. Skorita is already a VIP. I will add mataklal, since we got some more; mataklal is you, add role, you will be a VIP as well. Risk fun, all right, you will be a VIP as well. David Sharp: 'Thanks for the lecture, have to leave.' Yeah, see you. David Sharp, I will add you as a VIP as well, David, just so that I can get an idea of the people that are actively participating and asking questions. Chai dol, add, VIP, save; there you go, you are a VIP as well. And let me scroll up a little bit, are there more people? All right, Selina, of course, you are a VIP as well. Alexander is already one. All right, so if I missed anyone and you want to be a VIP
then just let me know. All right, so general gulag actually has this VIP thingy as well. Okay, so I think that's more or less it for the lecture. I will just try the raiding thing, just because I want to. I have no idea how it works, so you guys just have to tell me if it works, but in theory how it should work is that we should all move to the other channel. In the other channel there's a bunch of little chickens sitting underneath a lamp. So we're just going to do raid channel... all right, so this guy, very good, and then we will start our raid. All right, so everyone's ready: 13 viewers. So 12 people are going to join me at the little baby chicken cam. All right, so right now this should stop my stream, I think, so I am wondering