My name is Boris Steipe, and I'm going to shepherd you through this introductory workshop on exploratory data analysis with R. My original background, many, many years ago, is actually in medicine, but it devolved rapidly from that. From medicine I went into molecular biology, then into protein biophysics and protein engineering, trying to predict aspects of protein folding with computational methods, and becoming more and more interested in more abstract ideas of bioinformatics and computational methods. Coming from protein folding and protein engineering, I became interested in complex systems. What makes systems complex rather than just complicated? Why is it so hard to predict how biomolecular systems work? Why is it so hard to predict how proteins fold? All the information is there in the sequence, but why is it so hard to get it out of the sequence? Ultimately this led me to an interest in why evolution works in the first place and what it all means, literally: what is the meaning of biology, or in biology?

There's an interesting parallel in the concept of meaning, because if we look at how evolution works, we can roughly equate the idea of meaning with the idea of function. Function in biology, in the sense of that which is conserved on the evolutionary landscape and provides the selective advantage, we can to some level say: well, that's the meaning of that protein. Now, if I asked the 30 people in this room to give me a definition of function, I would maybe get 35 different definitions, some of them compatible, some of them not. Even though this is a word we use very often, we don't actually have a definition for it. We use surrogates, things like the GO terms that are associated with a gene, and that's very convenient: somebody has told us what the function is, or how we could label the function. But it's more subtle than that; even a protein with the same sequence can have slightly different functions in different organisms. Function always arises from the interaction of whatever the protein is and does with its context and its environment. So we're in a domain of uncertainty.

And this is why we need things like exploratory data analysis. The main idea of exploratory data analysis is that we use numerical analysis and graphical presentation, and graphics are very important in exploratory data analysis, without a particular underlying model or hypothesis, in order to do things like uncover underlying structure in the data. Looking at some particular feature in the data: is it all just amorphous, distributed in the same way? Or is there something there that is different from what I would expect from a random model? If I can find something that's different from what I would expect from a random model, that always means there's something operating that makes it so, and that's usually what's of interest. We can identify important variables. If we have numbers associated with our data, usually there are large numbers and there are small numbers, and what aspects of our data make some features come out with large values and some with small values? That's the domain of regression analysis. We can try to detect outliers and anomalies. So the idea of averaging is not just to say what's common among everything, but also to be able to say which of my data elements are very different.
And then being able to ask: why could they be different? Why don't all genes behave in the same way? Then to detect trends, of course, and in the kinds of applications we do that often comes down to the question of how we can make predictions from our data. And then perhaps to develop statistical models; I think we'll talk a little more about what a statistical model means. And we can test underlying assumptions.

So the goal of exploratory data analysis really is hypothesis generation, not hypothesis testing. And being able to generate hypotheses has a lot to do with creatively playing with your data. You don't really know what it is going to tell you; you have to poke it until it starts telling its story. Therefore it's very often the first step of analysis. Once you have a hypothesis, you can throw more sophisticated statistical tools at it and find out whether your hypothesis is statistically validated, i.e. whether, say, the predictions it would make are statistically significant. But in order to do that, you need a hypothesis to start with.

So in practice, the things we do in EDA are computing and tabulating basic descriptors of data properties, such as ranges, means, quantiles, and variances; these are things we will start out with this morning. Then we can take these numbers and generate graphics from them: box plots, histograms, scatter plots, and see how they relate to each other. We apply transformations, such as log or rank transformations, to remove some of the numeric variance and bring out features more clearly in our data. We can compare observations to statistical models, for example with QQ plots; we'll be doing some of these rank plots or quantile plots. We can simplify data, and we can use clustering to find underlying structure. And all of that with the final goal of finding statistical models, or better ideas, that we can then use as appropriate for hypothesis testing and prediction.
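Just to make that list a little more concrete before we get to the real data, here is a minimal sketch of such a first pass in R. The data here are made up (a skewed, log-normal sample), purely for illustration:

set.seed(112358)
x <- rlnorm(200)                  # synthetic, skewed data, a bit like expression values

summary(x)                        # basic descriptors: range, quartiles, mean
var(x)                            # spread
sd(x)

hist(x, breaks = 20)              # the raw distribution
hist(log(x), breaks = 20)         # a log transformation brings out the structure
boxplot(log(x))

qqnorm(log(x))                    # compare the observations to a normal model
qqline(log(x))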
Now, R is really excellent for that, not just because it's a full-featured programming language with a lot of extensions, so it can work as a statistical workbench. It's in fact a language for writing domain-specific languages in very many different domains, and much work has been done to write the subset of R suitable for high-throughput molecular biology in the Bioconductor project. But one of the very useful things about R is that it has easy access to graphics. Plotting routines, graphical routines, ways to visualize data are really built in, and that is a factor that is usually not as well integrated with the language in other languages that are used for data analysis. The packages and libraries, i.e. contributed code, are very well vetted and very sophisticated. And there's a large and growing community, which is important: if you ever have questions, you will find experts in the field.

So maybe 10 or 15 years ago almost all of bioinformatics in the lab was done with Perl. Then people started realizing that Perl is kind of awkward and has its deficiencies if you want to write rigorous, complex, and reproducible code. So people started embracing object-oriented paradigms and working with Python, which is an excellent scripting language with a great following. But I think currently much of bioinformatics is done in R. Interestingly, we tend to work with these so-called scripting languages much more than with compiled languages.

Maybe this is changing a little bit in the age of high-throughput genomics, because if you have compiled code for a problem, it may run up to 100 times faster than if you're using a scripted language such as R. Therefore many of R's packages actually use compiled code in the background and communicate with these compiled elements for speed, so you can get the best of both worlds there. But what's really important about R is that you can put your programs into scripts, work interactively with the scripts line by line, develop things, and change things very rapidly. It's extremely flexible. And as a software development paradigm in the typical molecular biology laboratory, this is extremely important, because we don't build large static applications that are going to be around for years and years. We build small, flexible scripts to solve individual problems. We write up the results, we send them off, we hope to get them published, and then we move on to the next project. So our software development always works against moving targets. Once we've solved a problem, we put it away; it's no longer science to solve it again and again and again. We solve a new problem. And this is why in many bioinformatics laboratories we basically just use small scripting languages: they're most flexible for mixing and matching, combining and patching things together, and building new things. So these are some of the reasons why R has become so successful in bioinformatics.

Graphics, yes. Good graphics are immensely valuable. When is a graphic good? Well, that's actually quite simple. In a good graphic, every displayed element supports the information that you're trying to convey with the image. You don't put in extraneous things. You may highlight some things with color, but the more graphical elements there are in your plot that don't directly convey the message, the less likely it is that your message will get across. There are a lot of good examples available, but fundamentally there's just one simple and easy rule: use less ink. The less ink you use to convey a particular message, the more likely your message is to be received. So make sure that all the elements in your graphic are necessary, that all the elements in your graphic are informative, and that all the information in your data is displayed. But not all of R's defaults use as little ink as possible, and if you use a package like ggplot2, well, you're going to teach ggplot2, Lauren, right? That's awesome, because I usually don't. This is an example from ggplot2, a so-called bubble chart, and things become very, very easy there. Maybe tomorrow morning, after you've had the ggplot2 experience, we can have a little discussion about why I don't use ggplot2 and why you will probably find it awesome.

Can I full-screen that presentation? It becomes so inflexible then. Let me make it a little larger; you don't actually need to read it. Okay. Right. So here's a different example of graphics. This is taken from an article on the internet, from Microsoft TechNet, and the caption was about the plotting capabilities of Microsoft Excel. What it says is: what if you could gussy up a report or pretty up a chart without much additional work? What if, using just one extra line of code, you could create a Microsoft Excel column chart that included a cool gradient fill like this one? Okay.
And then we can discuss it; it doesn't get much better on closer inspection. Let's see whether we can count the ways in which this graphic is wrong. Like wrong with a capital W. Does anybody notice something odd here? So, my God, what does it even say? These are numbers, something about system usage, and these are different Windows operating systems, and numbers of computers. Okay. So assume this shows the numbers of computers on which various operating systems are installed. The numbers of computers would be here, and this is Windows whatever, and Windows XP and Windows 2008 and so on.

Why is it in 3D? Yeah. Is the 3D effect helpful? No. Actually, it's distracting, right? Because it becomes much harder to see: what is this number? Where does this number land on that scale? It becomes quite difficult. How many numbers have been used to generate that plot? Five numbers. So this is a chart that represents five numbers in five different categories. What about the cool gradient fill they were talking about? It doesn't give any extra information at all; it's just painting on these stacks. See how the black and the black don't line up? It's just painting on the stacks. Sometimes we use gradient fills to emphasize, in a 3D plot, where we are in the vertical dimension and so on. So, since the question was "what if", I think the answer should be: that would be pretty terrible.

So these are actually only five numbers for such a huge plot. There are no units given. If you are a graduate student and I happen to be in one of your seminars and there are no units in one of the graphics you present, I will probably call that out. So don't let that happen. The colors are meaningless: they don't have any connection to the actual scale, and the range differs for each stack. The 3D doesn't mean anything and makes it harder to read. So the point I'm making here is that having highly sophisticated graphics capabilities does not absolve you from thinking about your data presentation and your information design in an equally sophisticated way. And the more you can do, the more restraint is appropriate to get things right.

So where do we learn from? Where do we find exemplars? I find two excellent sources of information on Reddit. Now, I try not to go on Reddit at all if possible because it's such an amazing time sink; don't get me started on procrastinating. But the visualization subreddit is quite nice because you have data visualization examples, and the good thing is you don't only have the examples, you have comments about them. People comment there and call out what's good and what's bad. So from following this kind of discussion, you can start developing an intuition for how to make effective plots and present data effectively. Similar is the subreddit DataIsBeautiful. This often takes current data visualization results that people have recently posted, that maybe made the news or are otherwise interesting, and once again the comments are what really makes it powerful.

Okay. So, very roughly, if we talk about statistics, we can divide the world into descriptive and inferential domains. Descriptive statistics tells us what the features of our data are. Inferential statistics takes statistical models and makes predictions about what the features of samples from the population that we have not yet observed could possibly be.
So we'll start out with some descriptive statistics on the data that we've looked at. And, oh yes, all in all I would like to place these two days, as I did yesterday, under this one motto: not knowing is the most intimate. Here's the story, which I pulled from the classic Book of Equanimity. Attention: Master Jizo asked Hogan, so Jizo was a very famous Zen Buddhist master and Hogan at that time was a monk on pilgrimage, "Where have you come from?" And he answered, "I pilgrimage aimlessly." "What is the matter of your pilgrimage?" "I don't know." So to a certain degree, this could apply to you as well here: why are we learning exploratory data analysis, for example? And then Jizo replied, "Not knowing is the most intimate."

Now, to a certain level you could think of this as a contradiction; many Zen koans are contradictions that are simply meant to make you think. But I think applying it to this workshop is quite appropriate. Not knowing does not mean knowing nothing. You all know a lot. You know a lot about biology. You don't know that much about exploratory data analysis with R, I presume; maybe you know more than I do, but I think you have a reason to be here. So in that respect you're both in the state of knowing and in the state of not knowing. But what makes anything we learn today, or talk about today and tomorrow, meaningful at all is whether you are able to relate it back to what you know. It becomes meaningful if your knowledge can be expressed in new ways, if you can drive this whole exploration with your insights about biology. And I'll be calling on you a lot. I'll be looking at data with you, and the question is always going to be: come up with some idea of the kind of questions we could ask. And then let's talk about how we can take such an idea and translate it into code. That's ultimately what we are going to do, and in the end, I hope, achieve some great enlightenment.

So let's get started with that. The bioinformatics.ca EDA 2017 page has instructions on how to load our current RStudio project for today. If you were here yesterday, it's exactly the same procedure, just with a different project name as the repository URL. If you're new here, it works in exactly the same way as the introduction-to-the-environment project that I hope you have done with the introductory R tutorial. So the first step, which should already have been done, is that you should have created a project directory somewhere on your computer. If you don't have a project directory yet, make one now; if you don't know how to do that, Lauren and Greg will help you. But start by having a project directory. Then open RStudio, select File, New Project, click on Version Control, click on Git, and then enter https://github.com/hyginn/R_EDA-Introduction as the repository URL; press the Tab key and the directory name should then autofill. Then find your project directory, click Create Project, and type init() into the console pane. And when you're done with that, please show us your blue or green or bluish post-it so we know you're done, and we'll reconnect at that point.

At this stage, everything should look approximately like this. As a little check that everything is correct, you should have a file called myscript.r at the bottom of the list of files in the files pane. All of these little sections here are called panes, pane as in window pane, P-A-N-E, not as in P-A-I-N. And myscript.r, if you click on it, gets opened in the editor pane.
And I've provided this as a file that you can work with and edit and put your notes in; treat it as your lab journal for today. You can write R code in it, try things out, and play around with it. When you save it, it's not going to be overwritten if we update the version of the project at some point and you pull it back from the GitHub server. That's why it's a special file. If instead you put comments into the main script file, the R EDA introduction script, and I then post an update, your comments will be lost, or worse, if you've committed them, RStudio will require you to reconcile the editing conflicts before it loads anything at all. Okay. And there's a table of contents for the different sections if you need to reconnect; using the line numbers gives us a good reference for where we are in this code.

So let's look at the project files. What's in the box? This is a typical RStudio project. At the top is a file called .gitignore, which tells version control which files in this project are not under version control; sometimes RStudio builds temporary files and these should not be saved in the version control database. There's the file init.r, which contains a function that creates this myscript.r file at the beginning. There's a file called .Rhistory. This is empty; in this file previous commands get stored. This may be useful for looking up commands, but in order to actually do some work on a particular project, we keep all working commands in the script file. So the script file is actually your most important project asset. In principle you could copy and paste things from .Rhistory, but of course .Rhistory contains everything that we type on the console or that gets echoed to the console, so it also contains all of the errors and all of the commands that inadvertently broke your data, and so on. So working with an .Rhistory file has its problems.

There's an .Rprofile, which is the first file in a project that gets loaded when RStudio starts up a project. So in your own projects you can use .Rprofile files for project-specific configurations that you might want to make. For example, this .Rprofile, when it gets run, sources the file init.r. So when RStudio starts this project, the file init.r gets sourced; it's a file that I've written to configure, or help you configure, things. The file init.r has a little command which says: if the file myscript.r does not exist, then make a copy of the file tmp.r, which is part of the project files I've uploaded to GitHub, and call it myscript.r. Since this file is newly created in your project, it's not under version control in anything I do, and this is why it does not get overwritten when you download a new version. And then it puts the R EDA introduction script file into the editor. So this is how things start when everything starts up.
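Roughly paraphrased, the logic just described might look like the sketch below. This is not the verbatim project code, and the name of the main script file in the last line is an assumption:

# run when the project starts up (sourced via .Rprofile -> init.r)
if (!file.exists("myscript.r")) {
  file.copy("tmp.r", "myscript.r")     # tmp.r is the template shipped with the project
}
file.edit("R_EDA-Introduction.R")      # open the main script (hypothetical file name) in the editor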
What else do we have? R refcards: rRefCard.pdf. Let's have a look at this. It's a PDF file, six pages, very dense, that is a summary of R commands. If you know all of these R commands, you know way, way more than I do. It helps you get an idea of what's out there, and I think it's most useful as a concise, compressed overview of the commands that exist. So at some point it might be useful to just read through it and try to get an intuition for what's easily available and what's perhaps missing.

This may help you when you need to solve a particular problem, to know how to phrase a Google query or how to phrase a request for information somewhere. So the idea is not really to learn this by heart. And the idea also really isn't to use it as a reference, because when you code in the RStudio IDE, the integrated development environment, you have all these references at your fingertips. For example, if I take a random function here, say summary: once I type the first three characters, RStudio tells me which commands could possibly continue my first three characters, so sum and summary, and then I can either click on it or, if it's selected, press the Tab key to complete it. And all our functions usually take a number of parameters, something they work on or something that allows us to modify their behavior. In the editor, if I hover over a command, after a short time a little command signature or function signature is displayed, and that shows the name of the function and what typical options or arguments or parameters could be passed to it. So in this case, we get a summary of an object, and the three dots mean there are many other parameters and arguments that we could possibly use. If we want to learn more about what these are, you can use the Help tab in the assets pane; you can type summary there and get information on the usage and the arguments. So: object, maxsum, digits for formatting, additional arguments affecting the summary, which might be described in the Details section. The other thing is Value. Value is always the return value of a function. In order to use a function, you need to know what you put in and what you get out; what you get out is called the value. It is what is returned with the return statement of the function. And often there are also examples.

Okay, so that was the refcards. There's a separate refcard file on data mining that might also be useful. There are utility files: one is called readS3.R and one is called typeInfo.R. typeInfo.R is the little bit of code that we discussed yesterday; it simply allows you to look inside the internals of an R object and figure out how it is structured, what it contains, what its attributes are, and get all of that information very quickly. We'll use that from time to time if we want to look in more detail at the internal structure of our objects. There are templates that you can edit and use later on, one of them for how to write scripts in a meaningful way, just as a suggestion for how your own script files could be structured: you write down the purpose, version, date, author, and notes. One early command that is usually good to have in script files is to set the working directory to your project directory; we'll talk about that in a minute. Then you define and explain parameters, and then you load required packages.

So if there are additional packages that you use in your code, this is a code paradigm that I like to use in script files. Fundamentally, if you want to use a package, you need to go through two steps. You need to install the package once, so that you have it on your computer: if you run the function install.packages() and give it the package name as a string, R will access a server at CRAN, the Comprehensive R Archive Network, where many R packages are stored and made publicly available. The good thing about CRAN is that not just anybody can post stuff on CRAN; all packages on CRAN are extensively peer reviewed.
They actually work and they won't do anything malicious with your code. So they're peer reviewed, they are built from source, and it is carefully monitored that the package being distributed is byte-for-byte identical to the package that was originally submitted. So if you do that, the package gets installed; we'll actually get to installing a package rather soon. And then, in order to make it available in your current R session, you need to load it with the command library(). So if we install the package RUnit, we can then use library(RUnit) to make the commands and the functions from within that package available in our current R session.

Now, in a script, I need to be sure that if I run the script and use package functions, the package actually exists on the computer, but I don't want to download it over the internet, with all of its dependencies, every time the script runs; that can be a slow process. So I don't want to install it every time, but it has to exist. In order to do that, I use the require() command. require() does the same thing as library(), but it has a return value of TRUE or FALSE: require() loads the package, and if the package does not exist, require() returns FALSE. So if the package does not exist, this negation operator turns the FALSE into a TRUE, which means my if condition is TRUE, and then the block inside, install.packages() and library() for RUnit, is executed. This means that if the package does not yet exist on my computer, it gets downloaded from the internet and installed. And while R evaluates that condition, it of course runs the require() command, so if the package already existed, it gets loaded at that point and nothing else is done. So basically what this little if condition does is install the package from the internet only if it's not already on my computer. And quietly = TRUE simply turns off the message that the package didn't exist; it just works silently in the background in a script. If you source the script, nothing gets displayed, but if you run the line interactively, you will get a warning that the package didn't exist, and this can be confusing because the warning comes up after the package has been installed. We'll get to that. This is just to explain a piece of code that you will encounter often in the scripts we work with.

Then we have a section for functions and a section for the actual process of your workflow; a section where you might put tests, because testing is always crucially important; and a line marked END. I frequently share code by email or as an attachment, and if I always include a line structured like this at the end of the script, I can be sure that the entire file has been transmitted, or that the file I received is complete. If I omit it even once, that defeats the entire purpose of having it all the time. So this is not a language or syntax requirement; it's just a convention that I like to use to make sure that the files I pass around and share with other people are, and remain, complete. So in my scripts, this line is always there.
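Written out, the pattern described above looks like this sketch, using RUnit as the example package:

# install the package from CRAN only if it is not already on this computer,
# then make its functions available in the current session
if (!require(RUnit, quietly = TRUE)) {
  install.packages("RUnit")
  library(RUnit)
}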
Somewhat similar is the file function template. If you write your own functions, you can copy it, for example in RStudio by checking this little box and, under the More menu item, telling it to copy, and then put in a new name; try not to overwrite the old name. It has basically all of the components that you need to keep a nicely commented, structured piece of code for writing your own functions. So this is just there to make life easier in the future.

What else is there? There's some data, which we'll talk about. Oh, and there are two papers. One is Jaitin et al. 2014, about single-cell RNA analysis of hematopoietic cells, cells from the hematopoietic lineage, and that's the source of some of the data that we'll be using today and tomorrow. So the paper explaining it is there. Since it's copyrighted, as is the other paper, I've put it into a password-protected zip archive, and the password is c-b-w in lowercase. Weissgerber et al. we might discuss tomorrow; we might not get around to it. It has some good ideas about visualization, about what to do with bar plots and when not to use bar plots. If you have ever sat through a biomolecular presentation of some sort, and I know all of you have, the great graphical paradigm there is the bar plot, sometimes as a whisker plot, or we also call them dynamite plots, with little fuses sticking out at the top that make everything blow up. Weissgerber discusses how this is misleading and what some better alternatives are, and if you like, we could perhaps discuss how to implement some of them. So this is some ancillary information along with the data that we will be working with.

Now, to confirm that all of this is there: if you type the command getwd(), for get working directory, R should tell you where in the world you currently are, where that directory is. And then if you call list.files() without additional parameters, you will get the listing of this directory. So whenever we do something like read a file, write a file, list files, or whatever, in order for this to work correctly we in principle need a fully qualified path. On a Windows computer it would start with C, colon, backslash, my files, backslash, R project, backslash, cbw, backslash, and then whatever the file name is. Except that if we don't give it a path but only a file name, the source of the file is the current working directory. The current working directory can be obtained by typing getwd(). It can be set with setwd(), where I pass some directory path, but it can also be set interactively in the Session menu of RStudio, and it can be set in different ways: to the project directory, to the source file location, or to the files pane location. So for example you can browse somewhere else with the files pane, to some other directory, by clicking on the buttons and going up and down, and then use Set Working Directory to set the working directory to that directory. So it's relatively easy to navigate around, and you don't really need to type an entire path to find your way back.

Yes. Okay. So this is getwd(), setwd(), and the various options. In principle, when you load the project, the project should have set its own working directory to the correct project directory; that's the way I set it up. Of course it might work slightly differently on Linux systems or on Windows systems, and I can't really test that, but I think it works.
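As a quick sketch of the commands just mentioned (the path in the comment is only a hypothetical example, not one of the project paths):

getwd()                               # where am I?
list.files()                          # what is in the current working directory?
# setwd("C:/my files/R project/cbw")  # set it explicitly if needed (hypothetical path)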
Nobody, at least yesterday, complained that the project directories were not in the right default working directory. Now let's start by loading some data. It's the same data that we worked with yesterday, but I have it again here. The way the data was loaded is in a script, readS3.R, because the data is based on the so-called supplementary data file 3 that is posted on the Science website accompanying this paper. After studying the script, you should load the object lpsdat, which is part of the project directory in a compressed RData file, lpsdat.RData. So open the script and have a look at it. It's slightly modified from what we did yesterday: we discussed that it would be easier to skip lines that we don't need rather than removing them, and that's reflected in this updated version. And there's a note at the end on how to load the object, and I would like you to load it. When you're done, and lpsdat is loaded as an R object with its 1300-something rows and 16 columns, put up a green post-it so we know you're there.

Now, some of you are very new to this; some of you are rather experienced with R and will be quickly done, and you don't have to be bored, and you don't necessarily have to go and check your Facebook or Twitter feeds, terrible as that is. You can do other things, and I have two suggestions here. One is that reproducible research requires testing of code. In order to make sure that your programs are doing what they ought to be doing, they don't only have to be syntactically correct, they also need to produce the correct results. This is why you should get into the habit of writing tests, tests, tests whenever you do any kind of analysis. The tests will take your analysis objects and confirm that the code you've written produces results that correspond to what you know the truth should be. There's an excellent package that supports that, called testthat. You can install and load the testthat package from CRAN, explore the vignettes and the functions it contains, and try using testthat functions and assertions, as we go along, to write some tests into your myscript.r file to confirm whether something worked or didn't work. So if we're working on a task and you're already done and feel you want to do something more, write some tests.

And the other thing that you could do, if there's some slack time that you want to pick up: there's a tutorial page in some of my bioinformatics teaching about regular expressions. I don't think we will be covering regular expressions in this session, except perhaps very briefly, but they're really, really important. When you work with any kind of textual data and you need to change things, massage things, and move things around, regular expressions are your friend, and you can learn more about regular expressions there. And if all of that is already old hat to you and you're no longer interested and you want to do something completely different, you're welcome to replace me and teach part of this course.
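If you do want to try the testthat suggestion right away, a few sanity checks on the loaded object could look like the following sketch; the expected number of columns is an assumption that you should replace with what you know to be true:

if (!require(testthat, quietly = TRUE)) {
  install.packages("testthat")
  library(testthat)
}
expect_true(is.data.frame(lpsdat))           # the object we just loaded
expect_true("genes" %in% colnames(lpsdat))   # the gene-name column exists
expect_equal(ncol(lpsdat), 16)               # adjust to the actual column count
expect_false(any(is.na(lpsdat$genes)))       # no missing gene names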
Can I assume that everybody who doesn't have their post-it up just didn't put their post-it up, or is there anybody still struggling with this? Okay, I think we're all done. So once you've executed these commands and loaded our data file, lpsdat is made visible in the environment pane under the Data section. There's a little arrow here which we can use to look into the structure. The arrow tells me that there are columns called by particular names: genes, b.control, b.lps, mf.control, mf.lps, and so on. It tells me the type of each column. Remember, this is a data frame, and the values in one column all need to be of the same type. So the first column has values of type character, which are the individual gene name strings; all other columns are numeric types, which are these numbers here. And if we want to inspect what this data set actually contains, there's a little icon here that looks like a spreadsheet; if we click on that, the data set is shown to us and we can check whether it contains what we think it ought to contain.

Okay, now just to recapitulate something we did yesterday, here are three tasks, the same kind as yesterday, about subsetting, i.e. picking out values of some kind that are found in our data set. So pick either this one, which is the simplest, or this one, which is a bit more involved. You should be able to write that expression in one line, and it's quick to do, and I will walk through this one in two minutes or so. Bless you. If you have any troubles or difficulties or need to ask questions, remember: put up your red post-it and somebody will fly in and take care of it.

So, did everybody get something? I hope so. Just for illustration, I'll walk you through this example: gene names and expression values for monocytes that are stimulated by LPS, for the top 10 expression values. How do we do that? Well, first of all, we need the expression values, and we find them in this column, so we address it as lpsdat$mo.lps. If I type that, I will get all 1300 expression values. I don't want all 1300 expression values, so I just look at the front of them with head(). Okay, so I'm getting numbers here. Now I want only the top 10. How do I get the top 10? I could sort them and then pick the first 10, but if I sort these numbers, I would get the 10 highest numbers and I wouldn't know which genes they belong to; I would lose the relationship between the positions and the genes. So if I do something like this, the head of the sorted values, I get the highest values, or in this case the lowest values, but I don't know what these genes are. And I need to sort them in decreasing order, so I set decreasing = TRUE. Now I'm getting these values, but I still don't know what the genes are.

So I don't use sort() on problems of this type; I use order(). What order() does is give me the indices of the elements according to their sorted order. You see that the highest value here is 6.1, or minus 6.1. If I use order(), it tells me this corresponds to elements 205, 2, 367, and so on. So basically I should predict that element 205 in this column has the value minus 6.1. And that's what I need to solve this little task, because if I use the index 205 on $genes, I have the associated gene name, H2EB1. So this is the single top expression value. Okay, so I can put that together. I can say: get me from lpsdat all the rows, ordered by expression value in decreasing order, and in terms of columns I want the column genes and the column for monocyte LPS. But actually I don't want all of these, I only want the first 10. So I could put that whole vector into an auxiliary variable and then just pick out the first 10 elements of that auxiliary variable. But I can also attach that selection right to the expression; it works in the same way, so I don't need to make an intermediate assignment.
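Putting that walk-through in one place, as a sketch that assumes the object and column names we saw in the environment pane (lpsdat, genes, mo.lps):

head(lpsdat$mo.lps)                            # peek at the expression values
head(sort(lpsdat$mo.lps, decreasing = TRUE))   # highest values, but the gene identity is lost
head(order(lpsdat$mo.lps, decreasing = TRUE))  # row indices of the highest values instead

# top ten expression values together with their gene names, in one statement
lpsdat[order(lpsdat$mo.lps, decreasing = TRUE), c("genes", "mo.lps")][1:10, ]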
So if I haven't messed up, this single statement here gives me the gene names and the expression values for LPS-stimulated monocytes for the top 10 expression values: row 205, 6.4, and so on. Could you extend this to the most differentially expressed genes? How would you do that? Where would you need to change something, and what do I even mean by that? Right now I'm sorting by the highest values. "So you would order by the difference between control and..." Exactly, I would order by the difference between control and stimulation. So how do I get the difference? Subtract. Okay. How about the most differentially changed, where I include both repressed and stimulated genes? Exactly: I subtract them and take the absolute value. How do I get the absolute value of something in R? What would you guess? "abs." Do you think so? You're right. Now, if you wouldn't have guessed, or you had tried and it hadn't worked, what would you have done? Google. Does it work? Okay, so here in this third link, without even clicking it, I see the abs() function. So this is, again, coming back to this theme of not knowing: you might know what you need to do, you might not know the exact R function, but it's easy to find it. If we try to sort on the absolute value of the difference between two columns, the difference goes inside those brackets, exactly, so here.

"I guess there's no need to make another column and then order by that column to compare the..." Okay, so that's actually an interesting point: when do we make intermediate assignments, and when do we just put it all into one statement? If things are very straightforward, can never go wrong, and don't need a lot of troubleshooting, then we can just put it all into one statement; it's compact. However, if you need to comment it, because it's not obvious why you're doing a particular operation, or if you need to troubleshoot the individual steps or just confirm that they're valid, then it's a better idea to go through intermediate assignments. "I guess I misphrased what I was asking. What I'm trying to say is that, from a computational perspective, it seems to me I would need to store the differences in another column with the same indices for the rows, so that everything stays coordinated with the appropriate assignment." Exactly, so you have the correct indices. But we're not doing any of that here. Yeah.

Right, so here we have the genes with the largest differential effects, whereas before we had the highest overall expression. Note that our minus 6.1 is not in this list at all anymore. Why? Well, it might have been very high, but it might also have been high in the controls, so there wasn't anything differential about it. This one is being stimulated; these two genes are being repressed. Now, if we were doing an intermediate assignment, I could do it this way: I assign the differences to some variable, tmp, and then do the same thing with tmp here, where tmp is now a numeric vector which contains all of the individual values.
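Here is that differential-expression variant as a sketch; the name of the monocyte control column (mo.control) is an assumption, chosen by analogy with the column names we saw earlier:

# order by the absolute difference between stimulated and control monocytes,
# so that both induced and repressed genes rank highly
sel <- order(abs(lpsdat$mo.lps - lpsdat$mo.control), decreasing = TRUE)
lpsdat[sel[1:10], c("genes", "mo.control", "mo.lps")]

# the same thing with an explicit intermediate assignment, easier to inspect
tmp <- abs(lpsdat$mo.lps - lpsdat$mo.control)
lpsdat[order(tmp, decreasing = TRUE)[1:10], c("genes", "mo.control", "mo.lps")]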
All right. So just to get our juices flowing: we've loaded the data set, we've looked into it, we've taken values from it, we've done small manipulations, we've ordered things and selected the top 10. And I hope that if we do this again and again and again, it will start becoming clearer than it was yesterday. So let's start talking a little bit about descriptive statistics and start making some simple plots.

Descriptive statistics applied to that data set would ask things like: what's the distribution of our values? What's the distribution of differences? What are the extremes? What's the highest value, what's the lowest value, what's the most typical value? And that kind of thing. To look at that, it's usually a very good idea to start with synthetic data, i.e. data for which you know the properties before you even start the analysis. That allows you to verify that your understanding of the analytic procedure is correct. If you get the wrong result with synthetic data, you shouldn't believe that you can use the same procedure to get the right results on your real data. So your work with synthetic data is, in wet-lab terms, a positive control, and you should always be running positive controls. You have to treat your computational analysis just like a wet-lab experiment: there should be positive controls, there should be negative controls, and there should be a lab notebook.

One of the positive controls that we often use is the normal distribution. R has a number of built-in probability distributions, and the normal distribution is one of them. The command rnorm() gives us normally distributed data, where we can define how many values we want, what the mean should be, and what the standard deviation should be; but within that, they're random. Now, one more thing that I usually do when I use any kind of random values in teaching scripts is to set the seed of the random number generator. Because it could just happen that if I run this command here, I get 50 values of exactly zero. The probability that this would happen is extremely small, but it's possible, because they're random numbers: zero is in the range, so the same number 50 times could be the outcome of a random experiment. So in order to prevent things like that, really anomalous data cropping up and then causing great confusion, I set the seed of the random number generator. And this means that since the random number generator does not make truly, physically random numbers, but so-called pseudo-random numbers, which have all the properties of random numbers except that they are generated by an algorithm, if I run the algorithm over again with the same seed, I will get the same results.

So let me illustrate. If I set the seed to one, two, three, four, and then I throw a six-sided die 20 times, this is what I get: one, four, four, six, four, one, two, four, four. Would you believe that's a fair die? So many fours? Probably loaded. If I do this again, of course, I get different numbers. Incidentally, this is one thing about random numbers: truly random numbers usually contain many, many more repetitions than we as humans would produce if we tried to write down random numbers. We tend to make random numbers as different as possible, but that's not true randomness. Okay, now we have two sets of dice throws, and for good measure we throw in another one. But if I set my seed value back to what it was before and then sample again, I get exactly the same set of random numbers that I got the first time. So for all practical purposes they're random, but they're reproducible. And this ensures that, since you are all running the same random number generator on your machines, if you set the same seed that I set in the script, we all have the same set of normally distributed random numbers.
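As a sketch of what was just described, seed, dice, and reproducibility:

set.seed(1234)
sample(1:6, 20, replace = TRUE)   # twenty throws of a six-sided die
sample(1:6, 20, replace = TRUE)   # a second, different set of throws

set.seed(1234)
sample(1:6, 20, replace = TRUE)   # identical to the first set: same seed, same numbers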
So if we set the seed to 100 and then compute this x, the first value should be minus 0.5022 on all of the machines in the room. Any counterexamples? Okay. Very quick descriptive statistics. Many of the keywords of descriptive statistics are simply used in that form in R: just like abs() for absolute values, you can use mean(). That's our mean value. What do you expect it should be? Zero? These are 100 normally distributed random values, where the normal distribution is taken to have a mean of 0 and a standard deviation of 1. So do we expect exactly 0? We don't. We expect a small number, but it would be quite unlikely that we hit exactly 0. If this came out as 21, I would be very shaken in my confidence in the random number generator. So: a small number.

Median. What's the difference between mean and median? Asma, what's the difference between mean and median? Did you ever come across that? "The number at the 50th percentile." Okay, that's absolutely correct. The median is the number at the 50th percentile. What does that mean? I have 100 numbers. "It's determined more by ranking than by the actual numbers." Exactly. So I rank them, just like sorting or ordering them, and then I take the value at the 50% rank. And there are different ways to break ties when I have an even number of observations: if I have 101 values, I just take the 51st element, and there are 50 larger and 50 smaller ones; if I have 100, I need to decide what to do with the two middle elements. Do I average them, or do I take the lower, or the higher, or whatever? So there are different ways to compute medians. Important for statistics, but usually we just use the default here.

Why median and not mean? "In some situations the median... a typical example is income." That's an excellent example. So what does it mean if I take the income of the population of the United States and compute the mean? Right, it's heavily influenced by a small number of individuals at the high end. I would get a number, and if I look into the population, virtually nobody has that much money in their pockets. What does the median mean, on the other hand? It means that, whatever the distribution is, half the people have at most about that much income at their disposal. So sometimes the outliers are important, sometimes we want to suppress the outliers. Sometimes the distribution is heavily skewed, like the income distribution, and in that case we might want to use some kind of transformation, like a log transformation, to first bring things onto a more comparable scale before we calculate means or medians. But note that both of these values exist; they give somewhat comparable information, but you have to be aware of what the difference is.

What's IQR? What does IQR even stand for? Interquartile range, exactly. What's a quartile? It's a concept closely related to the median. If the median is the value at the 50% mark once we order the values, the interquartile range is the range between the 25% and the 75% marks. If we use the summary() function, we can see these things here. So the median is minus 0.5, the mean is a small number, the first quartile is down at minus 0.6, the third quartile is up at 0.6. So the interquartile range is the absolute value of this plus the absolute value of that, or the absolute value of the difference.
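And the descriptive statistics we just walked through, as one runnable sketch (variance and standard deviation are included here; they come up next):

set.seed(100)
x <- rnorm(100, mean = 0, sd = 1)   # the synthetic data from above

x[1]           # about -0.5022 on every machine, because we all set the same seed
mean(x)        # close to, but not exactly, zero
median(x)      # the value at the 50th percentile: ranks, not magnitudes
IQR(x)         # distance between the 25th and 75th percentiles
var(x)
sd(x)          # close to 1, as specified
summary(x)     # minimum, quartiles, median, mean, maximum in one call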
Variance and standard deviation: the standard deviation is the square root of the mean squared difference of the values from the mean. So the standard deviation here is close to 1. Again, this is not surprising, since the standard deviation for these values was specified as 1.

Okay. Now let's apply this to our LPS data. Which of the columns has the highest standard deviation? Can you solve that task? What do you need to do? What's the first thing you do? Conceptually, we need to know how to calculate the standard deviation for one of the data columns, and how to repeat that for all of the data columns. So let's solve the first task first: how do we calculate the standard deviation of one data column? Let's do it by hand, explicitly. There might be built-in ways to do this more conveniently, and if you use built-in functions, the upside is that any tricky things about the theory of how to do it correctly will be taken care of in the R code. The downside is that you need to remember what exists and how to use it properly. So that's always a trade-off. If things are simple, you're probably better off just writing a few lines of code by hand; if things get more involved, you're better off searching for how to do it correctly and then using a particular package. I find very often on the R-help mailing list that people jump to packages too soon. They try to import and use a package before they properly understand what they're actually trying to do. And we always emphasize: tell us a little more about what you're trying to do. Very often we find that the approach is not sound, and then you have to go back. So the question of what the right package to use is, is often an ill-posed question.

So let's try to do it by hand, just for one column. There's an implicit question here, of course: lpsdat contains more than just data. What are our data columns? We've given them names. Now, if we want to do things for more than one column and we don't have a vector of these names available, it does become easier to use the column numbers. So the data columns we want to work with here are columns 2, 3, 4, 5, up to 15. All right. So, calculating the standard deviation, what was that command? sd(). And we want all the rows, and we want column 2. So we get a number.

Now, how do we repeat that for all of the columns? Change the 2 to 3, and so on, and so on? Awkward. We can do this, but if you have 175 columns, it becomes very awkward. Does this work? I don't know; sometimes these kinds of things work. Let's see if sd() is smart enough. No. We do it column by column. So we need a way to repeat this command over and over again, each time with a different value here. Does that sound like something we ought to be able to do? Yeah? But what would happen then is that you would get the standard deviation of all of the values together; you wouldn't get them individually per column. It doesn't work like that; it would throw all the columns, all the data values, together. If you want to do it by row or by column individually, you can do two different things. One is to write a for loop that iterates and fills a variable with each of these measurements individually. The other is to use an apply statement, which works very similarly. So let's look at the iteration. We had for loops in the introductory R tutorial; I think we saw a for loop yesterday, but we didn't really dwell on it.
So let's talk briefly about for loops. I sometimes find myself writing phonological typos, where the typo is not just fat-fingered mistyping but a variant spelling of the right sound. I find that very strange, thinking about what the brain has to do to generate that and put it into code. Does anybody else make phonological typos? Brandon, you do? Nobody else? It must come from tenure. Ten years. If you think about it, it's really strange what needs to happen for that to happen. Yep.

Okay, so that's the generic structure of a for loop: for whatever variable in whatever vector, do something. This means it takes the values from that vector one by one by one, and then does that something. So for example, the way we often write things like that is for (i in 1:7); i, for index, is very often used, but it doesn't have to be called i, it could be called anything. If I execute this loop, I get a series of squares. And that's extremely useful, because we can use this structure to get all of these standard deviations; we just plug this in. So we want all the columns here, and we had columns 2 to 15, right? And we want to print. What should I type here? What should go into the square brackets? Based on i. So nothing where we expect rows, because we want all of the rows, so no subset selection there. And here, we type i. On every iteration of this loop, i gets the value that we want: the first iteration of the loop will give us 2, the next iteration will give us 3, and so on. So these are the standard deviations of these columns. Very simple statistics.

Now, I would kind of prefer to actually have the column names there, right? How do we get column names? Where are the column names stored and available? So these are our column names; it's a vector of the column names. In a different way? I'm not going to show you; ask me later and I'll show you how. Well, actually, what I can do is use colnames(lpsdat), and then use for (i in colnames(lpsdat)), and then it will give me column name by column name by column name, and I can use that for the selection. But what I can do here, simply for the printing, is print colnames(lpsdat) at the i-th position: the first column name, and so on. Right? So this value for B control, this value for B LPS, and so on.

Well, that's awkward, I don't like that; I want them on the same line. So what I should use here is sprintf(). Now, we want a text value, then a blank, then a floating point value, and then a line break. So we have two variables here: the text value and the floating point value. The text value comes first, then the floating point value. And that should look nicer. Okay, we don't just produce the value; we actually need to put it onto the console using cat(). cat(), like print(), takes the text value and displays it on the console. Otherwise the text values are just created within the loop, but nothing gets shown. This is different from working line by line, where the result is shown; within a loop, the individual inner results are not displayed on the console, so we need to do that explicitly. We could do it with print(), but then we would get the [1] in front of every line; or we do it with cat(), and then just the bare value is shown. Just to show you the difference, this is what it would look like with print(): we get the [1] here, we get it in quotation marks, and we explicitly see the backslash-n.
Whereas this just shows us the raw strings that we generated with sprintf(). Okay. That was our little ad hoc exercise on how to look into the data set and do a little bit of descriptive statistics on the individual columns. If you have any questions about that, ask us during the coffee break.
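To recap that column-wise standard deviation exercise in one place, here is a sketch of the loop we built, plus the apply() alternative that was mentioned (data columns 2 to 15, object and column names as above):

# standard deviation of each data column, printed next to its column name
for (i in 2:15) {
  cat(sprintf("%s\t%f\n", colnames(lpsdat)[i], sd(lpsdat[ , i])))
}

# the same result without an explicit loop: apply sd() over the columns (MARGIN = 2)
apply(lpsdat[ , 2:15], 2, sd)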