Once again, good morning. Can I have a brief show of hands from those who have successfully used RStudio to download today's workshop project via GitHub? Okay. So one of the great things that we're going to try, and then see whether it works out, is that I can put some code into the script that I'm working with, commit that, and then you can pull that code into your own version of the workshop script. So you'll be up to date with the things that we change in the script. I hope I missed nothing when converting the scripts from standalone files to RStudio projects. If there are references to a wiki somewhere, that is no longer current; point it out to me so I can fix it. Everything you need is now in the project directory.

Okay, what will we do today? The topic of today is exploratory data analysis of biological data using R. We'll go on a journey for that. Some of the things that I've noted for myself as objectives: you should be able to load, transform and visualize data. This is basically first recapitulating, and then picking up on, some things that we discussed yesterday with the sample data set which we downloaded for high-throughput expression data analysis. Then, as always, there is an objective that is not explicit but implicit: I would like you to begin generating ideas about how relationships in data can be explored. We'll be going through a number of examples, but ultimately only you know the context in which your own data becomes interesting. And one of the challenges, of course, is to look at your data with exploratory data analysis and find things to do with it: get imaginative, get creative about the kinds of questions that you can ask; then learn how to structure these questions and, after structuring them, actually put them into code and get results with a programming language. We'll look at some principles of effective visualization, and also, a little bit implicitly, where to go for help.

So what is exploratory data analysis? Well, as the name says, it's exploratory: it's data analysis that you do with rather few formal statistical approaches aimed at establishing something very precise and very specific about your data. What you're trying to do is to uncover underlying structure: in the first place to define which variables in your data might be important, perhaps to detect outliers and anomalies, or to detect trends; perhaps to develop statistical models, or to test underlying assumptions such as whether the data are normally distributed, random or not random. So the idea in exploratory data analysis is really hypothesis generation. At the end of your exploratory analysis you should be able to say: this is something that seems to be going on with my data; this is a hypothesis that I'm generating about my data. Now, in statistics a hypothesis is often expressed as a particular statistical model, which may or may not be correct, and by the end of tomorrow we'll probably talk about hypothesis testing, i.e. asking to what degree that model is actually correct or applicable to describe your data. So EDA, exploratory data analysis, is very often the first step of data analysis. You get your raw data and you just look at it and see what's going on. In practice, the things that you actually do include computing and tabulating basic descriptive properties, things like ranges, means, quantiles and variances, and we'll go through our data and actually do that.
And then generating simple graphics such as box plots, histograms and scatter plots to look at the distributions of the values which you find in your data. You might apply transformations: log transformations or rank transformations. You could compare observations to statistical models, i.e. you could make a plot to show how the variation in your data compares to a normal distribution, or to something that you would expect simply from random variation; you might simplify data through dimension reduction, identify underlying structure, and so on, all with the final goal of defining which statistical model might be appropriate. We'll go through quite a few of these techniques as the course progresses.

R is exquisitely useful for that. First of all because it's a full-featured programming language, so it's extremely well suited to flexibly work with your data: import it, transform it, and bring it into the state you need it in to do any kind of analysis in the first place. It's also often called a statistical workbench, i.e. there is a large number of statistical tools, methods and techniques that are either integrated into the base distribution of R or that can be loaded as add-on packages. All this makes data manipulation easy. When I first came to R, well, 15 years ago, most of bioinformatics was done by coding things in Perl. Perl is quite nice if you work with text-based data, even though it's often quipped that Perl is the only language where a program source file cannot actually be distinguished from its encrypted version, because of all the special and odd characters that appear. But what really struck a chord with me when I started using R was the easy access to graphics. Plotting is built into R, and this is so, so important for exploratory analysis that I can't imagine working in bioinformatics in any kind of environment where plotting relationships would not be accessible at your fingertips.

R is a growing programming language. It has become basically the de facto standard for most of what we do in bioinformatics, not just because of its offerings but also because of its community. So basically this is a virtuous cycle: as the community adopts a particular programming language and paradigm, more tools, more resources and more expertise are created, and that moves more people to adopt the same programming language and to think in the same standards. And as our demands grow in terms of the volume of our data and the performance our computational resources need, the language is growing as well. People are constantly working on making algorithms faster and able to deal with larger data, and they're pretty successful. You can use R to analyze data tables up to millions of rows and thousands of columns in size and get reasonable performance, and if you need very, very high-scale throughput, there are special versions of R which are specifically designed for that. So the language and the expertise have been able to scale up and evolve with the demands of the field, and this is basically why it is still a growing language, growing in use.

So, I mentioned graphics, and I hope we'll be doing quite a few graphics on data now. Good graphics are superbly valuable, and poor graphics are worse than none. I think I'm a very visual person.
I think the reason why I ever got into structural biology was that I was walking down the hallway and there were people at a computer graphics terminal, working on defining a protein structure, and I looked at that and, wow, life changing. And in the same sense, looking at your data through the lens of visual impressions can be, well, maybe not life changing, but project changing, and really be the key moment of you understanding what's going on. But you have to make sure that your graphics actually show the information that you're interested in and don't obscure it. There are a lot of traps related to adding too much color, adding irrelevant graphical elements and so on, and people who know very much about design have written books about that. I particularly like Edward Tufte's classic, The Visual Display of Quantitative Information, which is a great read; if you seriously go into data analysis, pick up a copy in the library and look at it. But fundamentally, there's actually one simple rule that you can pursue, and that is: use less ink. The less ink you need to display your data, the more your plots and your graphics are going to be dominated by the actual information that you're trying to show. So the less ink you need, the more prominent the information becomes. Now, R's default plot formats don't always correspond to that rule, but we'll be looking at alternatives; keep the rule in mind, and certainly, if you ever feel that we're doing something that doesn't look nice, that may be a great opportunity to discuss how to do it even better. Use less ink, make sure that all the elements on your graphics are necessary and informative, but do make sure that all the information in your data is displayed.

Now, many people don't use the default plotting devices within R, but work with a package that is called ggplot, or now ggplot2. So very often in discussions you will find plots that have a grayish background with white grid lines, and that's a sure sign that the graphic was produced with ggplot2. Personally, I don't use ggplot2, and I won't be teaching it here. That doesn't mean that it isn't excellent and very useful. It's just that I'm used to doing things differently; I have a certain mental investment in working in a particular way, so that's what I try to stick to. ggplot2 is very flexible, but it is built around a grammar of graphics, so there's a particular theoretical background to why it works the way that it does. I can maybe zoom in very briefly on this slide here; this is just an example of what that syntax looks like: reading a file, then ggplot() with a data object, defining particular aesthetics, and then you build the plot with plus: add a geometry of points in red, plus add a scale of area with particular parameters, plus add a geometry of text with a particular size, and so on. So you have an object and you use the plus operator to add attributes, or ways to display things, to it. If you get used to that, it's a decent way to describe what you're doing, but it has its own grammar. I think I have one ggplot2 example, just to make sure that it's installed and working for you, that we'll be looking at later on.
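To give a feel for that idiom, here is a minimal sketch of such a call chain; the data frame and column names are invented purely for illustration, this is not the example from the slide:

library(ggplot2)

# toy data, purely for illustration
dat <- data.frame(x = 1:10, y = (1:10)^2, label = letters[1:10])

p <- ggplot(dat, aes(x = x, y = y)) +                      # data object plus aesthetics
     geom_point(colour = "red", size = 3) +                # add a geometry of points, in red
     geom_text(aes(label = label), size = 3, vjust = -1)   # add a geometry of text labels
print(p)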
Here's a fun example of graphics. This is actually a screenshot from the web, from Microsoft TechNet, where they write: what if you could gussy up a report or pretty up a chart without much additional work? What if, using just one extra line of code, you could create a Microsoft Excel column chart that included a cool gradient fill like this one? Wouldn't that be awesome? Now, I'd like you to look at this and think of at least two things that are wrong with this graphic, that are terrible about this graphic. Everybody got two things? Or, if you don't, you might tell me: okay, that's actually a very nice graphic; that's exactly what we think a good graphic should look like.

So here are some problems. This whole data set actually only corresponds to five numbers; every number has its own column here, so all this ink is used to display five numbers. There are no units on the scale. There are numbers here, but we don't actually know what these numbers mean. Never, never come to me and show me a plot without units on the axes. Our graduate students do that a lot, and usually in the post-mortem of their presentations somebody has something to say about it. Now, the whole point that was being made here about the color gradient shows you that it is actually meaningless, because the colors have no connection to the actual scale. Note that the black part of this column is up here, but the black part of that column is down here. So it's just a gradient that paints the columns; it has no meaning whatsoever, and in that sense it's actually misleading. Sometimes we have gradient plots where we use color as an information dimension, i.e. we take sometimes quantitative, sometimes qualitative information, encode it in color and then apply it, but that's not what is being done here. Then there's this whole business with the 3D approach. Even 2D bar charts can sometimes be misleading, because these are just 1D numbers, just values in one dimension: is it the height we should be reading, or the area, or the volume? And if we are interested in the actual values that are being displayed here, we can't easily retrieve them, due to three-dimensional parallax: there is a grid, but the grid is displaced from the column, so we don't actually know what the numbers are. Maybe to improve this plot we should just print the numbers next to the columns, and maybe then lose the plot and just keep the numbers. And adding this third dimension is completely superfluous, because it has no semantic meaning; it doesn't describe anything about your data at all. Things like that are what we consider chart junk. I'm not trying to diss Microsoft here, even though it's a Microsoft example, but I think this shows an attitude that doesn't respect the data, and that's the fundamental problem. You're trying to display something, you're trying to convince someone, with features drawn over your data that aren't actually related to its contents. I guess this is what "gussy up" or "pretty up" means and is supposed to do in the first place. In a particular context, maybe this has its justification; in science, I would say usually, and probably, not.

So where do you find good examples? There are a number of websites dealing with information design that you can find and browse. One of them is visualizing.org.
And on a rainy day, if you have a few moments, explore these things, browse through them, and think about what the data is and what the key idea is of how people are working with the data to tell a story, to make it speak: to extract the information first, and then to transport and communicate it. Of course, we don't always have R packages that correspond to each of these individual graphs, but it's relatively straightforward, and we'll go through a number of examples, to work with R and change the default plot parameters in a way that makes the information even more tangible. I think these are excellent examples that show that, even though we have this principle of using less ink and trying not to let the graphic get in the way of the information in the data, information design still can, and probably should, be aesthetically pleasing and fun to work with. Another site you might want to visit is datavisualization.ch. And for those of you who spend their time on Reddit, there are two subreddits that deal with visualization and data analysis. I try to avoid Reddit myself: fantastic time sink; I just don't have the time for that anymore. So: graphics inspiration and advice in the subreddit on visualization, and there's a very nice subreddit, Data Is Beautiful, where people show how they find beautiful or inspiring relationships in the data they look at.

Now, rather than going through these here in a very theoretical way, I think it's time to move to our actual scripts. I hope you've all downloaded and installed our first RStudio project for the day. So we've loaded the file R-EDA-Introduction.R; if you don't have it loaded, open it from the Files pane and it should come up. This script file is basically the backbone of this module. We'll go through the script line by line. I'll try not to add lines as I work through it, so, by and large, the line numbers you see on my screen correspond to what you have on your screen as well. If it goes out of sync, you'll maybe just have to add two or three lines; it's not going to be far off. As we go through a script like this, you will see actual R commands, and whenever you see an R command in the script, you can execute it in several different ways. If your cursor is in a line, you can simply press Ctrl-Enter on Windows or Command-Enter on the Mac to have it executed. So this is the R command getwd(), get working directory, and if I execute it by pressing Command-Enter, the command gets sent to the console, which is the pane down in the lower left, and gets executed. I can execute only part of an expression by selecting it and executing that, or, if my commands span multiple lines, I can execute entire blocks of the script in the same way. So the script is basically the backbone of your work with R. All your commands go in the script; you try to have everything that's associated with your project in a script file. I rarely type things in the console, except to try out some syntax or so. Usually everything I do goes in a script, and I thus have a workable, reproducible record of my work with the data. This is why scripts are really important.

Now, all of this is a project in RStudio. A project like this one, which we downloaded from GitHub, is a great way to work with projects that you want to share among many collaborators.
You can add collaborators to your GitHub account, and then everybody has access to the same project files and can synchronize them and upload and download data and so on. But of course, you can also just put your own projects on GitHub, with two caveats. One is that GitHub doesn't like it if your data files become too large. Now, it's not entirely clear to me what "too large" means, but try to be a little bit mindful of the fact that you're probably using this as a free service. And the second is that everything on GitHub, except files in an account that you pay for, is publicly available. So your confidential clinical data should probably not go on GitHub, unless, say, Bristol-Myers Squibb has paid for a private GitHub account that they lock down, make sure nobody else can access, and keep nicely encrypted. Now, if you have that, then of course it's a great arrangement: your most important data lives in the cloud, with people who know how to keep it from just disappearing when your computer's hard drive breaks. Note that hard drives breaking is not a question of if, but when. In my bioinformatics courses, students have to prove to me that they have a scheme for backing up their computer; that's just part of computer literacy. So backup schemes are important, and GitHub for projects is possibly one of these backup schemes.

So when you download the project, a number of things happen automatically. The first thing that happens when you run a project, either when you install it first or when you go back to it and load it through File > Recent Projects, is that this little file is executed: .Rprofile. This is one of the files in the project package in which everything came. So let's have a brief look at .Rprofile; I just click on it, and this opens .Rprofile in the editor. This is the first file that this project executes. It defines a function which simply sources something that I keep in a separate initialization file. Now, this is a bit of a hack that I needed because RStudio uses its own editor. I wanted to make sure that this project always opens with one particular file, i.e. the introduction script. Normally, I would just put a command into my .Rprofile: file.edit("R-EDA-Introduction.R"). However, in RStudio this doesn't work, because while .Rprofile is being run, no valid editor is available yet; RStudio uses its own internal editor and not R's default. So this is why I had to take the process apart: at first I load only the definition of what needs to happen, and I defer opening files in the editor until RStudio is actually able to process this. Normally, when you work with plain R at home and simply have your windows open, you would not need to take it apart; everything could go into .Rprofile. But the file.edit() command doesn't work this way in RStudio.

The next thing .Rprofile does is load a file, typeInfo.R. This is something we developed yesterday: a little R script that contains the definition of a function which gets detailed information about an R object. It prints the object; it runs str(), which gives structured information about an object, for example, if it's a data frame, the number of rows and columns and the types of the columns; it reports the mode, i.e. whether something is numeric or logical; and the class, which is information that R uses to understand whether something can, for example, be printed or plotted, and so on.
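As a rough sketch, not necessarily identical to the version that ships with the project, such a typeInfo() function could look like this:

typeInfo <- function(x) {
    # Inspect an object: print it, then report its structure, mode,
    # internal type and class. (Sketch only; the project version may differ.)
    print(x)
    cat("\nstr():\n")
    str(x)
    cat("\nmode:  ", mode(x), "\n")
    cat("typeof:", typeof(x), "\n")
    cat("class: ", class(x), "\n")
    invisible(NULL)
}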
So this function is automatically loaded, with the source() command, as your .Rprofile starts up. If you now look into your environment, the function typeInfo() is actually available. This is automatic, and it's an example of local customization of your R startup: you put things into your .Rprofile, you load functions that you need all the time. You may, for example, define shortcuts to directories that you need all the time. For example, if your project directory is under version control, but you don't want the large data files that you use to be under version control, you might have a separate data directory, which is not contained in your project directory but somewhere above it or next to it in your directory hierarchy, and you can then define a shortcut to that data directory in your startup, and so on. So this is local customization. Of course, you are free and welcome to edit this for your own needs if you have additional ideas about how R should behave and what kinds of functions and files should be loaded.

Okay, so that's .Rprofile. There are two PDF files, R reference cards, which have condensed information about R. If you click on them, this will probably, if your system is configured for it, open your default PDF viewer. The R reference card is concise information about common tasks and how they are done: indexing vectors, for example, the main operators and what they need, some key packages and how to work with packages, data conversion, and so on and so on. It's probably worthwhile from time to time to review this card, to enjoy how much you've learned along the way and what you already know, and to be mindful of the things that you don't know yet. It's kind of hard to look something up if you have no clue that it even exists, so a concise reference card like this can help you get a vague understanding of what's out there, and that will help you look for information as you go along. There's a similar card for data mining, which I also find interesting. It covers classification and prediction, regression analysis, clustering: the main functions that are used for common tasks in data mining, i.e. trying to work with unstructured datasets and get some first-level information out of them. So this is for you to study at your leisure, or to refer back to, especially the reference card, if you're trying to solve a particular problem and just can't get the syntax quite right.

There's a rather large file called PlottingReference.R. This plotting reference summarizes many types of plots and plotting tasks. We'll be going through quite a few examples from it as the day progresses, but there's also some more in-depth discussion of colors, and examples of how to produce box plots or how to color pie charts. And as usual, you can select, execute and plot things in this file, so it's not just a reference, it's also a working R script with examples.

What else do we have? Okay, yes, templates. We discussed script and function templates yesterday; I have two template files in here.
A function template is something that you can copy and save as you write your own functions in an R script. It embodies the kind of information that we usually want to see in a well-documented function: you put a little note about the purpose, any notes on things that you want to develop in the future or things that don't quite work as they should right now. The function has a name, it often has parameters; the parameters need to be explained and documented, and the function usually explicitly returns a result, which also should be documented, so that somebody in the future can read and understand your function code. Most often that somebody in the future is going to be you yourself, half a year from now, and it's always so embarrassing when I look at old code files, notice I haven't commented them, and have no clue what's going on. So yes, you can save a lot of time if you invest a bit of time in properly documenting and commenting your code. People often ask: what's the correct amount of information to put into code comments? That will, of course, change. As you become more proficient with the language, you need fewer and fewer comments on what is happening, on what you are doing at a given point. In principle, the syntax itself, the code that you write, should explain what it is doing. What it doesn't explain, because it's not designed to do that, is why it is doing it. The why: why are you subtracting 25 from your data set? Why are you taking logs and not using the values directly? That is the kind of stuff that really needs to go into your comments. The goal for commenting: imagine that you have an undergraduate project student who needs to work with your code. Put yourself in their shoes and imagine what level of information they would need to be able to work with it. That's the level of code comments that you need.

Similar to the function template, there's a script template in this project folder. This is for longer scripts that basically integrate functions, the kind of script that I use to accompany my actual bioinformatics work on a particular project. Once again, there is the purpose, there is a version; I summarize what input the script expects (what data files, for example), what output is produced, and what dependencies it has. Dependencies for this purpose are usually packages that you need, or special resources that you need to process your data. The first command here usually is setwd(), which sets the scope of your file system, the working directory, to whatever the project directory is. This is usually not necessary if you're working with RStudio projects, which set the session's working directory to the project directory automatically: the working directory is then the one you defined as the project directory when this was installed. But whichever way you arrive at it, you should really make sure in your script that the working directory is the correct one, and you can check that easily by simply typing dir(), which gives you a listing of the directory that you're currently working in. So getwd() will tell you what the name of the working directory is, but it may still not be the right one. The right one is the directory that contains the files that you need. Therefore getwd() gives you one level of information, but dir() tells you not just whether you are where you think you ought to be, but whether you are where you really ought to be, because that's where your files are.
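To make the shape of such a script header concrete, here is a minimal sketch; the header fields follow what was just described, but the file name, path and contents are invented for illustration:

# exampleAnalysis.R
#
# Purpose:       illustrate the script-template layout described above (sketch only)
# Version:       0.1
# Input:         a data file expected in the working directory (hypothetical)
# Output:        plots and summary statistics
# Dependencies:  none for this sketch
#
# setwd("~/projects/EDA-workshop")   # only needed outside an RStudio project;
#                                    # the path is just an example
getwd()   # what does R think the working directory is?
dir()     # ... and does it actually contain the files we need?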
The next thing I define in scripts is parameters. Every parameter in the file should go here, with a little comment on what it does and why it is there. Try to completely avoid what we call magic numbers within your script: if some number 42 appears somewhere in your script and it's just a number, this can make things very confusing. Why is that number exactly that? Is it 42, or is it 1441.973, and what's the significance of it? So these things should be collected and commented; I put all parameters in one place. Additional parameters might be paths and file names, so as to make sure that your script works with one particular version, from one particular location, of a data file, to give reproducible results. In principle, if you have all your parameters defined here, then to adapt the script for a different purpose all you essentially need to do is change the parameters and run through it again.

Next, I load all required packages. The particular syntax I use in this paradigm has a conditional expression, and within it the package is downloaded, installed and then loaded only if the attempt to load it fails. There are two very similar and largely equivalent commands here: one is require() and one is library(). Both require() and library() load a package and make it available in your global namespace, so that R has the functions and the data that were bundled with the package. The standard way to work is to use library(), but require() is a special version that has a return value of TRUE or FALSE, depending on whether loading the package was successful. So in my case here, since I have already installed the RUnit package, if I load it like this it should return TRUE. Well, it didn't actually print TRUE; anyway, it returns TRUE as a value, invisibly, and therefore I can test for success in a conditional expression like this. So this means: if loading is successful, this whole block is not executed; if it is not successful, I step into the condition and install the package first. The main reason why it would not be successful, of course, is that the package has never been installed on my computer. In this way I avoid, within script files, that every time I re-run the script the computer has to access the internet, download the package and download all the associated information, which may be a very large amount of data. So the installation of the package, downloading it from CRAN or wherever, only runs if the package doesn't already exist on your computer. If it already exists, it simply gets loaded; if it doesn't, it gets downloaded and installed here. So this is the paradigm that I usually use in my scripts.

So can I ask a question? Sure. I apologize if I'm taking you back a little bit; it might help me to understand how this platform is different from what I'm used to. When we were talking about the working directory: when you work with Linux, you actually point things to the directory, you actually call up the directory to execute the command. Why don't we do that here? Well, it depends. If you want to execute a file in Linux, either you need to have the whole path and file name of the executable on the command line, or the directory that you're working in has to be on the path. Somewhere you define the path, which is the list of directories that are searched sequentially in order to find files that need to be executed; so if something is on the path, you don't need to specify the explicit path to it. Now, what we do here, as we setwd(), is basically define the default path for everything that is to be found. So when we talk about a working directory, what this actually means is that, implicitly, all our directory and file accesses are complemented by the path to our working directory. So in that sense, it's actually the same thing. Thanks. Okay.
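Coming back to the package-loading idiom from a moment ago, this is what that paradigm looks like as a minimal sketch, with RUnit as the example package:

# load RUnit; download and install it first only if it is not yet on this machine
if (!require(RUnit, quietly = TRUE)) {
    install.packages("RUnit")
    library(RUnit)
}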
Next step: defining functions. Note that nothing actually gets done yet; we are still just preparing things. We define functions; we might source a script file that contains many utility functions all at once; we might define functions within this program file. And then we finally get some work done in the section I call "process", where I load my data and transform it and plot it and display it and so on. The goal really is to have all the steps of the data analysis in this one file, so that I can completely reproduce all of my plots and all of my results from one file. After that, of course, the analysis probably gets submitted in a white paper somewhere, or a report, or to a journal editor, and they'll come back and say they want something changed. And if it's not all in there, you're going to really scratch your head and wonder: well, how exactly did I get this plot? I know I fiddled around a lot with the parameters and, oh my god, I forgot what these parameters were. If, however, the script file contains all of your information, it's really easy to change: you just rerun your analysis from the script and you're done. This allows for efficient work, but it also allows for reproducible research. In this day and age, when most of the inference in biology publications comes from computation and is merely motivated by actual observation, making this computational inference reproducible becomes ever more important.

And finally, tests. You should always attempt to add tests. There's a framework, RUnit, that basically allows you to independently test all of your functions and results. It is very seductive to work with data and generate plots, and if they look nice, you're often tempted to accept them. Your goal as a scientist, however, is to be completely critical about the work that you do. Never believe anything you say yourself; and of course, never believe anything anybody else says until you can prove that it's right. So spend time and effort on making sure that your analysis is actually correct, and add tests. A corollary to that, and I think I just jumped ahead and mentioned this: when I do data analysis, I usually first try to run the analysis on simulated data. The reason is that if I work with simulated data, I know what the results have to look like. If I don't get those results, if I don't recover the parameters I put into my simulation, then I know there's something wrong with my analysis. If I did this on real data, where I don't know the parameters, I would never catch the problem. This is why an analysis usually starts from simulated data first. R gives you a lot of possibilities to randomize things, pick random values and work with them, so that you can be reasonably sure that your analysis actually does what you think it ought to be doing. If you can't retrieve your parameters from simulated data with your analysis, you can't expect to get reasonable numbers from your real data either. So that's a very useful first step.
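Just to make that concrete, here is a tiny sketch of the idea; the numbers are invented, and a real analysis would of course do more than compute a mean and a standard deviation:

set.seed(112)                           # make the simulation reproducible
x <- rnorm(1000, mean = 3, sd = 0.5)    # simulated data with known parameters

# the "analysis" should recover, approximately, the parameters we put in
mean(x)   # should be close to 3
sd(x)     # should be close to 0.5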
Okay, so these are the two template files. If you want, you can simply copy them into a resource directory; try to keep a copy somewhere and don't change it. What else is in the box? Oh, actual data. There's the table of supplementary data from the high-throughput data analysis paper that we discussed yesterday. And there is also data on graft-versus-host disease, which we will be using; this is flow cytometry data, which David will be very familiar with. I'll probably call you out and ask you to tell us what it means. And there are also two papers. Jaitin et al. is what we discussed yesterday, an analysis of cell types based on high-throughput data. And Weissgerber et al. is a paper on how to go beyond bar charts: some problems that bar charts have and how to improve on them. That's a really excellent read. It's also fun because Weissgerber et al. provide some functions and code, linked to the paper, on how to make better plots than simple bar charts, but they don't do it in R. So that's a great exercise for you: take these ideas and translate them into an R function on your own. You might actually try to do that whenever we are at a task or activity checkpoint and you're waiting for everybody else to finish: have a look at that paper and see if you can come up with a way to improve bar plots with a substitute function, according to these ideas. Now, these papers are zipped. Of course, everything on GitHub is publicly available, and I shouldn't be posting copyrighted material in a publicly available repository, so I've zipped them up, and the zip is protected with a password. I hope this addresses all copyright concerns and I'm not going to be prosecuted by some pinstriped colleague sending me nastygrams and asking for $3 million in damages.

Okay. In yesterday's workshop we worked with a data file, and we need to load this data file because we're going to be using it later. So there's a script that loads it, and I would like you to source this script, readS3.R, which is part of the project here, and then inspect the object LPSdat that the script produces. Now, as yesterday, place a green post-it (it's actually yellow, but I call them green anyway; so, yellow-green) on the lid of your laptop when you're done, and place a pink post-it if this didn't work or if you have questions. I'm not going to repeat this checkpoint instruction every single time; usually it's just a single line that says "checkpoint". So please, if there are any problems with the tasks and you can't complete something or something doesn't work as expected, put a pink post-it on your laptop. And I haven't given you the actual command to do this; that is for you to figure out: source the script file and inspect the object. Now, if you're waiting for others to finish a checkpoint, there are two suggestions. You could study how RUnit can be used to write tests: install and load the RUnit package and explore some of the vignettes it contains. Or you could go to this URL and learn something about regular expressions. We mentioned regular expressions yesterday but didn't really work with them; they would unfortunately be beyond our scope, but here's an introduction that you can play with and enjoy. So, just to keep you busy.

All right, let's get on with the task, and let me see green or yellow post-its when you are done. There are several ways, as you hopefully remember, to inspect the object. So the first thing is source, what was the name, readS3.R, and I can execute that. Then the data object exists: my Environment pane shows me it has 1341 observations of 16 variables. That's what we expect. Now there are several ways to inspect it. There's an icon in the Environment pane that you can click on, and that opens the data frame in a tabular view; we see what the column names are. What I most frequently end up doing is simply calling head(), which gives me the first six lines: the row names, the column names and the actual contents.
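In code, the steps so far look roughly like this (using the script and object names given above):

source("readS3.R")   # runs the script, which creates the LPSdat data frame
head(LPSdat)         # first six rows: row names, column names, values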
Or we can use our beautiful function typeInfo(), and this gives me more detailed information, telling me that this is a data frame of 1341 observations, and describing the individual columns: the first column is named genes and its type is character, i.e. string data, and then we have columns of numeric values which are, in this case, logs of expression enrichment. And the class is data.frame.

As I take you through task solutions, I'm going to put them into this file here. We had some solutions to the subsetting examples yesterday; people asked me to post them and I didn't, and I think they're now lost, because I think I just closed everything. I was a bit tired, sorry. So I'll have to reproduce them. I'll try not to do that this time and actually save everything in the file, which I will then from time to time upload to GitHub, so you have it current. The advantage of typing into a separate file is that I'm not going to change the line numbers, so my line numbers and your line numbers should stay the same. You might want to open a file of that or a similar name yourself, or just type into the script here.

Okay, now subsetting and filtering. Here's where we go into quiz mode. Subsetting and filtering, working with data frames, is really crucial and key to all kinds of analysis. Here are three quiz questions for you to solve; this is your next task. Write R expressions that get the following data from the object LPSdat. First: rows 1 to 10 of the first two columns, in reverse order, so we want the 10th, then the 9th, then the 8th, and so on. Second: gene names and expression values for the column MO.LPS, that is, monocytes stimulated with lipopolysaccharide, for the top 10 expression values. So first you'll need to figure out how to sort or order your data according to the values in that column, and then you simply pick out the top 10 or the bottom 10, depending on what the order is. If you're not sure about the order() command, you can use help, or just type ?order to get to the help page. And finally: list all genes for which B cells are stimulated by LPS by more than two log units. These are things that we did yesterday, for those who weren't here, and things that I hope people who didn't take the introductory course would still know how to address. If you don't, this is a great opportunity to start thinking about how to solve a problem in R that you don't yet know how to solve: structuring things step by step and then figuring out how to address each single step. So let's do this. Once again, if you need help, post your pink post-it and we'll try to help you out.

Perhaps we can note down a solution for the first task. So, everybody pretty much passed that one? Yes, no? Maybe? Sort of? Kind of, you don't know? So what would that look like? Rows 1 to 10 of the first two columns, in reverse order. Yes, that. We need to extract something from the data frame, so we use the square-bracket subsetting operator. And then we can use 10:1 as indices for the first ten rows; this expands into the numbers 10, 9, 8, 7, 6, 5 and so on, which you could also have written out by hand, but we don't have to. And then 1:2 for the columns. So this gives us the correct result: row 10 down to row 1, and columns 1 to 2. It just so happens that 1:2 is the consecutive range of the first two columns.
If we want different kinds of columns, we could also have written it like this, which is the same thing but more flexible, and expands to things like this here, where I would have the control values. So whether you use a consecutive range or explicitly specify the indices in a vector makes no difference; it's all evaluated in the same way. That's subsetting by row and column index. Incidentally, letters is a built-in variable in R; it simply expands to a vector of all 26 letters of the Roman alphabet, the Western alphabet.

Would anybody care to explain what order() actually does? Go ahead. It gives you a vector with the rank of every element of the vector, in ascending order? Yes, okay. So it ranks the elements of the vector that you give it, in ascending order. This works for numbers, and it also works for characters, for letters. Now, this ranking is useful because it essentially gives you the indices that you need to use if you want that vector in sorted order. So this order here says: in first position we have element number two, which is a B; in second position we have element number five, which is a G; in third position we have element number four, which is an I; in fourth position we have element number one, which is an R; and in the final position we have element number three, which is a Y. And of course this ordering is now alphabetic. So then, if I use this expression to index the vector, I will retrieve the elements of x in exactly this order: first number two, then number five, then number four, then number one, then number three, and that gives me an alphabetically sorted vector. Now this is important, because if I have a data frame, I can retrieve this ordering vector from one of the columns and then subset the rest of my data frame, or whatever I'm interested in, according to these instructions. For example, I can then retrieve the gene names that correspond to the largest or smallest values. In principle, all I need to do is use this expression and then apply some kind of subset: I can take this vector and get only the first three elements, for example, or I can assign it and then take the first three elements like that. Same thing.

Now, with that principle, you should be able to pick out the ten highest expression values from a column in LPSdat. A hint: if you do exactly the same thing that I've done here and simply pick the first ten rows, you will get the ten lowest values; but you want the ten highest, so you will probably need to tell order() to sort in decreasing rather than increasing order.

Okay. So I hope you've all been able to solve this; if not, do ask us if you're stuck somewhere. There are several ways to do this, but the canonical solution, I guess, would be: take LPSdat, extract from it the rows in the order we just computed, and give me just the genes and that expression column. Oops. Okay. So these are the gene names, these are the expression values in decreasing order, and these are the rows they came from.
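Written out, the solutions discussed so far might look like this sketch; the object name is as above, and the exact column names are an assumption based on the description of the dataset:

# Task 1: rows 1 to 10 of the first two columns, in reverse order
LPSdat[10:1, 1:2]
LPSdat[10:1, c(1, 2)]    # the same, with the columns as an explicit index vector

# Task 2: gene names and expression values for the ten highest values in
# the MO.LPS column (column names assumed)
LPSdat[order(LPSdat$MO.LPS, decreasing = TRUE)[1:10], c("genes", "MO.LPS")]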
Now, this is one of the nested expressions that we use so often. If we want to break it down and look at the sub-expressions, that is quite easy: I can select the part that I want to inspect, press Command-Enter, and I get just that sub-expression. If things get too long on the right-hand side, you can of course put in extra lines and split the expression up to make it more readable. Okay, so that's this task.

What do we need to do here? All genes for which B cells are stimulated by LPS by more than two log units. So what are our B cells? Well, if we look at the column headers, B cells are these columns here: B.ctrl and B.LPS. Okay. So what does it mean that B cells are stimulated by LPS by more than two log units? What would be true about these genes? Take, for example, this one here in row 43: how much is this gene stimulated by LPS? One log. One log? A little more. 1.2? 1.2. Yes, 1.2. So if I stimulate with LPS, the expression value changes by 1.2 log units. So apparently what I need to do is take a difference: the difference between the value in this column and the value in that column. Now, most operations in R are vectorized, so I don't need to do this row by row by row; I can do it all at once. I can simply say: this column minus that column, and that gives me the differences. So for our first step here, we simply subtract these two columns.

How do we specify a column? There are two ways, in principle. One is by saying: I want all rows, and I want the column with the name "B.ctrl". So this is the column addressed by its name. I could also have said column 2, because it happens to be column 2, but putting a number here when you actually have a name available gives you a brittle solution. It's better to use the name, and thus make sure the semantics of your data stays tied to your analysis, even if you should insert or drop a column along the way and forget to update your column numbers. So this is one way to specify it. The same thing can also be achieved with the dollar notation: LPSdat$B.LPS picks out that column by name as well. The two forms are exactly equivalent. I find myself using the dollar shorthand a lot recently; it's maybe not as explicit as the other one, but it's very convenient.

Okay. If we simply execute that, we get 1341 expression-change values. Now, what did we want? We want those genes that are stimulated by LPS by more than two log units, i.e. we want to find those for which the value is greater than 2. So we simply extend this to a logical expression: greater than 2. Now, if I execute that, what do I get? TRUE and FALSE, lots of TRUEs and lots of FALSEs. The TRUEs cluster at the top here; this is simply due to the way this dataset happens to be organized, with the genes that can be stimulated in these cells at the top, but there are also some TRUE values further down. Okay. So what do we do with that vector, how do we get at the genes? This is simply a vector of TRUE and FALSE; how do we use it to actually extract values from our dataset? Yes, but that if statement would apply not to the entire vector, just to a single condition. if statements are not vectorized: if I put this vector into an if statement, it would look only at the first value, and thus it would give me all of the genes, and it would also issue a warning that the condition has more than one element and the rest has been ignored. So in principle you're right, a conditional expression, an if statement, would work, but not in this context. You would have to write a loop that goes through your dataset element by element and applies that if statement at every single step. We could do that too, but there's an easier way.
The easiest way is to simply use this vector of logical values inside the square brackets. We can use logical expressions within these square brackets, and that's it. The vector goes in there, we extract every row for which that vector is TRUE, and we print the values of the genes column: all the genes for which B cells are stimulated by LPS by more than two log units.

Things like this should be at your fingertips. We practiced some of them yesterday; for everybody who was here yesterday, this is just consolidation. If you weren't able to remember it, this shows you how important it is to revisit these things and practice them time and time again. Stuff like this will come up every single day that you work with data: you will look at your data, you will find something curious about it, and you will want to say, oh, but I want to know which genes are high in this column, low in that column and unchanged in this third column. How do I do this? How do I write this? Well, it's some variation of what we've done here or discussed yesterday. Those of you who weren't here yesterday, ask for the link to the course materials; you can download the two R projects with the introductory material and study them to bring yourself up to date. This is basically the essential stuff. These are your very basic chords on the guitar; this is how to play a C chord.

Can I ask a question? Yes. Yesterday you did talk about missing values, NA and all that. I have a data set that I'm really thinking of applying this to, and some of the entries are undetermined; there's no value there. In principle, there are two things you need to consider. Many functions will not work out of the box if any value is missing. However, they usually have a parameter that you can set, and if I remember correctly it is na.rm, equal to TRUE or FALSE: remove NA values, yes or no. For example, say I have a vector like c(1, 5, pi, NA). This is a numeric vector of four elements: the integers 1 and 5, the value of pi, and NA, not available. Now, if I try to get an average of this, I get NA: the function mean() is telling me, a number is missing, I don't know what to do in this case, so I'm telling you the mean of this is not available. Now, if we look at the help for mean(), there's a parameter here, na.rm = FALSE; that is the default. So all we need to do is set na.rm to TRUE, and that tells it: skip those values. If I say mean(x, na.rm = TRUE), remove the NAs, I get my correct value. And as you can see by eyeballing the result, this is the sum divided by three values, not by four: the NA simply doesn't count, it just gets removed. Now, of course there's more to be said. You can work with functions that allow this, you can manually remove the values if you loop through them, or you can use the function is.na(), which is also vectorized and gives FALSE FALSE FALSE TRUE for my vector here. So I can use that in a conditional expression on my dataset: give me all of the rows for which is.na() is FALSE, and whenever is.na() is TRUE, just skip over it. This will remove all of the rows with missing elements, so none remain.
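Put into code, the task solution and the NA handling just discussed might look like this; the column names are the ones assumed above, and the vector is the toy example from a moment ago:

# Task 3: all genes for which B cells are stimulated by LPS by more than
# two log units (column names assumed as above)
LPSdat$genes[(LPSdat$B.LPS - LPSdat$B.ctrl) > 2]

# Missing values: skip NAs in a computation, or drop the affected rows
x <- c(1, 5, pi, NA)
mean(x)                  # returns NA
mean(x, na.rm = TRUE)    # sum of the three available values, divided by 3
is.na(x)                 # FALSE FALSE FALSE TRUE
# dropping data-frame rows with an NA in one column (column name hypothetical):
# myData[!is.na(myData$someColumn), ]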
However, sometimes you don't want to remove them; sometimes you want a good guess of what these values could have been. And this is the topic of imputation of missing values. It's an interesting topic, how best to do that. For example, you could say: well, we just take the average of all the other values and put that in wherever something is missing. This has a certain bias, which can be problematic. A better solution may be to randomly sample one of the existing values and replace the missing value with that, which has less of a bias. But whatever you do, it can be handled. So if this is a frequent problem for you, Google for, or read up on, imputation of missing values.

Just to clarify: when you say remove, in this case you want to remove the cell, not the whole row? Well, you can't remove just the cell; it's already removed, in the sense that it's not available. If you have a data frame and you remove a single cell, that won't work; everything else would go out of sync. So it's already removed, it's flagged as not available. Some analyses, for example computing an average, will require that all of the values are present, and that is what you handle in this case. Okay. I've gone eight minutes into your coffee break.
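One last sketch before the break: the sampling-based imputation idea mentioned above, with an invented toy vector:

x <- c(4.1, NA, 3.7, 5.0, NA, 4.4)

# replace each NA with a value sampled from the observed values
missing <- is.na(x)
x[missing] <- sample(x[!missing], sum(missing), replace = TRUE)
x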