So, I'm Boris. I inherited this workshop and have been teaching it ever since. It has grown and expanded; initially it was very statistics-focused, but we quickly noticed that what you really need is not so much learning statistics as learning to work with programs: programming in principle, thinking about problems, structuring them, breaking them up into small steps, translating whatever you are thinking about as a workflow into working code, testing code, debugging it, and applying sound software principles. So that is what we will mostly be focusing on. By way of introduction: my original background is actually in medicine. I was a young medical student at the University of Munich, went through all of that, graduated from medical school, got my license to practice, and thought: what am I going to do with my life now? I started looking at research laboratories and somehow ended up doing a PhD thesis in molecular biology, in real molecular biology, and I got infected with the research virus. It was so fascinating, trying to apply all your knowledge and your thinking to discover new things about the world. I guess I went into medicine simply out of curiosity, and I found that this curiosity was most satisfied in research. And since working days only have about 36 hours in which you can do stuff, at some point I had to decide whether to continue in medicine or in research. I decided on research, and after my PhD thesis in molecular biology I had a wonderful time as a postdoc with Robert Huber in protein crystallography at the Max Planck Institute of Biochemistry in Martinsried, where we studied protein structure. So things got a little more theoretical.
And even though I have been programming for longer than any of you have probably been alive, I got to write some real code at that time, in FORTRAN, on very old mainframe computers. After that, looking at protein structure in more detail, I became interested in what makes proteins fold. Where is that information? How is that information even generated? How can we go from a linear gene to a three-dimensional structure that magically assembles itself? What is the nature of that information? To pursue that quest, that curiosity about the nature of information in life, a little further, I got into protein engineering: if we think we understand what a protein is about, can we swap out individual amino acids and make the protein perform something useful? I applied a theoretical approach that I had developed, which is brutally simple: take sets of homologous protein sequences and collect statistics on which amino acids appear at which position. If the set of proteins is well chosen, you can approximate the distribution of amino acids at every position as a canonical ensemble in statistical thermodynamics, so Boltzmann's law applies. You look at frequencies, and from the frequencies you derive free energies. That is a wonderful way to work, and it is something we can find all over nature and all over biology; it is the reason why it is so important and so useful to work with data. We can simply count occurrences, and if evolution is at work shaping the distributions, the profiles of how often things occur in which space, then that gives us a statistical free energy, a way to quantify the evolutionary pressure acting on things. This is what I was interested in during my postdoc time and in my first research group in protein engineering.
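The counting-to-energies idea can be sketched in a few lines of R. This is a toy illustration only, with made-up residue counts, not the actual published method: with RT set to 1, Boltzmann inversion turns position-specific frequencies into relative pseudo free energies.

```r
# Toy sketch: Boltzmann inversion of amino acid frequencies at one
# alignment position (hypothetical counts; RT set to 1; energies are
# relative to the most frequent residue).
counts <- c(A = 60, V = 25, L = 10, G = 5)  # observed residues at a position
f <- counts / sum(counts)                   # frequencies
dG <- -log(f / max(f))                      # dG_i = -ln(f_i / f_max)
round(dG, 2)                                # rarer residues: higher (less favourable) values
```

The most frequent residue defines the zero point, so all other values come out positive, in units of RT.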
And we actually learned how to stabilize proteins in a very predictable way from sets of homologous sequences. Based on that, I was recruited to come here to the University of Toronto in 2001, where I initially worked very much on protein engineering and then got more interested in the bioinformatics side of things. As you know, the wet lab can be a bit frustrating; computers tend to be more reproducible. Not less frustrating, but more reproducible. So I got into that and started focusing more on bioinformatics teaching. I actually direct the university's undergraduate specialist program in bioinformatics and computational biology. The teaching is great fun, because it motivates me not to focus only on the small parts of my domain that I'm interested in, but to spread my net more widely, keeping abreast of current trends in bioinformatics and certainly in molecular biology and molecular medicine. So that is the journey of curiosity: from medicine to molecular biology, from molecular biology to biophysics and protein crystallography, from biophysics to protein engineering, and from there to bioinformatics. I have always joked that if it gets any more theoretical than that, I'll be doing philosophy, and last year I actually started collaborating with philosophers on a number of projects on the biological side, related to systems biology, which is turning out to be quite interesting. So there you go. Now, most of what we'll be doing today is going to happen in an R project. I desperately hope all of you have gone through the pre-work and have installed Git, RStudio, and R on your computers. In the pre-work you learned how to download an R project from GitHub, and you have created a project folder on your computer. So the first milestone of this course is to download the project for this course, which lives on GitHub at... oh, I can just write it down here.
So this lives on GitHub at github.com/hyginn. What you need to do is open RStudio; you do not navigate to that address in a browser. In RStudio, go to the File menu and choose New Project, then click on Version Control, then on Git, and then enter the repository URL: https://github.com/hyginn/R_EDA, with the account name spelled H-Y-G-I-N-N. Computers are picky about that kind of thing. The project directory name should auto-fill to R_EDA. Then browse on your computer to find the project folder that you have created; the project directory will be created inside it when you click on Create Project, which I am not going to do now because I already have it. Once you are done with that, the project should introduce itself, say welcome, and ask you to type init() to set up the session. So type init(), and this will source and load some files from the directory. When you are done with that, please put up a blue post-it, and if you get lost, please put up a red post-it. This is our first absolutely crucial milestone. I see some people still working on that; if you have actually completed it but don't have your blue post-it up, please put it up. Could you help Leon? Blue post-it. Any problem in principle, Brent? Just the spelling of my name. Spelling? Yes, as I said, computers are picky about that kind of thing. All right, good. So type init(); this init() function initializes a few things. Let's talk briefly about what's in the box of the project. RStudio projects are supremely useful; I use them all the time, and they have really changed the way I work. Whenever I think about a data analysis question, I make a new R project for it. I start writing scripts, and I put everything I do into that script. A good setup is a script where you have your actual code, perhaps with ideas stored in a separate notes file, but just put everything into a script.
Even if you think you'll just try out some things on the console, that's actually much less useful. Put it into a script and load things from the script: it's much easier to edit, to go back, and to keep an overview of what you've been doing, and perhaps even to remember things you've done before elsewhere and copy code from there. So this workshop will be almost exclusively driven from script files like that. In the introduction workshop over the last few days we had very little actual code in the files, and our poor students had to write everything themselves. In this workshop we'll be working with functions that are often a little more complicated, and the point is to demonstrate how each function works, basically as a template that you can use later on. So there is much more code in the actual examples, which we will then just execute and walk through. Right, we're planning approximately five modules for this workshop. The code that we'll have posted by the end is pretty much self-contained, so if we don't get to cover something, you should be able to go through the code and work through it on your own, and you can come back to it at any time. Now, as I develop these things, I end up updating and changing things, not just at the last minute but after the last minute and beyond, so this material is constantly in flux. It will keep on living at that site, but I'll update it from time to time. When I update things, I use version control to commit the changes to my local repository here, and then I push my changes to the master repository on GitHub. And whenever something new ends up there, all you need to do is load the project, or be in the project, and choose Pull Branches, and that pulls down the updated information. This is extremely useful and a very quick way to share code.
If I change a few lines in the scripts here, I can push the change and you can pull it back in. Now, there's a bit of a conflict; I don't think we've figured out how to work with this optimally, but here's the thing. You need to write lots of comments into the code, experiment with things, try variations of parameters, note down your ideas and so on. If you do that and save your changes, and then I update something and you try to pull that down, Git is going to complain and say: whoa, we have a merge conflict here. Your local version has additional information which differs from the master copy up on GitHub, and if I just pull down that master copy, we would overwrite your local changes, which may not be what you want. So this is why you should not be editing the course files themselves: R_EDA-Introduction, R_EDA-Regression, R_EDA-Clustering, and so on. They are basically for you to read and work with. Now, where do you put your comments then? Well, in the introductory workshop over the last two days we made local copies of these files from the setup script, named, say, myR_EDA-Introduction, and you could work with those and save your changes. Then if something gets updated, I'm updating only the original and not your copies. So you're basically working with two sets of information: your local copy with your custom notes, and the one that comes down from GitHub. But there's a problem with that, because your local copies are then not going to receive the updates that I put into the original script, so that doesn't work all that well either. So in this workshop we'll adopt a different strategy; if you've been here the last two days, don't get confused. There's a file called myEDANotes.R for you to write your general notes in. This file does not live on GitHub, so anything I do on GitHub is not going to overwrite it, and there's not going to be a merge conflict.
So open myEDANotes.R, put all your notes in there, and let's see how that works. It will require switching back and forth between tabs. In fact, in the RStudio interface you can detach a file by dragging its tab out of the RStudio window, which opens it in a separate little window. I don't know which is more useful, keeping it in a tab or in a separate window; both are possible. Now, if you do want to edit one of the actual code files, you can do that, but then save it under a different file name. For example, if I want to edit this file here, I can put some edits in and then not save it, but save it as, say, a copy with a "my" prefix. You can do that, but note that these local versions are not going to be updated from GitHub. Confusing? I think I'm already confused. Are you confused? It gets worse. Keeping a journal. For things that you write as code snippets, use the little notes file, myEDANotes.R. If I develop things live, I'll put them into a code-snippets file, which I'll update from time to time. So if we come up with particularly crafty solutions to some of the tasks we're discussing, the code is going to be in there, I'll upload it, and then you can copy it down into your journal or wherever you want to keep it. But in principle, you keep your notes in myEDANotes.R. That's good for code. For ideas and concepts, it may be much more useful to write them down by hand. Whenever I go to a lecture, I'm writing constantly, all the time. Not so much because I'm a slow learner and a slow thinker who needs to go back to his notes and reread what he's just heard several times to finally understand it. The idea is more to use writing, to use keeping a journal, as a tool for focusing.
As you write and paraphrase what I'm saying, the concepts I'm talking about, you can be sure that you're actively engaging with it. It's not all just going over your head while you're thinking about very different things: the beautiful weather in Toronto, and how nice it would be to visit the islands instead of being here. So write. Write down all ideas, write down all concepts. This is going to be very, very useful. Once again, this is not a course on the internet. You're here because we're here to explain things and provide you with insights, emphasis, background, and context that you wouldn't find anywhere else, to share experience. So profit from that by writing it down. Keep a journal. I'm emphasizing this more and more in all of my bioinformatics courses now: between 20 and 25% of the final grade is given for student journals. That's not because I want to see what students write in their journals, but to emphasize to everyone that this is crucially important. You might have heard the term reproducible research. It's becoming more and more important: when we publish data, the requirements to provide an exact trace of where the data originated, how it was manipulated, where it went and so on are getting more and more detailed. And if you don't keep scripts and don't write down your ideas in journals, you're not going to be able to do that. I know that from experience. All right. So what's in the box? You have RStudio here. As you know, by default four different panes are open. Pane as in window pane, P-A-N-E, not as in discomfort, P-A-I-N, even though the distinction in programming is sometimes not very obvious. The lower left one is the console; this is where we simply type single commands. The upper left one is the script pane; this is where I do most of my work.
This is where I write everything, because it's very easy to transfer commands into the console, things like getwd(). If I type that into the script pane, how do I execute it? I do what? Ctrl-Enter. But what exactly, and how? You're omitting a detail here. Okay: I type Ctrl-Enter. Nothing happens. I need to put my cursor into line 125 and then type Ctrl-Enter. That will execute the command getwd(), which shows me the working directory. So I can type this here, I can retrieve what I typed in the history pane, and I can double-click things I find there. I can even search the history pane for commands I issued ages ago. I can double-click on something there, and that loads it into the console, so I can either execute the command again or edit it. But most usefully, I put my commands into the script and execute them from there. Now, what gets executed depends a little on scope. As you've noticed, the cursor has to be in the right place. If I execute something with the cursor simply placed in a line, I'm actually executing the entire block. So if one expression spans multiple lines, I will be executing multiple lines at once. For example, and I should put this in the code snippets: a loop, for i in seq(1, 10, by = 3). All right, what does this do? It's a simple for loop. Dan, what does this do? We did tons of these yesterday. It goes from 1 to 10 by 3s, prints whatever value it lands on, and then also prints the square of that number. Okay, good. Now, if I put my cursor in this line here and press Ctrl-Enter, I execute the entire loop. If I put my cursor in this other line, I also execute the entire loop, because the context is the whole expression between the braces. So I don't have to be in the first line; I just have to be in the right context for the entire enclosed expression. If I put my cursor in here instead, I execute just this one line. It's a different context.
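For reference, the loop on the screen, reconstructed from the description above (the exact code in the workshop file may differ slightly), looks like this:

```r
# A for loop spanning several lines: placing the cursor anywhere inside
# the braces and pressing Ctrl-Enter executes the whole block.
for (i in seq(1, 10, by = 3)) {  # i takes the values 1, 4, 7, 10
  print(i)                       # the value itself ...
  print(i^2)                     # ... and its square
}
```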
If, however, I select something, I execute only the selection. For example, I can select i, and that will show me... what? If I press Ctrl-Enter, what do I get? Exactly: the last value of i, or rather the current value of i, which is the last value it had when it left the loop, which is 10. And that's really useful, because we often have expressions in R with functions as parameters of functions and complicated selections, all evaluated from the inside out. So if I have something like that and I'm not exactly sure what a part of it is doing, I can just select the sub-expression, press Ctrl-Enter, and execute the selection alone. Here, the expression seq(1, 10, by = 3) gives me 1, 4, 7, 10, which also shows you how the very versatile seq() function works: from the first value to the last, incrementing as specified. There are two especially important parameters for seq(). One is by, whose default is 1, so seq(1, 10) is the same as using the colon operator, 1:10. by gives me larger or smaller intervals; I could say something like 0.5. There is also the possibility of specifying the length of the output, which I could otherwise compute by hand. Say you want 21 values, because you need 21 equally spaced x values to plot something; then I can just ask for 21 values. I type the first three characters of the parameter, and RStudio tells me which parameters begin with that, and what each one is, or, if there are alternatives, what those are. So length.out is the desired length of the sequence, and it has to be a non-negative number. I press Tab and then say 21, and I get 21 elements.
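The seq() variants just discussed, side by side:

```r
seq(1, 10)                        # default by = 1; same as 1:10
seq(1, 10, by = 3)                # 1 4 7 10
seq(0, 2, by = 0.5)               # 0.0 0.5 1.0 1.5 2.0
x <- seq(1, 10, length.out = 21)  # 21 equally spaced values ...
diff(x)[1]                        # ... with the increment (0.45) computed for you
```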
And the function automatically computes the correct increment, which I could do by hand, but I really don't want to, because I would certainly mess it up and need to debug it for half an hour before I finally got it right. Okay, so that was an aside about executing commands, and you will be executing a lot of commands in the scripts. Put the cursor into the right scope, press Command-Enter or Ctrl-Enter, depending on whether you're on a Mac or a Windows machine, and that executes an entire block. Or, if you want to be more specific, select something and do the same thing; it can be a multi-line selection or just part of a line, even a single character, and that gets executed. Commands appear in the history tab if you should need them; you can recall them by double-clicking and then execute them in the console. But of course you can also just use the arrow keys in the console to move up and down through previous commands, then edit them and execute them again. So there are many different ways. In principle, though, the way I like to work is to put all the commands I use into my script, edit them there, and save my script from time to time. Why did I get sidetracked with that? Where was I coming from? I have no idea. Let's continue with discussing what's in the box. Okay, the first file here is .gitignore. That's just a file that tells Git which files to ignore, i.e. not to put under version control. You don't even need it unless you are committing to your own repository. What I usually do is just copy it from an older project. I don't want to track R history files, I don't want to track those operating-system-specific files, and so on. Let me say something about how this project is configured with respect to history and all of that. In the project options, I've set up the project not to restore .RData into the workspace at startup, not to save the workspace on exit, and not to always save the history. By default these options are on; I've switched them off for this project.
Let me explain briefly what happens here. Normally, by default, R saves the entire workspace to a file called .RData on exit. When you exit, whatever is in your workspace is saved to this .RData file: the functions you've loaded, the values, the objects, and so on. When you then restart R, this gets reloaded and you can continue working with the same workspace you used before. Just like that, you would agree that this is probably a very useful thing. But I say: no, don't do that. It's not useful at all; it's supremely dangerous. It loads data into your R session from a source that you don't completely have under control. You don't remember whether the data objects in there are the ones that are actually broken, because you experimented with something that didn't work, or the correct version. How would you figure that out? Things can go subtly wrong if you make assumptions about what you load into your workspace. What you should be doing instead is recreating your workspace from scratch, from zero, with nothing loaded except explicitly from your scripts. And in those cases where a very expensive calculation runs for hours and creates intermediate data, save those intermediate files explicitly, under a file name you can recognize, with a comment in your script about what they contain and a version number, and load them explicitly when you restart. I think implicit state is a very unsafe principle. Well, let me not talk about Microsoft programs. It is a very unsafe principle for computer programs to do things implicitly, without you being in control and knowing what goes on. The chances of corrupting your data and doing something extremely wrong are very high.
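A minimal sketch of that explicit save-and-reload workflow; the file name and the "expensive" result here are invented for illustration:

```r
# Stand-in for the result of an expensive computation:
results <- data.frame(x = 1:3, y = (1:3)^2)

# Save explicitly, under a recognizable, versioned name (invented here),
# rather than relying on the automatic .RData workspace:
fName <- "myExpensiveResults_v1.0.rds"
saveRDS(results, fName)

# ... later, in a fresh session, reload explicitly and knowingly:
results <- readRDS(fName)
```

Because the file name is chosen by you and loaded by an explicit line in the script, there is never any doubt about which version of the data your session contains.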
On the R-help mailing list, to which I sometimes contribute, we often have people saying: help, I'm starting R and it crashes as soon as it starts. What usually happens is that some object was loaded that is incompatible with life on the computer as we know it; it got saved into the .RData file, and now R loads it on startup and that crashes the program, which is not good. The solution, of course, is to go to your project directory, delete that .RData file, and start from a clean slate. Better not to create it in the first place. So that is the story about saving the workspace to .RData on exit and restoring .RData into the workspace on startup: I don't do that, it's switched off. Be careful: when you make your own project, these options will be on by default, and if you wish to turn them off, you'll have to do that explicitly. It's similar with the history file, the option "Always save history (even when not saving .RData)". I switch that off too. I don't need history files, because I don't rely on the stored history; I rely on the commands in my script file. Everything is in my script file. That's where I read what I've been doing, that's where I edit it, and that's where I pull it out. Very often I open my project, go to the main file that basically controls everything, and just click on Source to put my program into a defined state and load everything I need. By default, however, R loads a file that is actually quite useful, called .Rprofile. This .Rprofile is an R script that gets loaded and executed when you reload your project. It's a little more involved than that: if you start R in a particular directory, the .Rprofile file in that directory is executed. But if you work from a project, that is the project working directory. I've also set up the project so that the working directory is the project directory.
So before saving everything, I made sure to set the working directory to the project directory. Even though my project directory is named differently, this is transportable, in the sense that your working directory is now also your project directory. So, .Rprofile. The .Rprofile here contains just two things: a function definition, and printing some output, not with the print() statement but with cat(), which is kind of like print. cat means concatenate; it does not tell R to print a kitten. Anyway, this created the start-up message: welcome, and type init() to set up the session. When you then type init(), it sources a file of commands that is also not general R, but specific to the way I set up projects. This .init.R contains some initialization code, which is a bit specific to the way we run workshops here. It does two things. First, it sources local functions, and it tells you: sourcing local functions from the ./R directory. It uses the function list.files() with the path of the R directory given as "./R". This is a UNIX convention for specifying relative paths: anything within the local directory starts with a single dot, and the directory above is two dots. You actually see the two dots in the file pane; if you click on them, you go up to the directory above your workshop directory. So "./R" means: in the local directory, the subdirectory (or file) called R. This identifies the R folder in which I want to list files. The pattern that I'm applying is: all files that end with ".R". This is a regular expression; we'll be talking about regular expressions a little later. So: from that directory, all files that end with .R, and don't give me just the file name but the full name, i.e. the full path. The R directory contains three files that end with .R: biCode.R, objectInfo.R, and readFastA.R.
So the expression list.files() with path, pattern, and full.names generates a vector of three file names, and I put that vector into the condition of a for loop. So this is my vector, and I iterate my loop over it. For every iteration, one element of the vector is assigned to a variable called script. The loop iterates three times: the first time the script variable is "./R/biCode.R", the second time it is "./R/objectInfo.R", and the third time it is "./R/readFastA.R". And then the command source(script) actually runs that script. Now, what's in a script like that? Everything I put into the R folder is a script that loads one, or potentially more, functions. readFastA.R is a function file. It has a header, some information about the version, its purpose, and a description of the parameters. For example, the parameter fn that's used here is a character variable, the file name of the input file. It has a single return value, here a character vector of the single letters of a FASTA sequence. And then comes the actual function code. So this is the function that gets loaded, and as I source the file, that function becomes known and enters the workspace. This is one way of initializing my workspace flexibly with functions that I keep in my R directory, utilities for the project that I want to use here. Okay. And then in my function files, and this is optional, the function will work perfectly without it, I also put two blocks of utility code. One has examples with which I can remind myself how I intended to use the function; it's kind of like the example section of an R help file. The other has tests, so that if I change something in my function, I can run the tests and make sure all the output is still correct in the way I intended. I'm skipping those here; we might talk a little about testing later on.
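The initialization idiom just described can be sketched in a self-contained form. This uses a temporary directory so the example runs anywhere; the real .init.R points at the project's ./R folder, and toyFun is a made-up stand-in for the actual function files:

```r
# Create a stand-in "R" directory with one function file in it:
rDir <- file.path(tempdir(), "R")
dir.create(rDir, showWarnings = FALSE)
writeLines("toyFun <- function(x) x + 1", file.path(rDir, "toyFun.R"))

# List all files ending in ".R" (a regular expression), with full paths:
files <- list.files(path = rDir, pattern = "\\.R$", full.names = TRUE)

# Source each one; every iteration loads one function into the workspace:
for (script in files) {
  source(script)
}
toyFun(41)  # the sourced function is now available: 42
```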
Now, for those of you who are here for the first time (we already discussed this in the last workshop), this is a bit of an odd construction, isn't it? if (FALSE). Why? What does that do? if is a conditional expression, but what does if (FALSE) do? And why am I putting things like my usage examples in an if block with if (FALSE)? What do you think? Right: when we source this file, that block is not going to be executed, but the examples are still there to run later. Exactly. A conditional expression is executed if the condition is true. Usually we put some test into the condition: if length is greater than zero, if the number of files is exactly 25, if the expression value equals 100, something like that, and then the conditional expression is executed based on that. It gets executed if the condition is true. Now, if I write the literal FALSE, the condition can never be true, so the block is never executed. I'm writing a block of code that is not executed. Why? Well, when I source this file, I don't want this code to run every single time I start up my project. But I still want the code, because when I edit my file and change my usage examples, or finally get around to writing my tests, I put that code in there, and then I can simply go in and execute it line by line: define a file, write some lines into this temporary file, use the readFastA() function, and check that it actually works. But that doesn't happen whenever I source the file, because I'm hiding it from execution in this block. So this is the way I tend to set up my R code: write single functions into files, put them into the R directory, and be a little bit obsessive about actually writing down what the purpose is, what the parameters are expected to be, and what the result is. In these simple examples, it's easy to remember.
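The if (FALSE) idiom in its smallest form: the block stays in the file for manual, line-by-line execution, but is skipped whenever the file is sourced:

```r
answer <- 1
if (FALSE) {
  # never runs on source(); execute these lines manually when needed
  answer <- 99
}
answer  # still 1
```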
In more complicated examples you won't remember, and you will be tearing out your... that person half a year from now, you yourself half a year from now, when you've forgotten what you were doing at the time, is going to tear out their hair and curse past-you, because you didn't write comments and you didn't write proper function headers. So don't be that person. Comment your code. And if you think this is useful and want to adopt it, there's a function template here which you can just copy and paste, or Save As. Now notice that this file has nice syntax coloring, while the function template does not. That's because it doesn't have a .R extension, so RStudio doesn't recognize it as an R script file, even though it is one. And I did that because the init code just picks out everything with the extension .R, and if I called this functionTemplate.R, it would get executed on startup. I was thinking for a while: should I special-case this and remove it from the list, or name it differently? You have to balance your options sometimes; one of the two has to happen, because I don't want to execute the function template. I can put in a little special case that says: if the file name is functionTemplate, don't source it. Or I can just name it differently. I think naming it differently is slightly more general. Special-casing code: if you ever find yourself doing an analysis and needing to write special cases to handle odd situations, maybe it's time to step back and think in a more principled fashion about what your code does. It's usually more a symptom of bad concepts than of something you actually need to do. So that was the init() function and how all of this works. I think we'll get to everything else as we go along. The scripts folder has some scripts to work with, and similarly it has a script template; you could use that to set up your own project scripts at home.
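In the spirit of the function template (the actual template file in the project may look different), a documented function file might look like this; readLetters() is a made-up example in the style of readFastA():

```r
readLetters <- function(fn) {
  # Purpose:    read the first line of a text file and return its
  #             single characters (hypothetical example function)
  # Version:    1.0
  # Parameters: fn   chr   file name of the input file
  # Value:      chr  vector of single characters
  s <- readLines(fn)[1]
  strsplit(s, "")[[1]]
}

if (FALSE) {  # usage example, hidden from source()
  tmp <- tempfile()
  writeLines("ACGT", tmp)
  readLetters(tmp)  # "A" "C" "G" "T"
}
```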
There's a header there, stating purpose, version, date, author. Usually we start scripts by setting the working directory to whatever your project directory is, if it's not automatically set that way. Then, loading packages. Let me say something about that: you will encounter this idiom for loading packages, and I use it quite frequently. I think I've discussed it in the preparatory workshop, but it's worth mentioning here because it can look confusing. When you want to use a package from CRAN for the first time, you have to install it on your computer. That system is extremely well done and extremely valuable. The people working for CRAN, the Comprehensive R Archive Network, or for Bioconductor are extremely knowledgeable and extremely fastidious in making sure that only tested, validated code, code that actually works and doesn't install ransomware on your computer, gets distributed to researchers everywhere. So CRAN and Bioconductor are extremely useful; there are literally thousands of useful packages there. But you need to install them on your computer to extend and customize R, and we'll be doing that quite frequently. Now, after you've installed them, they live on your computer, and you need to load a package in every new R session. RStudio allows you to do that from the packages pane: you can scroll through the packages that are available on your computer, and if you check one of the checkmarks, that package is loaded with the library command. So: I install a package once, and then I load it as a library. Now, these are scripts that I source, and I might source them every single time I start up my project. If I put install.packages and then library into my script file, I would be accessing the internet and potentially downloading a few tens of megabytes of data every single time I source my script. That's not a good way to work. So we use a different paradigm instead.
There's another function which is very similar to library. It does almost exactly the same thing, loading an installed package. It's called require. So require loads packages just like library does, but require has a return value. The return value is a logical, and it is TRUE if the package was successfully loaded and FALSE if it was not. Most commonly, the reason a package was not successfully loaded is that it does not exist on the computer. So what we do here for running the scripts is we say: require seqinr. If seqinr does not exist, this returns FALSE. The exclamation mark in the conditional expression turns that FALSE into TRUE, and then the contents of the if block are executed, which is: install it, and then load it. So again, in this way I keep my scripts from doing things I don't want them to do when I source them, but I can make sure that everything I need is actually there. If I execute this, not very much visibly happens, but the package seqinr is now loaded. What if you have it installed already? Will it just automatically load it then? If I source the script and I have it installed, it will get loaded, and then the commands are available. That already happens through the require command, so I don't even use the library command at that point; require does it. But in that way, when I source my script, I don't install the package anew every single time. Another way to write this is the following: just comment out the install.packages line but keep it there. Again, if you run your script, the installation does not happen. If the package doesn't exist, your script will stop working at that point; as you step through it, R will complain that the package doesn't exist, and you can then install it manually. So the require idiom is a way to do it automatically, but you can also just handle it manually.
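The idiom described here, as a sketch; seqinr is the package used in the workshop, but any CRAN package name works the same way (note that the install branch downloads from the internet when the package is missing):

```r
# Load seqinr; install it first only if it is not on this computer.
# require() returns FALSE (with a warning) when loading fails,
# so the install branch runs only when the package is missing.
if (!require(seqinr, quietly = TRUE)) {
  install.packages("seqinr")
  library(seqinr)
}
```

Sourcing a script containing this block is cheap when the package is already installed: require() loads it, the condition is FALSE, and install.packages() is never touched.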
The point I'm making, though, is: don't put this expression into scripts that you source frequently, because that would install the package every single time, which is probably not what's intended. I think that covers what's in the box here and how it works. If anything about that is not clear, do let me know. It's kind of useful; it captures a lot of experience about how best to set up projects and work with them, so maybe it will also be useful for your own work later on. Good. We have some time before the coffee break. Excellent. Let's get started with some actual work with R and exploratory data analysis. This is the file reda-introduction.r, and we'll work with that. So once again, try not to edit this file; if you do edit it, save it under a different file name. We're going to update it later on. Now, there are some files that are specific to this particular module. In the assets folder, we have two papers. The project lives on GitHub, but these papers are not openly licensed, so I can't just put them on GitHub like that. I've put them into a zip archive instead. So if you double-click a paper to open it, the zip archive will ask you for a password. What's the password? Well, you don't know, right? So what do you think the password would be? What would be your first guess? Not "password". No, that's the name. That's me. Something short, something memorable. CBW. So just type CBW, and this opens the paper. Very cool. It's actually almost ancient in the field, like, wow, four years old. Do we even look at papers that are older than six months? But it's very nice: Jaitin et al., "Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types". One of the first papers that actually used single-cell expression analysis to do something useful.
And we'll look at a few data sets that were covered by that paper. Oh, which reminds me, An: when this gets edited, could you take the little bit out of the recording where I actually tell them the password? Just blank that, because this is a Science paper and we shouldn't just hand it out to people. Is it not really available yet? No, it's in Science. Well, for us, yes, through the U of T library system it is. Aren't they public after six months or a year now? Yeah, it does say open access right on the paper. It does say open access right on the paper? The NIH papers, of course, all go open access. I accessed it this morning and it wanted a subscription. Really? Yeah. Okay. Weird. Well, try it. Anyway, you have it here; we'll look at it a little more later on. So that's the Jaitin paper. There's another paper, "Beyond bar charts", which is interesting to read. This is basically additional information: when we have checkpoints and most people are still busy with something you've already finished and you're getting really, really bored, you can look into that paper, but I'll say more about that. There are a couple of data files, which we will be using for tests. We'll be working with the file objectinfo.r, and I've already mentioned the function template in the R folder. In the scripts folder there are two files, a reference for R programming and one on regular expressions, which are also background information and which we load. There's the script template that I noted, and there's a file called unit testing.r. So these are specific to this module. While you're waiting for others to finish a checkpoint, here are some suggestions. You can read the paper by Weissgerber et al., "Beyond bar charts"; it's also a password-protected zip with the same password. Then write yourself a little R script and start trying to work out how to implement the suggestions made in this paper, which are good, as an R function.
It's quite different from your default options for plotting things in bar charts or box plots, but with what you've learned in the workshop, you should be able to implement what they suggest. You can also study the testthat package. I don't think we'll have time to really go through testthat and write tests. But it should be said that for any kind of reproducible research, it's not enough to write code; it's important to make sure that your code actually does what you think it should be doing, and to test and validate that. The proper way to do that is to write a number of tests that you keep with your code and that you run every single time you change something, to make sure you've not inadvertently broken anything. In particular, if you do find a bug in your code at some point, write a test that would have discovered the bug had you realized the situation earlier on. That's a really sound principle for continuously maintaining your code: you work with it, you might find a bug or add functionality, and when you do, you write a new test that covers the new situation. This is going to be very helpful. How to use testthat is explained in a little more detail in the script unit testing.r, which you can work through in the downtime while you're waiting for other things. And you can also, or should at some point, work through the script rprregx.r. We're going to use regular expressions from time to time, basically just in the most basic version. Regular expressions are really, really important. They have a bit of a learning curve, but once you get through the first obstacles, you can't understand how you could ever have lived without them. They're so useful. So get familiar with regular expressions.
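The testing workflow described above can be sketched with testthat, assuming the testthat package is installed; the revComp function here is a hypothetical example, not from the workshop materials:

```r
library(testthat)

# A tiny function and the tests that guard it. Re-run the tests
# after every change; a failing expectation pinpoints what broke.
revComp <- function(s) {
  # reverse-complement of an uppercase DNA string
  chartr("ACGT", "TGCA",
         paste(rev(strsplit(s, "")[[1]]), collapse = ""))
}

test_that("revComp handles a simple sequence", {
  expect_equal(revComp("GATTACA"), "TGTAATC")
})

test_that("revComp of the empty string is empty", {
  expect_equal(revComp(""), "")
})
```

If you later found that revComp mishandled, say, lowercase input, you would fix it and add a test_that() block for exactly that case, so the bug can never silently return.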
We're not going to do too much with regular expressions in this workshop, just a little bit, but a lot more is said in that script, which is here for your enjoyment. So let's start seeing whether we're all set up. Typically we'll have these commands here in this working script. Execute getwd(). It should be the directory that you recognize as your workshop directory and the project directory. If we list files, we should see the file names of all the files that we also see in the files pane. Do we see that? Is that correct? Are all the files there? Do you see all the files when you execute the command that you also see in the files pane? You're missing the REDA regression file. I see that here. Oh, but you don't have that. Yes, yes, because it's not uploaded to the project directory; it's just something that lives in my folder. But there's another difference: I don't actually get all the files. Which ones are missing? The subdirectories? Right, I don't get the things in the subdirectories. This is not recursive. I don't remember, does list.files even have a recursive option? What do I mean by a recursive option? An option to have it descend into subdirectories as well. I don't know, let's look: path, pattern, all.files, full.names, recursive... recursive, a logical. Well, that's nice. And by default, recursive is FALSE. With recursive = TRUE, I now get all of my files. And what's also nice is that I also get the path prefixed to them. Okay, I didn't know we could do that. Nice. Anything else that's missing? What about init? It's not here, is it? Why? Right: it starts with a dot. In most operating systems known to man, file names that start with a dot are not displayed in a directory listing by default. These are hidden system files, and your operating system doesn't want you to see them. Not because they're frightening, but because you might be tempted to edit them, and that's generally not a good idea.
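The recursive behavior we just looked up can be shown in a self-contained sketch, using a throwaway directory so it runs anywhere; the file names are invented for the demonstration:

```r
# Build a small directory tree to list.
d <- tempfile("demo")
dir.create(file.path(d, "R"), recursive = TRUE)  # creates d and d/R
file.create(file.path(d, "notes.txt"))
file.create(file.path(d, "R", "init.R"))

list.files(d)                    # top level only: "notes.txt" and "R"
list.files(d, recursive = TRUE)  # descends: "notes.txt", "R/init.R"
```

Note that with recursive = TRUE the entries come back with their subdirectory prefix, which is why they can be handed on to file-reading functions relative to the project directory.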
RStudio is an editor, designed to edit things, to touch them and change them. So it would be kind of ridiculous if it didn't show you all the files that exist so you can access them, change them, and edit them; hidden files and directories do appear in your files pane. But by default, list.files does not list files that start with a period. If you do want them, you have to specify all.files = TRUE. This is a logical value: if FALSE, which is the default, only the names of visible files are returned; if TRUE, all file names are returned. We've got that example here. So, list.files with all.files = TRUE shows me all the hidden files. The first hidden file is a single dot, which means the path to the local directory. The second is a double dot, which means the path to the parent directory of the local directory. Then I have .DS_Store; this is something my Mac produces whenever I change folder contents. Now, RStudio doesn't show me that; it actually hides .DS_Store. There's .git, which is the folder that contains the Git project information. Again, RStudio hides that from me. It says: you have no business going into that folder. Which is very good; you never want to touch your version control internals by hand. .gitignore, on the other hand, is a file that can and should be edited: it lists all the files in your directory that should not be under version control. And then there's the init file, the .Rprofile, the R project file itself, and so on. So when I said RStudio shows me all the files, that's not entirely true: it does not show me some of the files that it's pretty confident I should not be touching and messing about with. In principle, what you can do is simply go to Commit in the version control pane. This will list all the files that are not up to date. Everything that's not listed there is under version control and up to date.
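The all.files behavior can be demonstrated the same way, again in a throwaway directory; the .Rprofile name here is just an example of a dotfile:

```r
# One visible file and one hidden "dotfile".
d <- tempfile("hidden")
dir.create(d)
file.create(file.path(d, "script.R"))
file.create(file.path(d, ".Rprofile"))

list.files(d)                    # "script.R" only
list.files(d, all.files = TRUE)  # adds ".", "..", and ".Rprofile"
```

There is also a no.. argument, defaulting to FALSE, which is why the single-dot and double-dot entries show up alongside the hidden files.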
So you can't directly list the files that are under version control, but you can do the opposite: by just going to the commit pane, you will see everything that is not under version control, because everything that is under version control was downloaded from the master repository and is up to date unless you edited it. If you did edit something, it will have a little M icon, for modified. These files here in yellow never got under version control; I have them on my computer, but I never committed them. But don't be frightened by that. I'm basically just mentioning version control, and that there can be conflicts, to prepare you: if you do edit something, you'll get a warning message and the update will stop, and at that point we just need to recover. Okay. Similar to list.files, there is list.dirs for directories. Now, before we can actually get into exploratory data analysis and talk about why it is so much fun and really cool, we first need to load some data. These days there are so many different data sources. The web: either large databases that are gentle and well-behaved and offer data for download in easily understandable, well-defined formats, or things that just appear somehow oddly on web pages, so you need to go through convolutions to get them off the page. Text files. And very much data is in Excel spreadsheets. Anybody here who does not use Excel spreadsheets? See, everybody uses them. It's ubiquitous. Actually, one of the files that we'll be working with here comes from an Excel spreadsheet in the supplementary data of the Jaitin et al. paper. So we'll explore some data from the supplementary material published with this relatively recent paper on single-cell RNA-seq analysis. So: unzip the Jaitin et al. paper and look at what it is all about. Orient yourself a little bit. It's quite interesting, and not very involved.
And then we'll talk a little bit about how we get the data into R, and how to get things from text files. I think there was at least one person in the room who actually works with blood. Right. Two, three. Okay. Did you know this paper? No, I'm not really familiar with it, but... Okay, great. So I'm going to rely on you: whenever I say something wrong, just speak up and correct me. So this one is... oh, both of them are open access? That's fantastic. Then we can just change the zip-file arrangement, but I'll keep them as they are for now. Good. Right, so data comes from very, very different sources. Something I'd like to focus on is a list of gene names that we find in this figure, and to talk about how we get something like that into R. There's a list of gene names; we find it in a paper. Well, if we're lucky, that list appears somewhere in the supplementary material and we can download it. In this case, we're not lucky: it's not there. If we're lucky and the authors are good friends of ours, we might get the list in an email. I didn't ask; I probably would have gotten it, but I didn't ask. Now, this is a PDF file. Sometimes we can copy things from PDF files and then paste them. So, can I copy this here? Nope. Why can I not copy this? Because this is not actually text; it's an image. At that point they've produced an image, and I'm stuck: there's no really good programmatic way to get that data. So what I ended up doing is: I made a screenshot, selected only this text, and ran it through a program for OCR, optical character recognition. Lo and behold, only about 10% of the gene names I got from optical character recognition were even correct; the rest had errors. So, just as a practical matter: if you have a list of text and you need to check whether it's correct, how do you do that? You need a reference set. Right. The reference set is in this image here. So I have text in my text editor, and I have this reference set.
And now I need to compare whether what I have in my text file is the same as what appears here. How would you do that? Just go line by line. Ask your grad student to do it. Send it out to Amazon; do they still call it the Mechanical Turk? For very little money, you can have humans do tasks like that, and if you have it done several times and then average the results, you can get very high-quality data. If this is a very large list, you don't do any of that; you actually contact the authors. But for a small list: have your computer read it out to you. You can listen to what your computer reads while you check it against what you see. That's a much better approach than trying to visually match column by column. Anyway, I did that. The file exists for you to use; it's in the data folder, figure three characteristic genes dot txt. If you do find an error in it, no, I'm not going to buy you a beer, that's not the deal, but do let me know. So our first task is: open this text file and get it into a vector in R. This is a text file; we want a vector that has one element for every single gene mentioned in the file. So, is there a good way to do this by hand, like copy and paste? Possibly. Think about it. Is there an R function to do this? Possibly, I don't know. Can you write a function? Possibly; you can always write a function. So yes, you can do that. Do you have to? Well, we'll talk about that. The scenario is: you've got a text file like this. It has gene names, potentially many gene names. Your collaborator sent it to you the day before going on holiday, you can't get it in a better format, and now you have to read it into R somehow. If you're done and you know how to do this, you are allowed to peek into the file SampleSolutionReadText.R in the Sample Solutions folder. If you're not done yet, do not peek; that would defeat the purpose.
If you're stuck, put up a red Post-it; if you're done, put up a blue one. Does it matter what we produce? I want a vector; in this case, the specification calls for a vector. So maybe we can quickly and jointly look at the options we have here. We have some strings that we pick up from a text file, potentially a large text file, and we want to get them into a vector. Canonically, we get elements into a vector by defining a variable name, something like myV, and assigning something to that variable, and the something we assign has to be a vector. Typically we compose vectors in R with the c() function, unless we use functions that produce vectors as output. But if we assemble vectors by hand, we use c(). So now I need to get these names in there. I could just paste them in, but that would not be legal R; it generates an error about unexpected symbols, because R expects commas to separate the elements. So let's put in some commas. Okay, so I have commas. And then I could run this, and now I get a different error. The error I get now is: object 'CD19' not found. That's the kind of error I've seen a lot in the last two days. What does that tell me? It should tell you: read your error messages. They often tell you something useful; not always, but often. So what does object 'CD19' mean in this context? R is not interpreting CD19 as a string; it thinks it's a variable name. It thinks it's a variable name that identifies an R object, and it complains because I never defined a variable by that name, nor did I want to, because I don't mean it to be a variable. I want it to be a string. How do I turn something like that into a string? Quotation marks. So I put quotation marks around it. Now, if I double-click the word, I select the entire word, and if I then press a single quotation mark, RStudio gives me both quotation marks.
Many of the characters operate as matched pairs around selections: single quotes do that, double quotes do that, parentheses do that, square brackets do that, and curly braces do that too. And lo and behold, I get my vector. myV is now a five-element character vector that contains CD19, CD79B, and so on. For five gene names, this is not too bad. How many before I go crazy? 100? It depends on how desperate I am. Your limit would be 10? Yeah, that kind of makes sense. So is there an easier way? Well, that depends. I have a few sample solutions here. Adding it element by element: that's what it would look like. Actually, I've done this here. Oh, my God. You could also enter it by hand all at once: we can define a single string which contains all of these, simply by writing something like s, then quotation marks, then taking the whole thing, copying it, pasting it in, and pressing return. That now is one string which internally is CD19, line break, CD79B, line break, CD22, line break, and so on. So it has what I need, but it's separated by line breaks within a single string. And I can use a function to split it apart; everybody who was here yesterday knows I'm talking about the function strsplit. So I strsplit my s, and I have to define what I am splitting on, which is line breaks. And I get something very similar, without the need to quote every single word. Or I can use the readLines function, which produces a vector of strings directly. Or I can use other functions, which might produce data frames. Why did I have to use unlist? Jennifer, why are we using unlist? I don't remember. Anybody remember? Okay, so strsplit is a vectorized function which can operate on many strings. The way I've called it, I've used it on just one string of words separated by line breaks. But it can operate on many strings, and the output of applying it to one string is a vector of a certain length.
If I use it on a different string, and that gives a different number of pieces, the output is going to be a vector of a different length. So strsplit needs to somehow put vectors of different lengths together. It can't do that in a matrix, because in a matrix all rows and columns must have the same size. It can't do that in a data frame, for the same reason. It needs to put things into a list; in a list, you can do that. So what strsplit does is: it operates on the first element, puts the result into a list; operates on the second element, puts the result into the list; and so on. If I give it only a single argument, like the one string I had, the result is going to be a list with a single item, which is the vector I need. So the vector I need is contained within a list, and I need to get it out to be able to work with it. And this is why I tell it to unlist. Alternatively, I could have written it with double square brackets, pulling the first element out of the list, which is the vector I need. I sometimes use one idiom and sometimes the other. The differences are a little subtle, but for our purposes both work equally well. So: strsplit creates a list, and I need to get the results out of the list.
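The two idioms just described, side by side, on a shortened version of the gene list:

```r
# One string containing line-break-separated gene names,
# as produced by pasting the whole column at once:
s <- "CD19\nCD79B\nCD22"

x <- strsplit(s, "\n")   # a list with one element: the vector we want
x[[1]]                   # idiom 1: extract the first list element
unlist(x)                # idiom 2: flatten the list into one vector

# Both give c("CD19", "CD79B", "CD22"). With a single input string
# they are equivalent; with several input strings, [[1]] keeps only
# the first result, while unlist() concatenates all of them.
```

That last comment is the subtle difference mentioned above: for one input string the choice is a matter of taste, but as soon as strsplit is given a vector of strings, the two idioms no longer return the same thing.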