So welcome once again. I'm amazed that the room is this full; we actually had to turn people away because the course was fully booked. We couldn't have crammed a single additional seat into the room. And that's quite astounding, because initially, last year, if Michelle remembers, we had a discussion on whether this course would even fly, whether there would be enough interest. And as you see, there is so much need for looking at data in new and intelligent ways. What I take home from your introductions is not only that your data is big. You all have very similar data. And you all have data that you wouldn't even have thought about five years ago. So we're right in the middle of the next revolution in quantitative biology, where the field once again becomes data-driven. And it becomes ever more important to handle it. And it's not just that the data is big; RNA-seq data, as you know, is uncomfortably big in many cases. But that's not the problem. We can easily teach computers how to work with big data; that's just a matter of machines. The problem is really in integrating data in different ways. Your data is diverse. You have numeric data. You have genotype data. You have gene symbol data, so that's text data. And the big challenge is to get all of that together, to make inferences across different data types, and to handle them. This is traditionally one of the fields where spreadsheets are really poor, and this is where you really need to work with programming languages. Now, unfortunately, it's considered (I don't know if this is still true) that biology is the field that you get into if you love science but can't stand math. I don't know if this is true for many of you, or any of you at all. But I find very often that people who take the workshops here do so because there is a little bit of apprehension, a fear, about just starting to program and seeing what happens next. And this is really what I'm trying to overcome here.
This is going to be very, very interactive. I don't have nearly as much canned material as I often have. What I'd like you to leave here with is the idea that we can formulate biological questions, and structure them in a way where we can teach a computer how to answer them, which is programming. So we'll talk a lot about how to look at a question so that we can work with it in R, and then how to do it, even if you don't know R. And that's important, as we'll discuss later, because really, nobody knows R. R is much too big; this is one of its big, big shortcomings. So, right, this is going to be a lot about wrestling with R. That's the theme of a little introductory image here. As for the learning objectives (Michelle always makes me write down explicit objectives): be able to set up an R project, and we'll talk about how to structure files on your disk, where they go, and how to make a sane R environment; be able to structure a computational task as an R script; be able to select, filter, rearrange, and combine data; be able to write functions and programs, and that's really the important part; be able to create simple analyses; and also know where to get help. When I started developing this course, just by habit I took a kind of textbook-like approach, and the result was pretty much very similar to what you might have gone through in the R tutorial you got the link to. Now, I actually need to know, so don't be ashamed: who has spent more than two hours on the R tutorial over the weekend? Who has spent more than 30 minutes on it? So for some of you, R is really, really, really completely new. And the approach that we're taking today is maybe less structured than you would often like it to be if R is completely new to you. So we'll take things slow. But if you find yourself getting lost, it's really important that you ask Catalina or David, or myself, and we resolve the things that you're not clear about.
Because you will need to be programming in R from the very first moment, or writing R and using R. We're not just going to try to tell you things that you can memorize in any particular way. I'd like you to get this: it's like learning a new language. You can study vocabulary all you want, but really, to learn it, you'll have to speak. And this is what we want to do. So instead of going over constants and vectors and tables and programming and plots in a structured way, I threw all of this out. What we're going to do is very different. Because you don't want to become programmers. You might need to know all that if you study programming and computer science, but you want to get some biology done. And R currently is really one of the most important tools for that. What you're really worried about is something quite different: how do I even start expressing my ideas in code? When I see questions coming up on the R help mailing list, I think 90% of the problems could be easily solved if people would spend more time on structuring their ideas and thinking about what it means to make something computable. We often have this reflex: OK, I have a pile of data, I have a data file, and now I take that data and ask on the internet, how do I analyze it? Well, that's not a good question. Having data and then wanting to analyze it is not a question at all. The question needs to be motivated by biology. The question needs to be something like: how is a certain value distributed in my data set? And then we can start asking how we build a process that answers that particular biological question. Well, you might wonder, how do I even get started? I've heard some of you say that everybody around us is using R and we are not using R yet in the lab, and that's scary. So let's get started somehow. But how and where? Then you maybe do a Google search for R and you get 17,300 pages.
And often they're very good; the important links are usually on the first page these days. But still, it's very intimidating. R is large. The R community is large. So whenever you ask a question, you'll have a lot of answers. Oh my god, something happened, what do I do now? You'll do your first thing in R, and you think you know what you're doing, and then R throws up an error message and refuses to go on. And the error messages in R can be very, very arcane; they can have very little to do with what you think the problem might be, and be not very helpful in actually solving the problem. So I hope we'll create a lot of error messages during this workshop and work through them, and give you some level of confidence that things don't actually break. And then of course: how can I remember all of these functions? There are a lot of R functions. But the idea here is really that that shouldn't be the question. There is way too much to remember if you want to remember it all. You should really be focused on the problem at hand. You may need a few basic principles, and an understanding of principles, to get started. But try not to start learning things by heart. Try expressing your ideas somehow, seeing how things go wrong, and then fixing it. And fixing it means looking it up, writing a search query on the internet (you'll usually find very, very good answers for something), and working from there. You'll get more proficient with practice. And you won't get proficient by studying textbook-like material at the outset. So that's what I hope you'll be more comfortable with when we go home. Let's see how it works. It's a new experience for me as well as for you, because it's a completely new course. So we learn R by working with R. And we look at a problem which is, I think, typical of the problems we have, and develop a strategy to solve it. And part of that strategy will involve R.
But more importantly, part of the strategy will involve structuring the problem in the first place. And that, in a way, means learning how to learn. And with R, this is particularly important. The answer to "can something be done with R?" is almost always yes. It's extremely versatile. And not only is it extremely versatile, but many people have posted excellent, high-quality computational solutions. Now, sometimes I find that a bit of a detriment. There's a very simple way to work with R, a very simple coding paradigm, and it's all about writing procedures and putting them together. That's what we'll do today. But there are also rather more computer-science-oriented ways to do this: in a very object-oriented way, with functional programming in mind, with very efficient packages and subroutines that make work with big data very fast and very efficient. And this is not what we'll talk a lot about today, because that's very problem-specific, and I think it often obscures the need to structure your work neatly and clearly in the first place. So we're not at all about elegant R programming. And the paradigms that we'll be talking about today are also often not the paradigms of how to write code that you will find on the web. We'll do something that's very straight and, I hope, very transparent. And there are many, many good reasons to do that, not least of all this: the more you can break things down visibly into single, individual steps, the more easily you can fix it if something goes wrong, and the better you can validate it as you go along. And that's really crucial. The validation of your code and your programs is going to haunt you for the rest of your R career, and it should. It's easy to get a number, but it's not trivial at all to know that that number is, A, correct, and, B, meaningful and logical in the context that you've computed it. So here's what you'll need to do to get the most out of the day today.
First of all, be active. Try to think ahead; work along on your computer. Don't just listen, but try to anticipate what's going to happen. And you should always think: how would I approach this problem? Make the problem your own. Think about how you would approach it, and if you don't know how to approach it, think about what question you need to ask to be able to solve it. Take notes, take lots of notes, write a lot. First, this helps you focus: whatever I or somebody else in the room is saying is not just passing through your mind; if you write things down and paraphrase them, you actually get what just happened into your working memory. Ask, ask a lot. I expect most of what we're going to be doing today to be developed in dialogue with you, with questions and with proposals. If you don't ask, I will ask you; that's often what we do in the lecture room. And then finally, play. If I give you a certain command on the screen and you type it, why not change a parameter? See if it still works; see if it breaks. If it breaks, that's an interesting experience: first, you'll learn what it looks like when something breaks, and second, we can perhaps discuss interesting ways to fix it. And I hope overall you'll enjoy all of this. So let's talk a little bit about biology, just to introduce the paper that provides the context for the questions we're going to ask. I am not at all an immunologist, so I have no background knowledge in this; this is as new to me as it is to most of you. Do we have anybody who works in hematology in the room? Immunology? Oh, wonderful. So we can go to you for more specific questions. Anybody seen this paper before? Well, I emailed about it yesterday, but did you come across this? "Massively parallel single cell RNA-seq for marker-free decomposition of tissues into cell types." When I saw this a week ago, I thought, this is really interesting.
First of all, it's pretty cutting edge that we're actually able to do RNA-seq on single cells; that's very amazing. But this is a massively parallel approach that does something which I find absolutely fascinating. As you know, in immunology, one of the big topics is to describe all of the cells that have to do with the immune system. And there are many, many, many different ones, and they have been characterized in many different ways, usually by the expression of particular markers that define them. Well, this is an alternative approach, and the approach is also general in the sense that it possibly scales to looking at all the different cells in our body. How many tissues do we even have? How many different cell types in the body? Well, that depends on how you look at them and how we distinguish them. And this is one way to do it. What these people did is they looked at gene expression levels by counting RNA-seq transcripts and then approached cellular diversity by inferring variable and dynamic pathway activity rather than looking at a pre-programmed hierarchy. So they looked at what these cells express and how their expression changes, and then asked: are there similarities? So you have a large number of cells, and they all do their thing. You characterize each and every single one of them through the expression levels of a large number of genes, and you also characterize them through their response to a stimulus, in this case LPS, lipopolysaccharide, which is a general inducer of immune responses, and see how their properties change. And that, I think, is very characteristic of many of the types of RNA-seq experiments that we're thinking about. You characterize cells, or tissues, or tumors, or patients with a large set of expression data from RNA-seq, and then you look at how things change. And the overriding question is then: well, if we see this status and we see the response, are there similarities among the cells?
Are there similarities among the tumors, or among our patients, or phenotypes, or whatever you're looking at? And once you define these similarities, you can then take this new knowledge and make biological inferences from it. So in a nutshell, what happens is this. Michelle, is there some fancy high-tech pointer so I can do something on this screen and it'll show up? My mouse will work? Is that visible? Yeah, okay. I should have a laser mouse. So we have spleen cells, and we isolate single cells from the spleen, put them onto 384-well microtiter plates, and lyse and barcode them. Barcoding is the attachment of characteristic DNA sequences with which we can then identify in which well of the microtiter plate the cell was sitting, or where the DNA came from. So in the end, when we throw all of this together, we can trace individual RNA-seq reads back to one specific well in the microtiter plate, and so back to one specific individual cell; this way we can distinguish reads from individual cells through barcoding. Once we've done that, we can identify the individual RNA messages. We can pool them in a single-cell pipeline, amplify them, read them, and do some interesting analysis. Now, the interesting analysis turns out like this. We have here a correlation map of cells against cells, all the, I don't know, several thousand cells that went into this study. And what this correlation map does is say, for all the genes that the cells express, do two cells have similar expression levels? If the expression levels between two cells are similar, we give them a red dot on this plot, and if the expression levels are different, we give them a black dot. So this is the correlation between all the gene expression levels for all the cells. And then we arrange it in a way that makes it easy for us to see similarities. What we get here, after rearranging, is this block structure.
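Just to make the idea concrete, here is a minimal sketch of such a cell-by-cell correlation map in base R. The counts are simulated (nothing here comes from the paper's data); two invented "cell types" share within-group expression profiles, and reordering the correlation matrix makes the block structure visible:

```r
# Sketch: a cell-by-cell correlation map from simulated count data.
# The counts are made up for illustration; real data would replace 'counts'.
set.seed(42)
nGenes <- 100

# Two invented cell types: cells 1-30 share one expression profile,
# cells 31-60 share another.
base1 <- rpois(nGenes, 20)
base2 <- rpois(nGenes, 20)
counts <- cbind(
  sapply(1:30, function(i) rpois(nGenes, base1)),
  sapply(1:30, function(i) rpois(nGenes, base2))
)

cc  <- cor(counts)                    # correlation between all pairs of cells
ord <- hclust(as.dist(1 - cc))$order  # rearrange so similar cells sit together
image(cc[ord, ord], main = "cell-cell correlation, reordered")
```

After reordering, cells from the same simulated type form high-correlation blocks along the diagonal, which is exactly the visual effect described above.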
What this block structure means is that there are groups of cells among the several thousand cells that have been sequenced, groups that look very similar to each other but different from other cells. So each of these blocks is presumably one particular cell type or tissue type. And in this way, in a completely automated fashion, we can start looking at how many different cell types or tissue types exist in the sample. The researchers then went on and did an experiment where they stimulated cells, which is summarized in this little plot here. So the experiment here is: after sorting cells, identifying them, and assembling them into groups based on this initial discovery of correlations between individual cells, you stimulate them with LPS. Why? What's LPS? Can you give us a five-second primer on why we use LPS and what it does? No, behind you; you're one of our immunologists, right? Yeah? Right. So we all have an innate immune response and a learned immune response. But if you give immune cells LPS, they think they recognize the presence of bacteria and they become active. So it's a general, non-specific stimulator of an immune reaction. And some cells are expected to respond to that, and some cells are expected not to. So for example, what we see here is that these cells are characterized by high enrichment of expression values in these genes here, and no enrichment of expression values in these genes here. But if you give them LPS, nothing happens. The highly expressed genes stay on, the poorly expressed genes stay off, and LPS does not induce them. Macrophages, on the other hand, respond quite strongly. There's this group of genes here which are borderline or poorly expressed under resting conditions, but they all become highly expressed when we add LPS. And monocytes similarly: highly expressed here. Now, is there something generally known that macrophages and monocytes have in common that B cells don't?
Well, they're from different branches of the hematopoietic tree, the tree of descent from hematopoietic stem cells as they differentiate into the different lineages of B cells or other cells. These two come from the same type of precursor cell. And so what we see here is that our knowledge about how cells develop and differentiate is actually reflected in this behavior, knowledge which did not go into building this classification in the first place. So discovering these relationships was done without any prior knowledge of hematology. It just comes out. In fact, the analysis says that we believe these cells are similar to each other and different from others in particular ways, i.e. in that they respond to LPS. And lo and behold, if we do the experiment, we actually find that that's true. It's actually quite encouraging when this happens from time to time in science. So that's actually pretty neat: an automated way to distinguish cells from each other without any prior knowledge. And this is of course important, because how do we know that our prior knowledge, our preconceived notions about how the body works, is correct? It's much better if we have an unbiased, completely general approach; then we can look into the data, develop hypotheses, and indeed find out that we were correct, or generate new knowledge. For example, discover new tissue types that we didn't even know about, or get new information about how tissues that we thought we knew very well actually behave in practice when we have a closer look. So I hope this kind of thinking and this kind of approach resonates with the kind of questions that you're interested in. Now, at this point, we can start asking questions. So for example, here's a part of the resulting images. The researchers automatically classified the cells into different types and used a computational procedure to differentiate them. It's basically a type of cluster analysis and cluster plot.
So what you can kind of think about is: you identify the cells that have these expression patterns, and then you find a way to plot them on some kind of 2D plot, whatever that plot is. The clusters here then correspond to the blocks that we saw, or the gene-cell enrichments that we saw. And in order to validate that, they then did a traditional flow cytometry experiment: they characterized a few cells by flow cytometry and sequenced those too. Characterizing a cell by flow cytometry means you incubate the cells with some labeled antibody that recognizes a cell surface marker, and then you send them through a machine that recognizes the label and makes sure they land in different pots; then you can do something with them, like sequence their RNA, and thus look at the expression levels. So for example, if you do this with cells that are CD19 positive and B220 positive, which are two antigens, and T cell receptor beta negative, and then you sequence them, they all fall into this spot here on the plot from our initial unbiased analysis. So this means all these red dots correspond to known B cells, because known B cells would have this complement of surface antigens. And we see that these known B cells actually fall into one of the clusters that we've defined computationally from looking at all of the genes. So this means these surrogate markers correlate with the internal state of the expression of many genes in our cells. And you can do this with different markers: for example, for natural killer cells, which cluster into this spot very nicely, or for plasmacytoid dendritic cells, which cluster into this spot here, or for monocytes, which don't cluster very well but are seemingly more diverse in their properties.
And in every case, it seems that the traditionally known and described cell surface markers correlate very well with the clusters that the computational analysis stemming from RNA-seq has defined. Now, for example, we can then ask: well, are these markers actually expressed in the cells? Can we see that the cells which cluster into this group here have a high expression level of CD19 and B220? How would we do that? That's one question that we could ask of the data which has been published with the paper. We could study the figures they gave us very, very carefully and try to understand their conclusions. For example, for all of the genes that are in this plot here, every single horizontal line corresponds to a single gene. We could then start asking: well, they've only labeled a small number here, but what are these genes overall? Can we identify them? We'd be, for example, potentially interested in whether all these genes are co-regulated with each other, and in understanding something about regulatory networks. So we can start asking these questions, but of course the authors haven't labeled them. Can we get this information from the supplementary data? Well, for example, we might see that there's a group here in this row that doesn't have any labels at all. What are these genes? What do they correspond to? They all seem to be highly expressed in the resting state and then switched off when the cells are challenged with an immune stimulus. So are these housekeeping genes that the cells need to switch off when they go into their active immune state? Or are they active suppressors of immunity, which might be quite interesting? I don't know. But can we have a look, from the data that's published? Well, that may be an interesting question to ask.
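The marker question above ("is CD19 highly expressed in the cells of this cluster?") can be sketched in a few lines of R. Everything here is invented for illustration: in practice the expression values and the cluster labels would come from the supplementary data and the clustering, not be simulated.

```r
# Sketch: is a marker gene enriched in one cluster of cells?
# The counts and cluster labels below are simulated, purely for illustration.
set.seed(1)
nCells  <- 90
cluster <- rep(c("B", "NK", "mono"), each = 30)

# Simulate CD19 counts: high in the hypothetical "B" cluster, low elsewhere.
cd19 <- ifelse(cluster == "B", rpois(nCells, 50), rpois(nCells, 2))

# Compare mean expression per cluster:
means <- tapply(cd19, cluster, mean)
print(means)

# A simple check: is CD19 highest in the B-cell cluster?
names(which.max(means))
```

The same pattern (values split by a grouping vector, summarized with `tapply`) works for any marker gene and any cluster assignment.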
Or, for example, we see that there are clusters of cells that express particular genes at a high level, and the question then might become: are these genes functionally similar or dissimilar? Are they actually co-regulated? Does the observation of significant patterns in the expression levels actually correlate with our knowledge about gene function? And so on. So there are a lot of biological questions that we could ask, and I'd like to explore them with you today. But of course, the major obstacle is that we need to work not only with the supplementary data but with potentially diverse additional data sources, and the data comes in very, very different formats. Much of the data you will find in Excel data sheets. Some of the data might be in the handwritten notes of your grad student, or simply in text files. Or you might have to work with PDF files and extract data from them. Or go to sources like the Gene Expression Omnibus website of the NCBI, which hosts microarray data, or gene co-expression databases that have analyzed gene expression data sets to find genes that are co-expressed with each other, and so on. So a lot of the tasks that we have to do have to do with data integration. And the first task always is: how do we even get our data into the computer in the first place? This is usually the most significant obstacle. So, what do we do first? Well, let's ask this question. Let's just try to download the supplementary data that the authors have worked with, get it into R, and then begin analyzing something about the genes that are enriched for expression under different circumstances, or are not enriched for expression. So the biological question here is: are the known markers of this cell type expressed as expected? So, how would you attempt this problem? What do we need? Michelle, do you think we need both blackboards here? Can I erase one of them? Is there an eraser? Yeah. Oh, here, thank you.
So, how would you structure this question? Does it even make sense? Was it clear what I'm asking here? How would you structure this? First, we have to define what we're looking at. So, what do you think we should be looking at? We'll see if we can find the data, but what do you think we should see? The raw reads for individual cells? Well, this is a little different from the raw reads: if you're looking at counts, you've already taken your raw reads and associated them with genes. So we would like to have expression counts. In some way that would be a long list of genes, gene ABC and PQR2 and LY5, or whatever these genes are, the 21,000-something, or 25,000 if you include RNAs, or many more if you look at differential splicing, and then some value: 85, and 42 of course, and 11,350. So this is the type of data which initially we would like to see, right? Genes and values, yep. I just want to ask about the circular plot that we talked about before. The seven categories: basically, the shading reflects the same thing? The seven categories were the same as the cell types that we saw earlier? Well, that's the conclusion we draw. Initially, this is just the data clustering: you arrange the data here for cells that seem to have similar expression profiles. And now you take markers for these cells, you select them by flow cytometry, and each of these cells, which have then undergone the same RNA-seq procedure, seems to have expression profiles which are similar to some of the clustered cells we've seen before. So from this we can make the inference that the cells which ended up in this cluster are also B cells, which we wouldn't otherwise know because we didn't label them with cell surface markers. Now, the question I'm trying to pursue is to ask: well, is CD19, which is a B cell surface antigen, expressed? Is it significantly enriched in this group of cells, or is it not?
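The shape of data we're after, genes down the rows and one count per cell across the columns, can be mocked up directly in R. The gene names and numbers here are just the placeholder values from the blackboard sketch, not real data:

```r
# Sketch of the data shape we want: one row per gene, one count per cell.
# Gene names and numbers are placeholders, as on the blackboard.
counts <- data.frame(
  gene  = c("Cd19", "Abc", "Pqr2", "Ly5"),
  cell1 = c(85, 42, 11350, 7),
  cell2 = c(90, 38, 9800, 3)
)

# Look up one gene by name:
counts[counts$gene == "Cd19", ]
```

Once the real supplementary table is loaded into a data frame of this shape, the same one-line subset answers "what is the count for CD19 in these cells?".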
So it's a kind of validation question there. Yes? You've also set some kind of threshold, because, well, what you say is "in their cluster", right? So if you look at the monocytes, they seem to be scattered all over the place. So you have to say: what's the cluster of monocytes? And if you're looking at B cells here, you have to say: do those ones that are sort of scattered down at the bottom also count as B cells? Can you set some kind of threshold for how close they are, or how tightly they sit in that cluster? Well, that's an interesting question, but that goes very much into exploratory data analysis; will you be here for the EDA workshop? In principle, the answer is no, we don't seem to be able to have complete, sharp thresholds. We can just define some kind of boundary between what our algorithm has shown us to be similar data. But that's fine, because as the scatter plot here shows you, there's a lot of variation in the natural processes anyway. And it's not such that all of the monocytes will only look like this core spot. They will have different properties, and the differences may be as biologically significant as the similarities. So with the B cells, it seems to be a lot clearer; they focus very well. With the natural killer cells, it's not so clear. There's a cluster which correlates with the ones we thus label to be NK cells, but some of them would be indistinguishable from monocytes if we weren't looking at the cell surface antigens. So the cell surface antigens are different, but by and large it seems that their expression profiles, the enrichment, are quite similar. So then we can ask: well, these cells here, their expression profiles are similar, but what makes them different cells? What makes one of them react to LPS the way a monocyte should, and another not react, and so on? But let's try to really approach this step by step.
In order to ask very clearly, is CD19 highly expressed in a certain group of cells, what do we need? A list of genes, right? Right, so presumably in this list of genes you would find CD19 somewhere. And we'd like to know: what is its status? What's the number we've associated with it? And we'll have numbers associated with different individual cells, so 1,000 individual cells, and we'd like to ask: are these numbers characteristic of the ones that we've grouped together? So what's the very first step that we need to do? Okay: trying to see if, among the genes that we know are upregulated, the genes annotated as B cell, we'd have CD19 somewhere in there. In principle, yeah, this is exactly what we would be trying to do here: asking if, in the list of enriched or upregulated genes, we can find CD19, and whether its expression level is high. So the very first step is that we need to get the data into our computer. Yeah? So you get the data, you can add the cluster assignments on top, and then map that onto your genes? So to get there, we'd need this for the different cell types. Or you could form the cluster around CD19 and see what comes out; are we saying we know it's CD19 positive? No. Sorry, I may not have been clear enough. These red dots here are a flow cytometric analysis done after the other analysis; this is a second step of validation. So initially we have these black clusters here, we've clustered the data, and now we're trying to validate what these clusters are. We do a separate experiment: we label cells with antibodies, and we find those that we call B cells because they have B cell characteristic antigens. And then we see whether the data we get from a B cell coincides with the clusters that we got from all cells. So this is the conceptual validation step, showing that the clustering actually gives us something of biological relevance. Right, because initially you can cluster anything, as we'll see when we do cluster analysis.
Cluster analysis is really easy. You take some data and then you get clusters. But are they meaningful? Mathematically, in a sense, they are meaningful, but are they biologically meaningful? That's what this experiment tests. As for how it's being plotted: we'd have to dig deep into the methods and supplementary analysis. It's a special way of plotting these particular clusters; I haven't seen it previously, and it's maybe something they developed for this purpose. Did you come across this approach? Essentially, you identify clusters and you put the clusters on a circle, and you have some function such that cells similar to a cluster come close to its point on the circle, while cells whose properties are in between clusters fall in between. So it's just one way of putting multi-dimensional data into two dimensions so that we can visualize it, like the clusters that you would get after you do principal component analysis and then plot things in two dimensions. This is super high-dimensional data: each cell is characterized by a vector of, I don't know, 1,300 differentially expressed genes. But we only want to plot it in two dimensions so we get a nice picture we can start from. That's what this plot is. Many other ways to plot things would be possible; perhaps, if we make good progress, we can try to take the data and look at it in a similar way. Right, so step number one is usually always to load the data, and actually that's what we'll spend a lot of time on. Step number two is then to identify the data for one gene and print it out, where this one gene could be different genes: not just exactly one, but maybe lists of genes, or whatever. And then we can look at the values in our dataset. So as far as computational procedures go, this one is simple, and we'll start with a simple one. And that's good. Now the question becomes loading data. This is where we actually start R.
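The two steps just named, load the data, then pull out one gene and print it, look like this in base R. To keep the example self-contained, we first write a tiny tab-separated file ourselves; a real supplementary table would be read with the same `read.delim()` call:

```r
# Sketch of the two-step procedure:
#   step 1: load the data
#   step 2: identify the data for one gene and print it out.
# We write a small tab-separated file first so the example is self-contained;
# a real supplementary table would be read the same way.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("gene\tcount",
             "Cd19\t85",
             "Actb\t11350"), tmp)

dat <- read.delim(tmp, stringsAsFactors = FALSE)  # step 1: load
oneGene <- dat[dat$gene == "Cd19", ]              # step 2: select one gene
print(oneGene)
```

For a list of genes instead of one, replace the `==` comparison with `dat$gene %in% myGenes`, where `myGenes` is a character vector of gene names.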
You can either work along in R or in RStudio. For those of you on Windows machines, I would probably recommend RStudio, simply because it gives you the syntax highlighting that the Mac graphical user interface of R has anyway. I'll probably be using the plain R interface, but the RStudio interface looks very similar, and maybe I'll just be switching back and forth between them. Right, so this is where we actually start with the R code. Now, if you could go to the workshop wiki: there's a file here, 2015 intro module one first steps dot R, this one here, and I would like you to save it to disk. Well, let's click on it first. The first thing we need to do is set up our workspace. Whenever we begin a new project, we should probably create a new directory and put our scripts and data into it. So somewhere on your computer, I would hope, you've already created, or begun to create, a directory which could be called CBW or something else, a subdirectory somewhere on your hard drive. This was part of the homework. Now, setting up a folder like this is really the minimum that you should do for any kind of computational question that you have. I would really never start working ad hoc; getting the organization done at the beginning helps you a lot in structuring things, so that you will ultimately find them and be able to reproduce them and work with them in the future. You can certainly enhance this later on with other subdirectories, which we won't do today. Typically there would be one directory for data, where all your raw data files go so they don't clutter your directory listing, and there could be a separate directory for documentation, and so on and so on. But initially you should have a project folder somewhere on your computer to work with. Okay, now I do the same thing on my computer. I now have a folder here which I've just called CBW.
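If you prefer to set the skeleton up from within R rather than in the file manager, it could look like the sketch below; the folder names are just the ones suggested above, and "doc" is my own placeholder for a documentation directory.

```r
# Minimal project skeleton, created from the R console
dir.create("~/CBW")          # the project folder itself
dir.create("~/CBW/data")     # raw data files go here
dir.create("~/CBW/doc")      # documentation (placeholder name)
list.files("~/CBW")          # check: should list "data" and "doc"
```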
When I open RStudio, I can then create a new project, and I use the choice "Existing Directory" to associate the project with an existing working directory, and that working directory is the one I've just created. Now, in your R tutorial there was a little section on files that R executes when it first starts up. Specifically, there's a file called .Rprofile that sits in your home directory, and when you start up R, it executes the things in that file. So for example, this is my .Rprofile file. In this .Rprofile I can put things that I want R to remember all the time. One thing that I've done is change the prompt for R, which is probably just a single angle bracket for you; in my .Rprofile I have an R prepended to that angle bracket. So my R prompt here looks a bit different from yours, in that there's this R which you don't have. The reason is that I've put that into my .Rprofile, and when R starts up, it reads the .Rprofile and changes the setting. One thing that is useful in such a profile file is a shorthand notation for the actual directory where your R files sit. Not so much for the kind of work that we're doing here now, but later on, when you will be sharing scripts with collaborators: everybody might have the same project files, but typically the project files would be in different places on their computers. Now, the first command that I always put into a script is the R command setwd(), which you've probably come across in the tutorial: set working directory. This sets the working directory for R to be the folder where you want it to be. That means if you open files, or if you save files, they get opened or saved in this particular directory. And if you don't do that, R will not be able to find your files. But in order to do that, you need to define where your working directory is.
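As a sketch, such a .Rprofile might contain just two lines. The prompt string and the path are of course mine; every collaborator would put their own local path there.

```r
# ~/.Rprofile -- executed every time R starts up
options(prompt = "R > ")   # prepend an "R" to the default "> " prompt
CBWDIR <- "~/CBW"          # alias for this machine's copy of the project
```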
Now, if I write C:\R files\2015\microarray project\Boris wild ideas\R folder, this is going to be specific and correct only for myself. If I give that to my collaborators, it won't work, so it makes it awkward to exchange scripts. However, if we agree on a particular name for our project, and we each put the definition of where that directory is on our computer into the .Rprofile file that gets sourced on startup, everybody can use the same alias. So the alias that I've defined here is CBWDIR. If I write something like setwd(CBWDIR), I automatically get taken to the correct directory. And if my collaborators execute the same command, which I might have in a script, they also get taken to the right place on their machines. Now, none of this is crucial for what we're doing right now, because, for example, if you use RStudio, we defined what the project directory is when we set the project up, so RStudio will already take us to the right location. But it's good best practice, and it's the kind of best practice that's usually not documented anywhere: good practice for working in a collaborative setting. That's why I'm mentioning it here. So: R starts, R executes your .Rprofile, and some things might happen in that .Rprofile. What does the tilde mean? The tilde on Macs and on Linux computers is my home directory. I actually don't know if it's the same thing on Windows, so that's something that doesn't work generally for everyone; this is already something that everybody would need to make specific to their own system. So on Linux computers and on Macs, the tilde is my home directory. Is there a default home directory variable for Windows? Something like percent-sign HOME? Okay, so that would possibly work in that case. Okay. Just making sure I'm now back again in my project directory. So: you have a project directory, and it doesn't matter exactly how you got there.
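So the first lines of a shared script can stay identical for everybody, assuming each person's .Rprofile defines a CBWDIR variable pointing at their own local copy of the project:

```r
# First command of the script: works on every collaborator's machine,
# because CBWDIR is defined in each person's own .Rprofile
setwd(CBWDIR)
getwd()   # confirm that we landed in the project directory
```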
Let's maybe use RStudio, so everybody has the same experience; I'll probably be switching between both. Good. So the first command that I'd like you to type is just getwd(). The result of that should be a string which points to the workshop directory that you've just set up. The next thing is to get this file, the file on the wiki that we just went to, this intro module first steps file, onto your computer so that we can load it. There are many different ways to download data files or text files from the web. One easy way is to go to RStudio, open a new file, an R script, then select all, copy, and paste. So now the whole file is here, and when you've done that, you see that the tab label is red. The tab label turns red whenever you have changed something in your file; it's a reminder that you've changed something and you probably want to save it. So that's what we'll do now: we'll save it. RStudio is smart enough not to make us look everywhere on our computer, but to put us into our working directory, so you are directly where you want to be working. What happens if you move the project folder around afterwards, outside of RStudio? Good question. I suspect it won't work, because I suspect it won't automatically update the working-directory definition in the project file that RStudio sets up. RStudio sets up a project file where it keeps a lot of information about the project, and I don't think that gets automatically updated if you move the folder outside of RStudio. But maybe it does; I've never done it. If your folder really is in the wrong place, maybe the easiest thing is to just create a new one.
Well, at that point, if you start reorganizing things, then just say File, New Project, start a project in a brand-new working directory, define that, and once it's created, copy your files over there. You ended up in a subdirectory? Okay. And now, very likely, if you close RStudio and open it again, it will be in the right place when it starts up; it remembers the right place if you've been there before. The key at this point really is to issue the getwd() command and find that the result is the directory that you'll be using for the projects in this workshop. You can then also issue the command list.files(), which tells me what files I have in that directory; it's just a directory listing. Now, something that's often a bit awkward is finding the right directory in the first place, and I'd just like to show this to you because it's really neat in the plain R interface. It doesn't work the same way in RStudio, but if you drag and drop a script file onto R, it interprets this as a command to source that file, i.e. to execute the commands in that file. And as part of that command, it expands the full path of your file, and you can then just copy and paste the path from there. So please note this down, because there will probably be occasion to use it. As with many things, there's also a menu command to do this in RStudio: if you go to Session, Set Working Directory, you can set it to the source file location, or choose a directory. And that's of course also a way to get a complete file path, because I can choose any directory on my computer, and after I've chosen that command, it gets me the full path. And so I know how my computer understands that particular string. It can be a bit tricky.
For example, on Windows computers, where the directories in a file path are separated with a backslash, that backslash has a special meaning in R strings: it's an escape character. For example, the character n prepended with a backslash, "\n", means newline. So in order to handle the backslashes in a Windows file path, R needs to treat them specially and know that these are not escape characters. One way to do this is to escape the escape character: a double backslash in a string is just a literal backslash, not an escape character. So the string "\\n" is backslash-n, and that's what you would need to do to properly specify strings that contain Windows file paths. However, it's a bit easier than that, because R works with forward slashes in paths and translates them for you. If you say getwd() on a Windows computer, do you get backslashes or forward slashes? Forward slashes. But that's not actually the right syntax for Windows, because on a Windows computer the separator is actually a backslash. So R automatically translates this for you: it uses the forward slash in all its own representations, but when it issues a system command, it issues it in the right way, i.e. with backslashes, so that the operating system understands it. So don't be confused by that; R is doing something automatically that is actually helpful. So I usually write setwd() with the path as the first command in my scripts. Now, using scripts, and that's a really important point: you can do everything in R just by typing things on the console. I can work in the console, issue commands, and do whatever. But even though what you do on the console is saved when you exit, in a so-called .Rhistory file, you really have to consider this to be volatile information. It goes away. And as I said, R is big and complicated, and you're likely to make mistakes. I make mistakes all the time.
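A quick way to convince yourself of the escaping rules; the Windows path in the comments is made up for illustration:

```r
cat("a\nb\n")    # "\n" is an escape sequence: a and b print on separate lines
nchar("\\n")     # 2: one literal backslash plus the letter n

# These two spell the same (made-up) Windows directory:
#   "C:\\Users\\me\\CBW"   -- backslashes, each one escaped
#   "C:/Users/me/CBW"      -- forward slashes; R accepts these on Windows too
```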
I type something, and then it doesn't work, and then I go back and fix it. For example, I type 2 * pi here, when what I actually wanted was pi^2. So I can use my up arrow, retrieve the last command, edit it, and change it to pi^2. And thus I can issue command after command after command. But a much more sensible way is to simply type everything you do into a script file and save that script file. The really important thing is that your research, your work, becomes reproducible. It's easy to add comments; it's easy to understand later on what you were doing, and to bring yourself back up to date. And it's also much easier to reuse code that you've written once for another purpose. So my work in R always goes into a script file. Even small and trivial things I write into a script file: they have a habit of not staying small and trivial, but becoming larger and more convoluted and more complicated, and turning into a real development project. And then a script file is where they need to live. So let's start a script file for this day. I have a little template, an R script template, that contains a few things that are often useful to have in a script file for some kind of small project. So open this link on the wiki, open a new file in your RStudio, and copy and paste. And I will name it firstScript.R. Now, strictly speaking, putting the file name into the first line of your code is completely superfluous, because you'll know what the name is once you save it. But when I'm switching around with many different windows open, I find it useful to be able to check it here. And I can write down the purpose. Version numbers are really useful when you want to write reproducible code: save a backup every now and then, continue, and give it a new version number. Date, author, and so on: you can identify important metadata about the project that you're working on.
For example, you can identify the kind of input your project uses, the kind of output it creates, and the dependencies it needs. Dependencies would be data resources, or code resources, or some kind of utilities. Usually your scripts are incomplete; like poems, they're always incomplete until you abandon them. So you might have a list of to-do items that you would like to tackle next, or notes to other people who want to use your script, and so on. Then there are sections for parameters, packages, and functions, which we'll get to a little later. Right now, I would just like to have a little space here into which I can type things as I go along. Now I'll save this. Good point. Okay. Now, probably the most productive way to work is to copy the things from the first-steps code file that you want to keep and paste them into your firstScript.R, and this is also where you can type all your notes and comments and things that you want to remember, so that your notes and the things that you've done in R are all in the same place. Now, here is a command, and what this command does is give me a listing of all the files in my home directory whose names end with the string ".txt". David, is that the right way to specify the home directory on Windows? Okay, so this may or may not work if you're on Windows; give it a try, and it will find all the files in your home directory that end in .txt, or not, if that's not the right way to specify it. Now, I have this in my source file, but of course what I actually want is to have it executed in R, on the console. So I could take it from the source file, paste it into the console, and press return, and it tells me that I have 10 text files in my home directory.
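The header of such a template might look like the sketch below; all the field values are placeholders, to be filled in per project.

```r
# firstScript.R
#
# Purpose:      Try out basic R commands for the workshop.
# Version:      0.1
# Date:         2015
# Author:       (your name)
#
# Input:        (data files this script reads)
# Output:       (files or figures it produces)
# Dependencies: (packages, data resources, utilities)
#
# ToDo:         (what still needs to be done)
# Notes:        (anything a collaborator should know)
# ==============================================================

setwd(CBWDIR)   # CBWDIR is defined in the .Rprofile

# ... commands get typed below, as we go along ...
```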
But of course that's awkward, always copying and pasting things around, and there's a much more elegant way: if I have my cursor in the line in the script file and then, on the Mac, press Command-Return, it automatically gets executed in the console; I think on Windows you need to press Ctrl-R. Okay, right, so the tooltip tells me, I've never used this, I just know how to type it, the tooltip tells me this means "run the current line or selection". And that bit about the current line or selection is important, because it's a convenient way to execute parts of R commands. For example, these are actually two nested commands: the inside command is this directory-listing command, and the outside command is length(), and length() gives me the number of elements in a vector. So if I want to execute only the inside part, I just select it and run it, and I get a vector with all the matching file names in my home directory. Or I select the whole thing and I get the count. Or I select more than one line, and I get 10 for my home directory and zero for this other one, because that directory doesn't exist on my computer. So that's the convenient way to work with R scripts: you type your commands, you select them or just place the cursor in the line, you execute them, and then you change things around and re-execute them. For example, I can change the pattern from ".txt" to ".doc", and I see I have one .doc file in my home directory. Okay. Now, in the console we can use the up-arrow key to retrieve previous commands, edit them, and execute them again. In the script file, you've already got your commands listed. You can also use the command source() with a particular file name to execute an entire script at once. So, maybe just to make sure that everything works, we'll try the source() command. As a task, I would like you to create a completely new R script.
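The nesting that was just demonstrated, written out; the exact pattern is my guess at what the first-steps file uses, and the counts will of course differ on your machine:

```r
files <- list.files("~", pattern = "\\.txt$")   # inner command: the vector
files                                           # all .txt files in home
length(files)                                   # outer command: how many

# The same thing nested on one line, with a changed pattern:
length(list.files("~", pattern = "\\.doc$"))
```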
In that R script, enter just one command, print("Good morning") or "Time for coffee" or "Hello world" or whatever, and save it in your project directory under the file name test.R. This is a file which we'll just use to try things out and then throw away again. Then you go into your script file, firstScript.R, and you type the command source("test.R"), and then you execute that line. R will then open the script file test.R, execute all the commands it finds in there, so it will greet you with a hearty and heartfelt good morning, and then close that file. And maybe we'll change it to "time for coffee" instead. And when you're done with that, and your source() command actually prints "time for coffee", you've earned yourself a well-deserved coffee in our first morning break.
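The whole exercise in a few lines; here test.R is written programmatically with writeLines(), though typing it by hand in the editor does exactly the same thing:

```r
# Create test.R in the working directory ...
writeLines('print("Time for coffee")', "test.R")

# ... and execute everything in it:
source("test.R")   # prints: [1] "Time for coffee"
```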