OK, so we're back. If you're not in the classroom but hearing this recording offline, welcome to you too. Where was I? See, that's modern technology. It always throws wrenches into your gears. Right, so we'll lean heavily on biology. But working in a computational sense with biology is above all an art of abstraction: taking information from biology and casting it into some kind of conceptual framework that allows us to represent biological facts in a computer, to store them, to retrieve them, to manipulate them, to compute with them in many different ways.
And it would not be obvious why we would be using R to do that. In fact, bioinformatics has a history. Ten years ago, all of bioinformatics was done in Perl, which is a convenient scripting language. Then people found Perl not so convenient to work with, and many things gravitated to Python. And then all of a sudden, R came up, and everybody adopted R. And I think now it's fair to say that most of the day-to-day work in bioinformatics is in some way connected to R. And why is that? Well, R is a curious language. Syntactically, it's at some level pretty straightforward. You write things into the computer to operate on some things. In a way, it's a functional language. It's designed as a functional language. So the function is the central idea here. We have data that go into functions, and data that come out of functions. And this idea of looking at the world through functions and not through static data is quite conducive to the way we do our everyday work. We pull data into our systems, then we operate on them, and we put data out. So often, we can sketch what we do day-to-day as a data flow diagram. And the individual steps of that data flow are well supported by things like R. But a really interesting aspect of R is that it was initially written as a statistics workbench. 
But it was written in a very general way, so that it's easy to use it in different domains as a DSL, or domain-specific language. I think this is the most prominent and least understood feature of R. It's actually an excellent language for writing computer languages, i.e., domain-specific languages. And statistics was perhaps the first example of a domain-specific language built on top of R through all of the extensions and packages and special functions. But it's very easy to do that in different domains as well. And one of the domains is molecular and computational biology, especially through the Bioconductor packages that have great support, really excellent support, for many, many aspects of high-throughput biology. So R is a scripted language. It's very flexible, which is also important. Because if you think of a... who here has ever worked with Java? No Java? Fortran. Fortran? Really, yeah. You've really done Fortran? Oh, wonderful. On a VAX? Oh, amazing. Undergraduate computer science. OK, C. Lauren has done C. Awesome. So what else is there? Perl? Python? A few Pythonistas in here. Ruby? Nobody's on the web. JavaScript? Ew. OK, most of you, I think, have not worked with computational languages before. OK, good. So that's excellent.
Let's turn to philosophy for a moment. There's a very well-known book. I think it's from the 16th century. It's called The Gateless Gate. And it collects a number of Zen koans. And it tells the story of how Hogen, one of the patriarchs of Zen philosophy, visited the patriarch Jizo in Japanese, or Dizang in Chinese. Jizo said, where are you coming from? And Hogen said, I'm on a pilgrimage. And Jizo said, what is your pilgrimage about? And Hogen answered, I don't know. And then Jizo said, not knowing is the most intimate. And that's at the core of Zen philosophy. Not knowing is the most intimate. Not knowing does not mean knowing nothing. You all know a lot. You all know a lot about biology. You're all experts in molecular biology. 
It just happens that you're in a state of not knowing about how to write computer programs. But that doesn't mean you're at a disadvantage, because you have very clear goals and very clear ideas of what you need to do. And the only small step that you need, after you've become intimate with the problems that you have, which is the most important part, is to then solve the few easy steps to actually make them work. So this is our motto for today. Not knowing is the most intimate.
Among the things we'll be doing today: you should be able to start up and work with R and RStudio. I truly hope you have installed RStudio on your computer and that you have installed Git, and that you can download packages and projects through Git. If that's not the case, you're in trouble and we'll need to fix that. So we'll get to that very soon. Of course, Greg and Lauren will be able to help you. So it's not that you have to wear a cap and stand in the corner. We don't really do that anymore. We'll briefly touch upon configuration files. I'm not going to demonstrate a whole lot, but we'll do a little bit: how to customize our sessions. It's probably more important to know what the reasons could be if something doesn't work as intended. We'll work with projects and we'll talk a little bit about version control. This is part of making you think about programming in a more structured way. Not so much ad hoc, but in a way that's actually reproducible, where you have scripts that you can revisit and edit and where you can send off a manuscript. And then after the reviews come back, not even nine months later, you can go back to your script and actually reproduce what you did and make the few slight changes that the referees might have asked for. We'll do some basic R commands and above all, we're going to go through a number of examples of how to structure computational tasks and cast them into an R script. 
And in practice, we'll talk about being able to read data, to select, to filter, to rearrange, to combine, to basically start all the little ways of analytics that allow us to look onto and into data. And we'll write little functions and programs and create simple analyses. And above all, you should have an idea of how you can write errors in your code, what will happen if you have syntax errors, how you recover from that, and how you can fix it. And that's the funnest part. So a lot of it will be hands-on programming. But as usual, programming shouldn't be the majority of your task when you write computer code. The most important part is thinking. So we'll spend more time on how to structure questions, how to think about them. And then I hope it should be easy to cast them into code.
So, introductory resources for R. There are many introductory resources on the web. Mine is probably not one of the best. But what we usually do in these introductory resources is walk you through the components one by one. So if you've done the R tutorial, the introductory tutorial, we start with constants. And then we talk about scalars and about vectors and about packages and tables and data frames and so on and so on. And what we're trying to do today is a bit different. Because most of you don't really want to become programmers. If you really wanted to become programmers, you'd probably already be programmers. You'd have started writing your first public Linux kernel in high school. You want to get some biology done. You want to focus on the biological questions. And R is a fantastic tool for that. But the kind of stuff that you're really worried about is not how to write the best program, but how do you even start expressing an idea in code? How do you even get started in the first place? What's the first thing you do? Oh my god, something happened. What do I do now? So that will happen a lot. And how do you keep up with things? Very, very difficult. 
And how can you remember all these R functions? And that's what I hope you'll be more comfortable with when we're done today, except for one of these points. How can I remember all of these functions? That's often overstated. You have to get a sense of the question that you ask. And then it will become easy to find a solution in R for how to do it. There are dedicated resources on the web. There is a very active R-help mailing list that I contribute to from time to time. There is an R section on Stack Overflow. That's very helpful. Not to mention Biostars. Apparently, there's a new Stack Overflow on bioinformatics too. So there are a lot of resources out there. And if you ask questions of these resources, people can be extremely helpful if you ask them in the right way. And the right way to ask them is, well, there are two ways. So the best way to ask is, if you are able to ask your question like a puzzle. Everybody loves to solve puzzles, especially computational puzzles. So if you can really put up a small reproducible example of where you're stuck, where you have five lines of code and then an error message, and that's really reproducible, you will have a response to that in no time at all. And then you can move on. And the second way to get help is to claim boldly that something, something, something cannot be done in R. You'll get about 50 replies, 47 of them irate retributions and three useful comments. But they'll also be there very quickly. So asking questions correctly is important.
So we'll learn R by working with R. And we'll look at a typical problem and develop strategies for how to solve it. And part of the strategy will involve using R, but really it's about learning to learn. That's the part I cannot really teach you. That has to come from you. So what you'll need to do is participate actively, ask questions, bother our TAs. They don't want to sit here and be bored. And if there's anything you don't understand, don't let it pass. 
Letting it pass, that's the step to knowing nothing. You're allowed to not know and then fill in the gaps. But knowing nothing is not a good thing. OK, be active, think ahead. We work on questions. You should always think for yourself: how would I approach this question if there wasn't this guy standing there and telling me how to do things? How would I do it anyway? That primes your mind to be able to engage with and derive some meaning from what I say. Because our approaches might be the same. Yay, that's very good for you. Or our approaches might be different. Then I might have to learn something about how to understand your problem better. Or you might have to learn something. Take notes, write a lot. This helps your focus. When I sit in seminars, I never have idle fingers. I always write notes and paraphrase what I've just heard. I never look back at my notes. That's not necessary. That's not what they're written for. They're written to stay focused in the moment and to make sure that the brain has actively engaged with what's being presented. Ask questions. Play with things. Copy code, write it, execute it. See how things break. See what happens next. And then try to fix it. That's the fun part about it. Finally getting the computer to do what you thought it should be doing, not just what you actually told it to do.
So if you navigate to this GitHub page, there should be instructions and links about how to access the first project here. And we'll get to that in a moment. Anne has mentioned the green post-it. That's our green post-it. And we will mostly work from scripts and resources that I have bundled into a project. So for this day, one R project. Now, this is a level or two beyond the simplest way to use R. Of course, the simplest way to use R is just to use R on the command line and type commands into the command line. I hope you'll quickly appreciate the benefits of working with projects. 
And we can install projects via RStudio from their GitHub source. And we will load our first project now. So the most current version of this introductory tutorial lives at this page on https://github.com, Hugin R-intro. And you will be able to access it there for the foreseeable future. Now to load it, look at step seven. You should have created a project directory. It was part of the introductory tutorial. If you have not created a project directory yet, create one on your computer now. It's easiest if it's not many, many, many levels below your home directory, i.e., for the purposes of working efficiently today, it would be easiest to have a directory right below your home directory. And try not to put any spaces, underscores, hyphens, colons, parentheses, question marks, and so on into the name. That will reduce some of the problems that you might encounter. Then the first step is to open RStudio, select File, New Project, click on Version Control, click on Git, enter this string here as the repository URL, type a tab character, and so on, and so on. And I'm going to load my local version here. And when everything is correct, it should look something like this, with a list of files of the project here, with one function that has been defined, with today's project script here, and the program prompting you to type init() to begin. We might look at that function. Basically, what this does is it copies and renames one of the files. And as a result, you should then also have a file MyScript.r among your files. If you do not have a file called MyScript.r, put a red sticky on your laptop. If you click on MyScript.r, it opens this file in your editor. And this is the file that I've prepared so you can write your own notes and program examples and play with things and experiment with things. So put things in here, edit this file, save it from time to time. Now, why is that? 
It might happen that during the course we update some of the files, and then we can update things simply by re-downloading from GitHub all of the project information that we've changed. The term is that we can pull things from GitHub to refresh the project and thus add new information. Now, if you've edited any of the files, like for example this file here, and saved and committed the changes to version control, then Git will complain and say, there are unsaved changes here, and I don't want to overwrite them. You have to reconcile that first. And that's, of course, not what you want to do. So what we do instead is we create a file MyScript.r. It's based on a template that's downloaded with the project, but it doesn't actually exist in the project that's downloaded from GitHub. It is new and unique on your computer. And since it's new and unique on your computer, you can edit it at will, and you can save it, and it will not be overwritten when we update the project, perhaps at some other time. All clear? So your notes go into MyScript.r, but we will drive them with this script. And we can go through the script line by line. We'll try to understand the code, and we'll execute code either by typing it on the console, like so, or by selecting something and then pressing Command-Enter. I think most of you have Macs now. How many Windows computers do we have in the room? About one third. So I think Command-Enter also works on Windows, does it? Control-Enter? OK. So whenever I say Command-Enter or just execute a statement, it means select it, and then press either Command-Enter or Control-Enter to execute it. Now, to execute things, there are several things we can do. If our cursor is in a line, like in line 102 here, and we press Command-Enter, then the entire line is executed. We can execute more than one line by selecting more than one line and executing that, and then it's done all at once. 
But we can also select less than one line and execute that alone, like so. So that's a very convenient way to reproduce work that you have in a script. There's a table of contents here of things that we do. From time to time, I might refer to this table of contents. Of course, in this computer script, we don't have pages, but we have line numbers on the left-hand side. So section 2, storing data in R, would start at line 299. Of course, this script is editable. So if you delete any lines or add any lines, then the line numbering of the table of contents doesn't work anymore. So you'll have to mentally do some arithmetic, which might be fun. Where else are we? OK. So MyScript.r will be your lab journal for the session, and most of what we'll do now will come from rintro.r.
So let's try some very, very simple things. Let's try to get R to do something. The layout of RStudio, I hope that you're vaguely familiar with it. This is our editor pane. Pane as in P-A-N-E, not P-A-I-N. Pane, like window pane. Not like having to sit in a workshop while the sun is shining outside. It opens scripts. You can open more scripts in tabs, and you can open them and close them again. This is the environment pane. So the environment pane is kind of important, because it lets you see what values you have defined in your so-called environment, which variables have been assigned and instantiated with values. And it also tells you something about functions that are known to the system. So for example, are we going to look at this init function? Anyway, the init function was preloaded when the project started up. And it just tells R to source a particular file. This means run a file. So when R started up, it produced a little message that I had told R to produce, which asked you to type init(). But the function init had already been defined, simply to take the file init.r and to source it, i.e. to execute the file. 
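The idea of a preloaded function whose only job is to source a file can be sketched as follows. This is a minimal illustration, not the project's actual code: the file name `init_demo.R` and the variable `initMessage` are made up for the example, and the script is written to a temporary directory so the sketch runs anywhere.

```r
# Hypothetical stand-in for the project's init.r: a one-line script
# written to a temporary directory so this sketch runs anywhere.
scriptPath <- file.path(tempdir(), "init_demo.R")
writeLines('initMessage <- "init script was sourced"', scriptPath)

# A function whose only job is to source (i.e. run) that file.
# local = FALSE evaluates the script in the global environment.
init <- function() {
  source(scriptPath, local = FALSE)
}

init()               # runs every line of the script
print(initMessage)   # the variable created by the sourced script
```

Nothing in the script runs until `init()` is called, which is why the project can show a startup message first and leave the decision to the user.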
The source command allows you to embed files and scripts in scripts and have relatively complex architectures. It's not always a good idea to be complex, but if you need to pull in code from somewhere else, you do it with the source command. So what's in init.r that we've just run? This is an R script, so it executes R functions. And what it does is, it creates a local copy of MyScript.r if that hasn't been done yet. So as I just said, it takes a little template we have, and it makes a copy and renames it so that we can edit it without interfering with the version control later on. And the command here is a conditional: if not file.exists MyScript.r, then file.copy temp.r to the file MyScript.r. That's very simple. temp.r is under version control, and it has the same contents as MyScript.r. So this allows us to initialize a file with the contents. Then we can edit it, but once it exists and once you've edited it, it will not be overwritten, because the file.copy command is only executed if the file doesn't already exist. And then it tells R to start out by editing the source file rintro.r, which has these commands. So, just to make it a little more transparent what happened when you typed init(): we can put all kinds of little things into such a startup message. And if you're interested, you can ask me why we do it that way and don't just run it all automatically. It's a little bit involved.
OK, so assigning variables. A single letter or a number of letters or a word or a name is usually what we refer to as a variable. It's a placeholder into which we can put information and in which we can store information. So here I take the letter 1 and assign it to the variable x with this command. Now some languages would use an equals sign at that point and define x equals 1 or x colon-equals 1. The people who wrote R decided to use this little arrow operator. 1 goes into x. And initially, coming from many different other languages, that looked very strange to me. 
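Going back to the conditional copy described for init.r: the pattern is "copy the template only if the personal file does not exist yet". The sketch below reproduces that idea under stated assumptions. The helper `copyOnce` and both file names are hypothetical, and everything happens in a temporary directory so no real project files are touched.

```r
# Work in a temporary directory so nothing in a real project is touched.
workDir <- tempfile("project_")
dir.create(workDir)
template <- file.path(workDir, "template_demo.R")   # hypothetical template file
myCopy   <- file.path(workDir, "MyScript_demo.R")   # hypothetical personal copy
writeLines("# notes go here", template)

# Copy the template only if the personal copy does not exist yet.
copyOnce <- function(from, to) {
  if (!file.exists(to)) {
    file.copy(from, to)
  }
  invisible(to)
}

copyOnce(template, myCopy)             # first call: creates the copy
writeLines("# my own edits", myCopy)   # simulate editing the copy
copyOnce(template, myCopy)             # second call: does nothing
readLines(myCopy)                      # the edits are still there
```

Because the personal copy is never tracked by version control and never overwritten, later updates of the project cannot collide with the notes written into it.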
And R is, I believe, the only language that does that. But if you get used to it over time, you learn to appreciate that it's then very, very hard to confuse the assignment operator, equals, with the equality operator, the logical operator testing whether one value is the same as another value, as we do embarrassingly often in other languages. So we'll get to that later, but testing whether x now equals 1 is done with the double equals sign in R. This is the assignment, and this is testing for equality. Now if I execute this line, what would I get? If I execute the line that I just put on the console, what would I get then? False or true or null or nothing? We'll discuss a lot of that. False is correct. Why false? Didn't we just assign 1 to x? Right. I didn't assign the number 1 to x. I assigned the letter 1, the numeral 1, to the variable x, not the number 1. So computers are picky about the types of their data. You have to distinguish numbers. Actually, you have to distinguish integers from floating point numbers, from characters, or more generally from string data, and from logicals. Now we'll harp on that from time to time as things go on, but it's simply important to remember that this 1 is not the same as this 1. This is in quotation marks. It's the numeral 1. So if I say this... come on. Are you kidding me? I go on for half an hour, and it does that to me? That is... anybody know what's going on? Lauren, Greg? There must be some kind of coercion going on there. But OK, I need to look that up. I've never come across that. That's so weird. Does anybody get false instead? You get false? You got nothing? You got true? Who got true? And who got false? OK, we'll figure that out. We'll figure that out. That's not what was supposed to happen. Well, OK. OK, we'll figure that out. This is truly amazing. Maybe I'll have to post that on the R-help mailing list. I wouldn't have expected that. So testing for equality can have its pitfalls. 
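The surprise above is most likely R's documented coercion rule for comparisons: when `==` is given a character value and a number, the number is converted to a character string first, and the two strings are compared. The following sketch, using the same `x` as in the session, shows the effect and one way to test without coercion.

```r
x <- "1"          # the character string "1", not the number 1

x == 1            # TRUE: the number 1 is coerced to "1" before comparing
identical(x, 1)   # FALSE: identical() does not coerce, and the types differ
typeof(x)         # "character"
typeof(1)         # "double"

# Because strings are compared, the way a number is written matters:
"1.0" == 1        # FALSE: 1 becomes "1", and "1.0" is a different string
```

If the type matters as well as the value, `identical()` or an explicit check with `typeof()` is the safer test.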
OK, so here we've assigned the letter 1 to x. And we know that it's the letter 1 because if we ask what x is now, simply by typing x, we get the result 1 in quotation marks. If we were to assign the number 1, then we get 1 not in quotation marks. And that is actually not the same thing. So I don't know what's happening here. We can assign other things to x. So we can overwrite it and get new values for x. So pi, for example, 3.141593 and a bit. So simply typing x is actually the same thing as doing this: print(x). Print is another way to make the contents of a variable visible. So the difference is that print is a function. So print takes the contents of the variable x and displays it on the console. If I simply type x, this is an expression. And the result of a variable as a single expression is that variable itself. So that gets returned. It looks the same, but it's slightly different.
Now, assume you've done that and you remember from high school that pi has a few more digits than that, and you would like to understand how many digits R actually stores for pi. Is it just that? Does it truncate that? Or is the length of what gets printed limited? So imagine that you're not in this room and you don't have Greg or Lauren or me to ask. So what would you do? How can you find out how many digits R stores and how you can print them all? It's your first question about R. How do you solve it? What's your strategy? Google is a good answer. I find Google more and more useful for asking R questions. I usually just type an R in front of the question and then I get answers. I don't know if Google simply remembers that if I ask about R, it's not like in abbreviated text typing, "r u serious" or something like that, but that I'm talking about the language. So I don't know if it remembers that or if it's become smart enough to contextualize it, but that usually works well. R and then some question in any free form will usually get me the right answer very quickly. 
So what does Google say? Digits option. Use the digits option. You are absolutely correct. Graham, what does it mean? Use the digits option. We have two Grahams? No, you're great. Sorry, functions have the option, so it's not the default. It changes the default. Of what? Of the function print. Of the function print? That's pretty good, yeah. Did you know that or were you able to just infer that on the fly? It just came to you. Right. So functions have arguments, and arguments can be instantiated with parameters. So we also colloquially say functions have parameters. Print has a number of parameters. Actually, it has quite a few. So if we type ?print, then in the pane where we have information about files and what packages are loaded and plots that we've made, we also get the help window. And we can type things that we want help about here, or we can type question mark, function name, here, and then we find information about print. So this is a typical R help page. It has a mention of usage, so it minimally needs one argument, what it should print. And then the three dots mean there can be many other options that you can pass to it. Now, as you've seen, just print(x) is enough, because I've assigned pi. Right, so if I say print... OK, q happens to be a function. OK, here we go. Error in print(Q): object 'Q' not found. So capital Q has not been assigned yet. Right. So you have a number of arguments. I think I said arguments and parameters are often used interchangeably. Formally speaking, an argument is basically the slot in which a function receives information. The parameter is the value that we pass into that slot. But if you say arguments when you mean parameters or parameters when you mean arguments, that's all good. So the argument is essentially the slot where the information comes in. The parameter is the value that we put into that slot. So x here is the argument and Q here is the parameter. But it's still called x. 
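The slots that a function offers can be listed from within R. The sketch below looks at `print.default`, the method that handles ordinary vectors; R's own documentation calls the slots "formal arguments". The name `notYetAssigned` is hypothetical and chosen only because no such object exists.

```r
# The slots (formal arguments) of the default print method in base R.
names(formals(print.default))

# Asking for help; both forms open the same page in an interactive session.
# ?print
# help("print")

# Printing an object that has never been assigned raises an error.
# notYetAssigned is a hypothetical name that does not exist anywhere.
msg <- tryCatch(
  print(notYetAssigned),
  error = function(e) conditionMessage(e)
)
msg
```

The list of names includes `digits`, which is the option the audience suggested.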
So let's just call them parameters or arguments. Don't worry. There's a big Wikipedia page on that. It's quite long, actually. And one of the options, or one of the arguments, for print is digits. So that allows you to specify the number of digits that we use. Now, here's the way you do it. If we have additional arguments, we separate them with commas. So there's a list of comma-separated values within the parentheses that specify what the print function is going to do. So we type digits. Actually, RStudio knows about the valid arguments of print. So as soon as I type two or three or more letters, it will autocomplete what it knows about functions here. In this case, it's digits. And I can just press tab to complete it, which is very convenient. I can also hover over print. And then it will show me, in this little yellow box here, the function signature, i.e., what the function name is and what the list of arguments is that I can pass to it. So the function signature in this case, as I said, is that it needs at least one thing it should be working on. But then you can pass a lot of other options. And in this case, let's say digits is 10. So we get more digits of pi. Now note one small difference. This is inconsistent. But within argument lists, we use equals signs to specify the values of the arguments. So if I do this here... again, my world breaks down. I thought this didn't work. It didn't use to work. Anyway, it appears to work. We just don't do it. Nobody does it that way. I've never seen it in the wild. I thought it didn't work. Maybe it didn't work. No, it works. See, we all learn something new here. That's very funny, actually. Anyway, everybody just writes it in this way. And even though it would seem less inconsistent to write it this way, there's a balance between being consistent and being understandable. 
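Why the arrow "appears to work" inside the argument list can be seen in a short experiment. This is a sketch of what probably happened on screen, assuming the call was of the form `print(x, digits <- 10)`: the arrow performs an ordinary assignment in the calling environment, and the resulting value is then passed by position.

```r
x <- pi

# Conventional form: '=' names the argument inside the call.
print(x, digits = 10)

# This also runs, but it does two things: it creates a variable called
# 'digits' in the calling environment, and it passes the value 10 by position.
print(x, digits <- 10)
exists("digits")   # TRUE: a side effect of the call above

# Positional passing only works here because digits happens to be the second
# argument of print.default(); with '=' the order would not matter.
out1 <- capture.output(print(x, digits = 10))
out2 <- capture.output(print(x, 10))
identical(out1, out2)
```

With a function whose second formal argument is something else, the arrow form would silently set the wrong argument, which is a good reason to stick with `=` inside calls.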
So yeah, try to write in the way that somebody would expect things to work, not just for your own sake, because you will want to read and understand the code that you wrote half a year ago or nine months ago when you first sent in your manuscript to the journal, but also for whoever is going to maintain the code after you. That may be a project student. That actually happens quite often. You write code, and then you think, oh, it's just five lines, and I'll throw it away. And then the five lines turn into seven, and then 15, and then 124. And then all of a sudden, it becomes a project, which you never intended it to be. And at some point, you hand it off to a project student, and now the poor guy has to make sense of what never ought to have been more than five lines in the first place. So put in comments, structure it well. It will make everybody's life much easier.
OK. Now, we have 12 digits here. What you notice, if you look carefully, is that digits are not simply truncated but rounded as they're being printed. But yeah, 12. Is that all? Can we do 13? 14? That's weird. Understand that one? What happened here? Why are 13 digits the same as 12 digits, but 14 digits have two more? Maybe it rounded at 12 digits for 13 as well? You have the right idea. What happens with 13? There would be a zero at the end, on the right side? Exactly. It would be a zero at the end. And apparently, print has some internal instruction not to print trailing zeros. So this is why 13 was truncated. So how many can we get maximally? 22. OK. So now we finally have our answer. What did we need to actually come up with the answer? On one hand, we Googled for it. On the other hand, we asked around among people who know more than we do. Above all, we experimented and played with it and tried to reconcile things that happened that we didn't understand. And that's really important. In order to learn, you have to learn actively. 
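The experiment just described can be repeated in a few lines. The calls below only use the `digits` argument of `print`. The comparison at the end checks the observation that 12 and 13 digits give the same output, without relying on reading the console by eye.

```r
# How many significant digits are shown is controlled by 'digits'.
print(pi)                # default number of significant digits
print(pi, digits = 12)
print(pi, digits = 13)   # same as 12: the 13th digit is a 0 and is dropped
print(pi, digits = 14)
print(pi, digits = 22)   # the largest value used in the session

# Capture the printed text to compare the outputs programmatically.
d12 <- capture.output(print(pi, digits = 12))
d13 <- capture.output(print(pi, digits = 13))
d14 <- capture.output(print(pi, digits = 14))
identical(d12, d13)      # TRUE: the trailing zero is not printed
identical(d13, d14)      # FALSE: 14 digits show two more non-zero digits
```

Capturing output like this is a small but useful habit: it turns "that looks the same" into something that can be tested.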
And in order to learn actively, you need to experiment and do things and type things and play with things. So now we've solved our task of how to control the number of digits printed in a print expression. You can set that as a global value, which is probably not a good idea. You can set many things as global parameters. But in general, you often use R packages, i.e. you import computer code that somebody else has written. And these packages usually make assumptions about what global options are being set, i.e. that the global options are set to some default state. And if your global options are not in a default state, then there may be subtle inconsistencies that at least make it awkward to work with a package and at worst can lead to erroneous results. So I think it's generally a bad idea to change global options. And if you need to change things, change them locally, for one script and one script only.
OK. And then there's the function sprintf. One of my favorites, because sprintf really gives the most control over what we are trying to do. Now, the syntax is a little bit arcane. So sprintf, if you're familiar with C or C++ or Perl, which has the same function, or Python, which has the same function: they all basically pull in that function from a C library on your computer. It's really very useful. So the way sprintf works, or one of its cousins, is you give it a format string. And that format string says, here we have a string that you're supposed to print. And that string contains placeholders for variables. Every percent sign is a placeholder for a variable. And you can have different placeholders for strings and for integers. And you can control the number of digits that are being printed. And you can control whether there should be leading zeros or whether there should be trailing zeros. Or if it's a string, whether there should be leading or trailing blanks. And you can do all manner of things that make it easy to produce well-formatted output. 
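Both points above, changing an option only temporarily and formatting with placeholders, can be sketched in a few lines. The values used here (`"geneA"`, 300, and so on) are made up for illustration.

```r
# Change a global option temporarily and restore it afterwards.
oldOpts <- options(digits = 10)   # options() returns the previous values
print(pi)
options(oldOpts)                  # back to whatever was set before

# sprintf(): a format string with % placeholders, followed by the values.
sprintf("%d samples", 12L)                  # %d: integer placeholder
sprintf("%s has %d bases", "geneA", 300L)   # %s: string placeholder
sprintf("%5.2f", pi)     # width 5, two digits after the point: " 3.14"
sprintf("%03d", 7L)      # padded with leading zeros: "007"
sprintf("%-6s|", "ab")   # left-justified, padded with blanks: "ab    |"
```

Saving the old options and restoring them keeps the change local to the script, so packages that assume default settings are not affected.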
And so this expression here means print me a string with a placeholder where I have a 50-digit long number, a 50-digit long floating point number, with 49 digits after the decimal point. And the variable that I want to use here is pi. Now, I haven't assigned pi, but pi is one of the built-in constants in the language. So there are a couple of these things. pi is one, letters is another, and a few more. OK, so if we execute this, we get this. I thought it was 22. So what this says is that even though print is limiting it to 22, the internal representation of pi inside the computer actually has more digits than that. Oh, yeah, is this correct? There's a zero at the end because I told it, you have to give me 50 digits. I don't know, is this correct? I never actually checked this. See, I should have checked this. OK, so actually, even though we get 22 digits, the number of correct digits is less than that. Now, why is that? If you wonder why that is, you have to think about how a number is actually represented inside a computer. So what we're printing here are decimal versions of floating point numbers that are represented in some binary encoding of little ones and zeros, which are fundamentally combinations of powers of two. So you can't actually take any number with an arbitrary amount of precision and store it on a computer. There are gaps between the numbers that you can represent. So what your computer does is, it chooses the floating point number that's closest to the requested number. So for example, 0.3333 something something something 148296 is the closest number to one third that can be exactly represented as a floating point number, and that is what is stored in the computer. So if we have, and this can be important, I'll get to you in a second, so if we have something like x is one third, and then we ask whether 3 times x equals 1... OK, something weird is happening here. OK, so apparently that's good enough. 
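The effect the speaker was looking for does exist, it just does not show up for one third. The sketch below assumes the on-screen format string was `"%50.49f"`, which matches the spoken description, and then gives two standard cases where a mathematically true equality fails, along with the usual tolerance-based alternative.

```r
# Ask for 49 digits after the decimal point. Only roughly the first 15 to 16
# significant digits are meaningful for a double; the rest reflects the
# nearest representable binary number, not the true value of pi.
sprintf("%50.49f", pi)
sprintf("%.20f", 1/3)

# 3 * (1/3) happens to round back to exactly 1 ...
x <- 1/3
3 * x == 1                          # TRUE

# ... but other mathematically equal expressions do not.
0.1 + 0.2 == 0.3                    # FALSE
sqrt(2)^2 == 2                      # FALSE

# Compare with a tolerance instead of testing exact equality.
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE
isTRUE(all.equal(sqrt(2)^2, 2))     # TRUE
```

`all.equal()` returns either `TRUE` or a description of the difference, which is why it is wrapped in `isTRUE()` when used as a logical test.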
I'll figure out an example here. The point, which this example didn't show although it was supposed to, is this: if you test for equality, it can be that mathematically speaking, two terms are identical. But computationally speaking, they are subtly different. And then if you use them for a logical test inside an R function, you will get a FALSE where you would expect a TRUE. Simply because of the limited resolution of numbers that can be represented on a computer. And this happens so often that the R FAQ, the list of frequently asked questions, has its own section on why R thinks that something that you think is equal is not actually equal on the computer. That is section 7.31, FAQ 7.31. It comes up so often that we remember the number. So what happened here is that we requested 49 digits after the decimal point from sprintf(). And it gave us the best representation it could with these 49 digits. But the number that is stored for pi on the computer is a slightly different number. So more and more digits can be printed, but they become meaningless. There are packages for arbitrary precision arithmetic, if you need that. In molecular biology, we usually don't. 50 digits of pi, I think, is determining the radius of the universe to better than the diameter of a proton. So three or four digits usually is good enough in molecular biology. OK. Now, R objects have properties. And sometimes, especially when you're learning about things, it's important to look into all of the properties of an R object and inspect it. And I've written a little sample function that combines a number of ways to inspect the properties of an R object. And I've called this typeInfo(). So if I type this, it will tell me that the function does not exist. Why? Well, it's in my script, but I haven't actually defined it. So simply having something in your script is not enough to let R know what that function is and what it does.
So in order to define the function, we have to take the function name and assign the actual body of the function to this function name. Easiest way to do that, of course, is to select it and then execute it. Now, this is the basic anatomy of how a function works in general. A function has a name. To that name, we assign the keyword function and then a list of arguments in parentheses. So whatever I type as an argument into the parentheses then becomes known locally to the function as the variable x. So if I type typeInfo(5), every mention of x inside the function now becomes 5. If I type typeInfo(pi), every mention of x inside the function then becomes whatever pi is. So this is the way we get values into a function. We pass them through the argument list, very generally. And we give them a name in the argument list and that name then makes this value known inside the function. And so we do a couple of things here. First of all, print(x, digits = 22). This is what we've seen before. And then we cat() the string str, str being shorthand for structure. This is a different kind of print command. cat() doesn't have anything to do with kitties and doggos. It's a short form of concatenate. It comes from very, very old computer usage. But essentially what cat() does is like print. It takes a string and displays it on the console. But print does a little more. It also displays this little thing here. And this says, essentially, x is a vector of length 1 and its first element is this value here. So it tells me a little bit more about this. cat() doesn't print this thing here. It just prints me the value as it is. And print implicitly also adds a new line, which cat() does not. So this just prints str. It has to be typed exactly the way it was defined, upper and lower case included, otherwise R will not find it. Exactly. So R is a case-sensitive language. Brings me very briefly to the topic of variable names. So there would be several ways to type variable names. We could do it this way.
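The anatomy just described can be sketched in a few lines. The function name showValue is hypothetical and only serves the illustration; it is not part of the course script.

```r
# Assign a function definition to a name. x is the argument; inside the
# function body, x stands for whatever value was passed in.
showValue <- function(x) {      # hypothetical name, for illustration only
  print(x)                      # shows the index prefix, e.g. [1] 5
  cat(x, "\n", sep = "")        # shows just the value; the newline is explicit
  invisible(x)                  # return x without printing it again
}

showValue(5)
showValue(pi)

# R is case sensitive: showValue and showvalue are different names.
exists("showValue")
```

Note that the definition has to be executed once before the function can be called. Merely having the text in a script file does not define anything.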
But then sometimes words run into each other and give strange combinations. Some people do it this way, with underscores. This is what we call pothole case. I think it's a good way to separate words, except for the fact that we usually try to limit ourselves to a maximum of 80 characters per line so that everything can be seen on one line of code. The editor doesn't wrap code. So we try to keep things on one line so we don't have to scroll to the right hand side. In order to understand what code we've just written, it's actually quite important that we try to fit everything on a single screen. That's also a good golden rule for how much information should be in one block of code when you write functions and programs. It should never be more than what you can see at one time. If it's much more than that, you really ought to break it up into separate functions. People have run experiments and have demonstrated that code understanding drops dramatically if functions extend over more than one page. So what this kind of pothole case does is it uses up one extra character for every word break. And especially for long variable names, it becomes a lot more difficult to keep things on a single line. So that's why I personally avoid it. Some people use this case here, with periods, and many R functions actually do use this. That's perfectly legal syntax. However, for particular types of R objects, periods like that have a functional significance. And then it's very, very unfortunate that you can't distinguish whether this period actually has a meaning in terms of the class of an object or whether it is simply a period that has been used to separate two words in a variable name. And I think that's very unfortunate. So I personally prefer, in my projects, to usually require this so-called camel case. It's camel case because it has humps in the middle. So this is why this is called typeInfo. It separates the words. It's a compromise between readability and utility. OK, so the cat() of str simply prints str.
And the function str() gives information about objects. And that's superbly useful because sometimes you need to look into an object and say, is this a data frame? Or is it a list? Or is it a vector? Or a matrix? And it's important to keep these things apart. And if it's one or the other or something else, what's inside? And the function str(), or structure, allows us to look inside. So in this case, str() says this is a numeric variable. And it has a value that is close to 0.333. Now, there are three other descriptors. R objects can have modes, types, and classes. In this case here, x is of mode numeric, of type double (it could also be integer), and of class numeric. Mode and typeof are things like numeric and double and integer. Class is very often used for special R objects. So for example, some packages that do very involved analysis will have result objects that have their own class. And at the same time, they will have an addition or an extension to the print function that will handle that class. And that will, for example, print you, say, a cluster diagram of a clustering that you've just computed, simply because the output object has been labeled with the appropriate class and there is an extended method for the print function. So defining the class of an object helps customize the behavior of R for particular objects. And then there can be attributes. So objects can have attributes. In this case, x does not have attributes. OK, I'll get to that in a moment. So x has no attributes, so the result of attributes(x) is the value NULL, which is an empty object. So the result of is.null(attributes(x)) is TRUE. And this conditional statement then says, if it's not NULL, then print the attributes. But in this case, it is NULL, so we don't print the attributes. So this n here, backslash n, is a new line character. So basically, it's a carriage return. The backslash in a string is an escape character.
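For readers following along without the course files, here is a reconstruction of what a function like typeInfo() might look like, pieced together from the description above. It is a sketch under that assumption and not the actual function from the course script; the labels and layout of the output are guesses.

```r
# Reconstruction of an object-inspection helper along the lines described.
typeInfo <- function(x) {
  print(x, digits = 22)             # the value, with as many digits as print allows
  cat("str:    ")
  str(x)                            # compact structure summary
  cat("mode:   ", mode(x), "\n")
  cat("typeof: ", typeof(x), "\n")
  cat("class:  ", class(x), "\n")
  if (!is.null(attributes(x))) {    # only objects with attributes get this part
    cat("attributes:\n")
    print(attributes(x))
  }
  invisible(NULL)                   # called for its printed output only
}

typeInfo(1/3)          # numeric, double, numeric; no attributes
typeInfo(c(a = 1))     # a named vector has a names attribute
```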
So the backslash in a string means this is not the letter n, but in this case, it's the new line character. There's a number of different ways of specifying useful things in strings with escape characters. Backslash t for tabs and backslash n for new lines are the most important; backslash s for white space is something you will meet in regular expressions. OK, so this is what typeInfo() does. It basically has a number of ways to look at the properties of an R object. So that's what it does. It tells us the value, the structure, the mode, the type, and so on, and no attributes. That's a useful utility. We'll look at a few objects from time to time because we basically need to inspect them. So it would be nice if R would start up having a function like that, which doesn't yet exist. So can we customize R to load this function whenever we start it up? The answer, obviously, is yes. But how? Where do we save it so this becomes available whenever we start up R? How can we achieve that? What do you think? What does Google think? What happens when R starts up? Or what would be a good Google search for that? What would you type? How to load scripts on R startup? Yep, that should work. Notice you're making an implicit assumption about how startup is done, that it actually has to do with loading a script. I think that's a safe assumption. But it's an assumption. I would perhaps just try "customize R startup". And probably the answer is you have to load a script. Everybody got that? So what do you find? What suggestions? Anything? Initialization file? Where? What? How? It is Rprofile.site. Can you give everybody a little bit more context? Oh, sorry. So I see that when R starts, it sources from this Rprofile.site. OK. Do we have an .Rprofile? Apparently we do. So there are several ways and several levels of R profiles. There's a site-wide profile, Rprofile.site, and there's a user-level one, which I think is called .Rprofile, which is in your home directory, and which R runs whenever it starts up.
In addition, RStudio runs a file called .Rprofile in its project directories. So these are project-specific R profiles. So it doesn't automatically run an .Rprofile in just any directory. But if you load a project in the project directory, RStudio will run the .Rprofile file it finds there. So let's have a look inside. What does it do? So this is how it works. This file contains the definition of the init() function, which we encountered. This is where it resides. And then there's a block here with a lot of cat() commands. So if we execute that, we get this line and welcome and "type init() to begin" and so on, which you encountered on startup. So apparently, this file is what ran when we first loaded our project. It defined the init() function by sourcing the file init.R, which we've already had a small look at. And then it printed out the welcome message. It didn't show the command lines, like the cat() or whatever. It just showed the output. Exactly. It shows the output. Because this is a script file, and it's sourced. So its content is invisible. But the result of what it does is being shown. OK. So this is where our typeInfo() function should go. Now there's a little script called function template. And that's a useful way to write functions. So you call your function by name. You describe essentially what it does. You add to-dos. Usually you write it and then you come up with good ideas on how you might want to maintain this. Add notes for posterity. Give the function a name. Have a little header in your functions that describes the purpose and the version and the date and the author so that you can remind yourself of who did this, when, and why. In the context of a laboratory, things like that become more and more important. This is all part of reproducible research. And having an attitude towards excellence in these things will go a long way in making the work with the programming language enjoyable. Then a section that explains the parameters.
So whatever parameters go in there, you explain what they are. And then a section that explains the value, i.e. the return value of the function. So now in order to make a little typeInfo() function that we source during startup, what I do here is I check this box next to the function template. And I tell RStudio to copy this. And I call it typeInfo.R. And I close my function template. Now, all of the files here are sorted alphabetically, except for files that you add during a session. Files that you add during a session will appear at the end. If you restart the session, typeInfo.R will be found under T. But when you first create it, it will be at the end of the file list. So I click on typeInfo.R, and I can now edit it. So my version is going to look like this. typeInfo.R. Purpose: inspect objects. To do: limit output for large objects. You can see that I know something that you don't know. The function, of course, is called typeInfo. The variable was called x. I can just take this here and paste it into the code. Purpose: inspect objects. Version: if nothing else, 1.0. Author: that's me. Parameters: x, and no parameter b. Value: is there a return value? I don't think so. So I just say return value: none, invoked for its side effects of printing information. That's it. And then it has that. And this line. This is a line I always put at the end of my script files. Why? Because sometimes I copy and paste things and put them into emails and send them around. If my END line is not included, I can then suspect that I made an error copying and the thing is incomplete. Or whoever gets it knows that there might be something wrong. But obviously, that will only work if I religiously have that END line in all of my R scripts. So I always put that there. It's not part of the language. It's just a little bit of convention that makes it easier to catch some errors that are otherwise difficult to troubleshoot. OK, so when that's done, I save it. And the saved file has all this information. Is that enough?
Will typeInfo() now be known the next time the R session starts up? You're right, it won't. So what else do we need to do? Exactly. We need to source it in .Rprofile. So we go to .Rprofile. Now here's one thing. .Rprofile, if you notice, is a file name with a dot at the beginning. Not all operating systems show files with dots at the beginning all of the time. Windows, Mac computers, and Linux computers hide them by default. On Linux computers, you can list them with a special directory command. On Mac computers, you can set a global option to always show files with a dot at the beginning. On Windows, I think it's rather involved. So it can be a hassle to actually find these so-called hidden files. One of the easiest ways to find a hidden file is to simply use RStudio, because RStudio displays all the files, whether they're hidden or not, in its Files tab. So you can easily see that there's a file called .Rprofile, and you can easily open it, and you can easily edit it. And especially on Windows computers, this can otherwise be quite involved. I don't know, can I? I can't. Mine is as bright as possible. What can I do? OK, during coffee break, we can experiment with different color schemes. Maybe I can use a different editor background color scheme, zoom in even more. OK, there's a limit to that, because I get less screen real estate. But I think this is still doable. Let me see. Yeah, I think we can work with that. Yeah, that should be OK. OK, so source typeInfo.R, and save it. And I think that's all we should have to do. So let's see whether that worked. Any changes I made in here should not be saved, so I simply close this window, don't save, and I quit RStudio, and I restart RStudio. And I don't have to re-download everything from GitHub, even though I could. But if I want to re-enter a project, I go File, Recent Projects, RIntro. Not recent files, but recent projects, and not new project. That's what we did before. But we opened this as a recent project, and there we go.
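As a sketch of the step just described, the lines added to a project-level .Rprofile might look like the following. The file name typeInfo.R is the one used in the session; the guard with file.exists() is an addition of this sketch, so that startup does not fail if the file is missing.

```r
# Sketch of lines for a project-level .Rprofile. R runs this file at startup
# when it is present in the working directory, which for an RStudio project
# is the project directory.
if (file.exists("typeInfo.R")) {
  source("typeInfo.R")          # defines typeInfo() for this project
}

# Hidden files such as .Rprofile can also be listed from within R.
list.files(all.files = TRUE, pattern = "^\\.Rprofile$")
```

Because the file is sourced, its commands are not echoed at startup; only output that the commands themselves produce, such as a cat() message, appears on the console.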
And it says, welcome, type init() to begin. But lo and behold, I have typeInfo() here with all of the things that I just edited. OK, right, that is only for that specific project. OK, so how do I do things that I always want to happen in all of my projects? First of all, it's a good question whether that's a good idea at all. The more customization you have on your computer that slightly varies from what other people are doing, the more likely you're going to run into inconsistencies when you discuss things or have questions or answer questions. Software engineering and writing software is all about making implicit knowledge explicit. In that vein, I think it's a good idea to keep everything that belongs to a project local in the project directory. Otherwise, you would be sharing code with your collaborators, but you have an initializing function that they don't have, and then things immediately become difficult. So to keep things consistent, it's important to keep things local. Now, that doesn't answer your question. It just warns you that it might not be a good idea. And there's two ways to do this. One is using the global R profile, which I believe is in the home directory on Mac computers, and it might be in the C colon backslash programs directory on Windows computers. But Google will tell you exactly where that is. The other option is something in init(). No, oh, I took that out here. If I use an init() function like that, something that I often put into it is to source a file which I call utilities.R. And that script file utilities.R would contain general project-specific context. So things like typeInfo() or loading a particular set of Bioconductor packages or otherwise customizing startup. But in order to keep things local, it's a good idea to keep them in your project's .Rprofile and source things there. It's then the easiest to see exactly what's been going on. OK. Now discussing data types, which we'll do more of after the coffee break.
I would just like to mention a particular kind of syntax. So you've seen this. This is an assignment. I assigned the value 5 to x. But if I put the whole statement in parentheses, I get this. So putting this statement in parentheses executes it. And then since what is in the parentheses is an R expression, it also prints the value of that expression. The value of that expression is basically the value of x at that point, which is 5. So putting the assignment of 5 to x into parentheses is exactly the same thing as assigning and then printing x. Only I put it all on one line. So sometimes I use that syntax. I think I should use it more consistently. It's more explicit to then type x. But if you find an expression which is on its own in parentheses, this is what it means. Take the contents and show me what the result of that contents is. Now we can have a look at what that object is by looking at its length. length() is an R function which gives you the number of elements in an object that contains one or more elements. In this case, the length of x is 1 because it's just one number, and we can use typeInfo() on x, which shows the value, which is 5, and the structure: it's a numeric with the value 5, mode numeric, typeof double, and class numeric. OK, now let's do the same thing for this construct here, which you find frequently in R scripts, 5L. Now, why it's L has historic reasons that I'm not even going to go into. It's a long integer, but this is also 5. It also has length 1, but it's slightly different. So the first 5 was actually a double precision floating point number. So it's 5.000000. This one is a long integer. So it's 5, exactly that, only 5. And integers and floating point numbers are encoded somewhat differently as digital numbers. So we can distinguish 5 and 5L by using the typeInfo() function to look into the internals of our object. So this is an integer, and this is a floating point number. Similarly, if we assign TRUE to x, it's a logical value.
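The parenthesis syntax and the difference between 5 and 5L can be checked directly with base R functions. This is a minimal sketch that uses typeof(), mode(), and class() in place of the course's typeInfo() helper.

```r
(x <- 5)           # assigns 5 to x and prints the value in one line
length(x)          # 1: a single number is a vector of length one

typeof(5)          # "double": a plain 5 is a floating point number
typeof(5L)         # "integer": the L suffix requests an integer
mode(5L)           # "numeric": the mode is numeric in both cases
class(5L)          # "integer"

5 == 5L            # TRUE: the values compare as equal
identical(5, 5L)   # FALSE: the objects differ in type

typeof(TRUE)       # "logical"
```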
Mode, typeof, and class are all the same, logical, and the value itself is TRUE. Now, there are a number of special strings in R that include the following. So these special strings are language keywords. They are recognized as naked strings as they are. Don't put them into quotation marks. You can put them into quotation marks. Of course, you can put everything into quotation marks, but then it gets a different meaning. The string "TRUE" is not the same as the logical value TRUE. So typeInfo(TRUE) is a logical, but typeInfo("TRUE") is a character. It has no particular relationship to logical. So these keywords do not go into quotation marks. TRUE and FALSE, NA for not available, NULL, which is simply an empty object, Inf for infinite, -Inf, and NaN, which means not a number. So for example, log(-1) is NaN. And it also throws a warning message because it assumes correctly that under normal circumstances, we would not intentionally produce NaNs. There was a question. Somebody had a question. She's having an error with the result bit of your function. Sorry, I've got an error that the result function is not found. So could you show where in your typeInfo() function the result bit is? The typeInfo() function? Show results. There was a return statement in the function template, but that needs to be removed because there is nothing to return in this case. Sorry, that was sleight-of-hand editing. OK, so we'll tackle these tasks. I'd like to know what are the types of these special keywords, logical or numeric, or what are they? And in particular regarding NA, what happens if we cast it as numeric, as integer, and as character, i.e., take whatever the contents is at some point and then convert it into a numeric, or logical, or character, or integer. We'll do that after the coffee break. And I think it's a good time, 10:30. And in my little brain, when do we start again? At 11? Half an hour of coffee break. Yay. Oh, and we'll take the picture now.
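For reference, the types of the special keywords listed above can be checked with typeof(). This is a sketch of one way to approach the task; the course may go through it differently after the break.

```r
typeof(TRUE)     # "logical"
typeof("TRUE")   # "character": in quotation marks it is just a string
typeof(NA)       # "logical"
typeof(NULL)     # "NULL"
typeof(Inf)      # "double"
typeof(-Inf)     # "double"
typeof(NaN)      # "double"

# log(-1) gives NaN and normally warns "NaNs produced";
# suppressWarnings() is used here only to keep the output quiet.
y <- suppressWarnings(log(-1))
is.nan(y)        # TRUE
```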
For our picture now, so again, remember, if you checked off "do not share my picture with anyone", don't take part in the picture. Otherwise, we're meeting to take our picture, and then you guys can have your break. OK, so heading on over to the stairs. And I should maybe get a new version. But does your version have pause recording? Or just stop recording? Red film reel. The pauses? The red film reel. I should have a pause.
Sometimes we have to go through a bit of dry stuff. And I think that the driest stuff is kind of almost over. I'm not going to go through all of these in very much detail, except for one thing, NA, not available. You have to deal with NA very, very often if you import and use and work with actual data. So not available is how values are labeled for which no value was given during the import of a data set. Now, if I simply do typeInfo(NA), it assumes NA is a logical. So this is neither true nor false, but not given. But there are several functions in R through which I can cast the type of an object to a different type. And there's a hierarchy of things that can be cast into each other. An integer can be cast into a floating point number. And a floating point number can be cast into a string. It's obvious how that is done. But it's not generally possible to cast a string into a number, not in general. Sometimes it is possible. So these are casting functions, as.numeric(), as.integer(), as.character(). I think the name tells exactly what each one does. Note that in this case, the names have the dots in them. So typeInfo(as.numeric(NA)) tells us that it's no longer logical. It's now numeric. It's still NA, but now numeric. Mode numeric, typeof double, and class numeric. Or with as.integer(): mode is still numeric, but now typeof and class are integer. Or with as.character(), everything is character. And note that NA as character is not the same thing as this here. This is actually the string "NA". And it's not the same thing as NA in R. Nor is this here. Nor is this here.
None of these are NA. Only this is. OK. Now, why would it be useful to have different kinds of NA? Why can't it all be logical? Any guess? Because your data is of a certain type, and you're going to perform functions on that data that expect it all to be the same type? Exactly. Exactly right. So whenever we have collections of data, for example, that we put into vectors, the key rule about vectors, and incidentally about matrices, is that all elements must have the same type. If an element of a different type is added to a vector, then the type is cast to the most general type that will accommodate all of them. And that happens silently. So if we have a vector like 1, 1, 2, 3, 5, 8, and then add an NA to it, and NA is logical, that NA would automatically be coerced to a numeric value. Or if that NA would happen to be a character, then everything would be converted into character. But in this case, since NA can have different instances with different types, NA can be incorporated into all of these vectors. So just for an example, let's look at this. This, oops. No, come on. So here I have a string which happens to be the string "NA", not the NA value. And I put it into a vector. The result of that is that my numbers 1, 2, and 3 are automatically converted into the strings "1", "2", and "3". And then the string "NA". However, if I do this, they nicely remain numbers. So here I have a vector of three logicals and one unavailable value, NA. What happens if I convert that NA into a string? Will it? Try. So the easiest way to try is to use the up arrow key on the console if you've typed this before. Then double click NA to select it. And then simply type a quotation mark. So in the RStudio editor, there are a couple of enclosing characters. And these include square brackets, round brackets, double quotation marks, single quotation marks. And if you select something and then type, say, a double quotation mark, the quotation mark doesn't replace the selection.
But it encloses the selection, which is very convenient. When you did the one that was just c(1, 2, 3, NA) with no quotes, you said that it assigns the NA to whatever type makes everyone fit. So would it not assign NA to a numeric? Because there are other ones here? It does. But I did that, and then I typed typeInfo(NA), and it said NA was still logical. OK, let's look at this first. So what happens now is that TRUE, FALSE, and TRUE are converted into strings. And that is unfortunate, because FALSE used to be a logical FALSE, and in many programming languages any string other than the empty string counts as true. So we're getting into trouble here. OK, now Brandon, right? This here. You said NA is still logical. Well, yes, in general, but not in this case. How do we know? Well, let's have a look. Let's assign this thing here. What's the difference? So when I did it, I just did the c(1, 2, 3, NA), and then right after that I did typeInfo(NA). Right, because you're not changing the global value of NA. You're just changing the characteristics of that particular NA. Don't think of NA as a variable name. NA is not a variable name. It's a placeholder. So the element here does not have the value 4, but NA. And in this case, because it's in the vector, and in the vector all of the elements have to have the same type, NA is a numeric. Generically, if you look at only NA by itself and it's not associated with anything, its type is logical. But you can see that in this case here, as part of the vector, we have a vector of four numeric values, and the whole vector has mode numeric and typeof double. And we could do the same thing with integers. And then the whole thing becomes an integer. So I think, again, the key is NA is not a variable name. It's a placeholder for something that's labeled as missing data of some kind. But it could have any value. OK. Let's recapitulate about vectors.
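The casting and coercion behaviour discussed in this exchange can be summarized in a short sketch. The variable names v1 to v4 are illustrative and do not come from the course script.

```r
# NA on its own is logical, but it can be cast to other types.
typeof(NA)                  # "logical"
typeof(as.numeric(NA))      # "double"
typeof(as.integer(NA))      # "integer"
typeof(as.character(NA))    # "character"

is.na(as.character(NA))     # TRUE:  still a missing value
is.na("NA")                 # FALSE: the string "NA" is ordinary text

# Inside a vector, all elements are coerced to one common type.
v1 <- c(1, 2, 3, "NA")             # character: "1" "2" "3" "NA"
v2 <- c(1, 2, 3, NA)               # double; the NA becomes a numeric NA
v3 <- c(1L, 2L, 3L, NA)            # integer
v4 <- c(TRUE, FALSE, TRUE, "NA")   # character: "TRUE" "FALSE" "TRUE" "NA"

typeof(v2[4])               # "double": this particular NA is numeric
typeof(NA)                  # still "logical": NA itself has not changed
```

The last two lines mirror the question from the audience: the NA inside the vector has taken on the vector's type, while a fresh NA typed at the console is logical as before.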
We produce vectors, for example, by explicitly specifying the elements in that vector. You've seen a few examples in my typing. That could be numbers. That could be logicals. That could be characters. And we produce the vector with the c() function. c is also for concatenate, or think of it as combine. So it combines the values within the parentheses into a vector. And we can assign it. And now, our variable v is a vector that has length six. The variables we used before all had length one. Variables of length one we also refer to as scalars. But this has length six, so it's a true vector. And if I want to add something to that vector, I can just dynamically extend the length of the vector by using the c() function again. And this time, I'm combining the vector v, which I have defined before, with two additional numbers. So vectors can grow and shrink dynamically in R as needed. We'll probably meet a few examples. So there's a number of ways we can produce vectors. The simplest and most obvious one is the range operator, so the colon. 1:3 gives me a vector of three elements. OK? Now, can you type a range of numbers that go from 7 to 15? What about from 15 to 7? What about from minus 3 to 3? So the range operator is a convenient and quick notation to create sequences of integers. We often need such ranges of integers when we iterate through a loop. So when we write program code that repeats a particular expression many, many times over. This is where we use the range operator most frequently. Just to mention this function, there's also a function called seq_along(), which builds a sequence of numbers along the elements of a particular variable. So remember, we had defined the vector v to contain eight numbers. So what does seq_along() do? seq_along(v). Right. It gives numbers from 1 to 8. And these numbers from 1 to 8 are the indices of the elements in that vector.
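The vector-building steps above can be sketched as follows. The six starting values are the ones mentioned earlier in the session; the two numbers appended to them (13 and 21) are an assumption of this sketch, since the transcript does not say which values were added.

```r
v <- c(1, 1, 2, 3, 5, 8)   # combine six values into a vector
length(v)                  # 6

v <- c(v, 13, 21)          # extend v by two more numbers (assumed values)
length(v)                  # 8

1:3                        # 1 2 3
7:15                       # ascending range
15:7                       # the range operator also counts down
-3:3                       # from -3 to 3
typeof(1:3)                # "integer"

seq_along(v)               # 1 2 3 4 5 6 7 8: the indices of v
```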
And we'll get more of that, but if you've gone through the introductory tutorial, you might remember that we can say something like v[3] to get the number 2, or v[4] to get the number 3, and so on. So return the third element of the vector, return the fourth element of the vector, and so on. So seq_along() gives me the indices of all the elements in the vector. And yeah, we'll discuss later why that's important. Just keep it in mind, all the elements in the vector. How do we get all the elements themselves? Just write v. But obviously, from what I've just said, indexing v with seq_along(v) gives me the same result, too. And I can get the numbers in different orders. So for example, I can extract them forward and backward and select them more than once, and so on. OK. There's another very useful function. That's seq(), one that we use a lot when we create sequences of numbers that are not integers. We might need them to create a variable that we plot against, for example. And you can, for example, do something like seq(-0.5, 0.5, by = 0.1): begin at minus 0.5, end at 0.5, and use an interval of 0.1. So it counts up with that distance and creates the sequence of numbers as a vector. I think you can immediately see how that might be useful for a plot. Yeah? Why was there a need to write c() inside the brackets? So if I didn't do that, R would interpret this as indexing into a four-dimensional array. So if I have a two-dimensional array, a matrix, and I write m[1, 1], I'm addressing the top left element. So without the c(), this would be read as a four-dimensional array. With it, it says, say, the third element of that vector. OK, there's a variant of seq() that allows you to define, through a different option, an output vector of a predetermined length. So let's have a task. Create a vector from 9 to 12.7 with exactly 21 elements. How do you start working on a problem like that? A question like, create a vector from 9 to 12.7 with exactly 21 elements using seq(). I'm not sure what Google would say to that. It might not be that useful. Exactly.
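Here is a small sketch of the indexing and seq() usage just discussed. The vector v holds the six starting values used earlier; the matrix m is a hypothetical example added to show why c() is needed inside the brackets.

```r
v <- c(1, 1, 2, 3, 5, 8)

v[3]                  # 2: the third element
v[4]                  # 3: the fourth element
v[seq_along(v)]       # all elements, the same as v
v[length(v):1]        # backward
v[c(1, 1, 6, 6)]      # elements can be selected more than once

# Several indices go into ONE vector, built with c().
v[c(1, 3, 5)]         # first, third and fifth element
# v[1, 3, 5] would instead be read as indices into three dimensions.

m <- matrix(1:4, nrow = 2)   # hypothetical 2 x 2 matrix
m[1, 1]                      # row 1, column 1: the top left element

s <- seq(-0.5, 0.5, by = 0.1)   # from -0.5 to 0.5 in steps of 0.1
length(s)                       # 11
```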
So the first thing we can do is check the options of the seq() function. So I hover over seq. OK, it tells me, oh, the options are too many to mention. OK, let's have a look here. Sequence generation. So in the most common usage here, there seems to be a version that has an option length.out. Maybe that's the one we need. And if it is, what should we write? seq(9, 12.7, length.out = 21). That looks good to me. Is there a difference between length.out and just length? No. Why? That's because R supports argument abbreviation. So you are allowed to abbreviate arguments to the shortest unique combination of characters. Now that I've said that, forget it again. Don't do it. So that's why that works. But I think it's better to explicitly write what the actual argument name is. It can be very confusing otherwise. Why did it print so few digits? I think it makes a compromise between precision and the amount of ink it puts on the screen as a result of a single command. So it tries to automatically guess what the best representation is. So for example, I could assign this to some variable and then print it with 10 digits. That's odd. Well, maybe that's an exact solution here. OK. So I should have used even more odd numbers. So in that case, the answer would be, it does it because everything else would be trailing zeros. I defined a variable x as 3.7 divided by 21, and then took the sequence from 9 to 12.7 by x. It's a little different from the output that uses length.out. Yeah. But I'm sure that this is correct. It has 21 elements. As far as I can see, it starts at 9 and ends at 12.7. And if we plot it, we'll see that the intervals are equal. OK. So these are some of the ways to produce vectors: explicitly by using the c() function, or as the result of the colon operator, or as the result of seq_along(), or as the result of the sequence generation function seq(). And one more thing is rep() for repeat. So rep() creates a vector where it repeats its first argument n times. So this just says, ha, ha, ha.
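The task can be sketched like this. One detail is worth spelling out: 21 elements are separated by 20 intervals, so the step is 3.7 / 20 = 0.185, not 3.7 / 21. That is an observation of this sketch, not something stated in the session.

```r
s <- seq(9, 12.7, length.out = 21)   # exactly 21 elements from 9 to 12.7
length(s)                            # 21
s[1]                                 # 9
s[21]                                # 12.7
diff(s)                              # 20 equal steps of 0.185

# The same sequence built from the step size.
s2 <- seq(9, 12.7, by = 3.7 / 20)

# Abbreviated argument names work through partial matching,
# but writing length.out in full is easier to read.
s3 <- seq(9, 12.7, length = 21)

rep("ha", 3)                         # "ha" "ha" "ha"
```

Dividing by 21 instead gives a smaller step, and the resulting sequence stops short of 12.7, which would explain a result that differs from the length.out version.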
Now, with all that conceptual introduction, let's look at some biology. So this is a vector which now contains gene names: Spic, Cebpb, Lyz2, Sfpi1, and Nfkbiz, the NF-kappa-B inhibitor zeta. And they happen to be markers for monocytes. Why is that interesting? When I was looking for a paper, for some data that we could be working with in this kind of workshop, I came across a paper by Jaitin et al., I think 2012 or 2014, on single-cell RNA-seq analysis. And I've put the paper into a zip file. And the zip file is password protected, so I'm technically not distributing copyrighted material outside this course. But the password simply is lowercase cvw. So I'd like you to unpack the paper and unpack the supplementary material. So let me say a few things about that paper. It's relatively recent. I've more or less randomly chosen it for the type of data it uses and the type of questions it raises. So: Jaitin, from Ido Amit's laboratory, Science 2014, "Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types". Just looking at that slide, what is this about? What do they do? Marker-free decomposition of tissues into cell types. It's separating the cell signatures. Exactly. So by first principles, it's just looking at the data set. So they are taking tissues and they're taking cells from the tissues. And then they're measuring something about the cells. And according to the results of the measurements, they hope that they are able to find cell types. So: taking the tissue and then determining what cells actually are members of that tissue. And they do that only from looking at the features of the cells and not from any of the well-known and well-described molecular markers like cell surface antigens. And the way they're doing this is by using single-cell RNA-seq, i.e. they're basically decomposing their tissue into single cells and then using RNA-seq on every single cell.
And they are doing that massively parallel: this approach can be scaled up, but it's already a pretty large number of single-cell RNA-seq experiments. Now, this is not trivial, because what makes a cell gain its identity? That's actually a pretty deep question. How does a cell know it's a fibroblast? And how does another cell know it's a monocyte? And that's a combination. Well, we used to say that's all in the gene expression profiles. Nowadays we know that yes, that is true, but it's also true that the nature of the gene expression profiles is very much determined by epigenetic modifications. So it's kind of a mix of both: the current state of gene expression, giving stable expression feedback loops, and also epigenetic modifications imprinted on the genome that determine which genes will be expressed at what level. But it's basically a hypothesis that simply based on the expression profiles, we should be able to distinguish cell types, i.e. that the way a cell expresses its RNA is characteristic for what type of cell it is. Or conversely, there's actually no difference between the cell type and the expression profile. It's the same thing. And that actually was an assumption. I think it was a fair assumption to make, but it hadn't really been shown. Because in order to show it, you need to do single-cell RNA-seq experiments and compare cells. So what they did is actually put this to the test. They looked at spleen tissue, decomposed the spleen tissue into individual cells, and then looked at RNA-seq expression patterns. So the way this works is you take the spleen, you harvest individual cells, and you put them into 384-well microtiter plates, and you lyse them, and you barcode them, i.e. every single well gets a little DNA sequence that is characteristic of that particular well. And after that barcoding is done, all the mRNA from one well will have the same barcode.
So then you can pool it, throw it all together, and then you throw it into the sequencer. And the sequencer will find sequences that have barcodes from this well and barcodes from that well and so on. And then you can pull it apart again by well or, by implication, by cell. Now, if you've diluted the cells sufficiently, then statistically it's a rare occurrence that two cells ever make it into the same well. At that point, all of the RNA-seq reads from one well originate from exactly one cell, so you have a single-cell experiment. And then you run it through the analytical pipeline. And as a result, you find that the expression profiles that you compare across the cells cluster. So if you compare profiles against profiles, some of them will be similar to others. So you will have clusters of cells that kind of look the same. And that's not trivial. Basically, you could also have a situation where there's a range of similarity that goes from one extreme to another extreme in many, many dimensions of differently expressed genes. But that's not the case. There's not a continuum of expression profiles. Cells actually adopt these metastable states where all cells in that cohort, or in that cell type, will have similar expression profiles. And that's demonstrated by finding such clusters when you compare expression profiles against expression profiles. And they then went on and used lipopolysaccharide, LPS. Lipopolysaccharide is a potent stimulus of the innate immune response. So if you throw that onto hematopoietic cells, they have characteristic responses. So you can then look at the clusters and see how they respond to the stimulus, and then make inferences about what cell types they could be from. So there's a cluster that's characteristic of B cells, for example, that has high expression of CD19, CD22, CD37, and so on under all conditions, before and after LPS stimulation.
Or there's a cluster which is consistent with macrophages, with a group of genes that are not significantly expressed under resting conditions but are induced to expression under LPS stimulation, and so on. So this is a result of data analysis, of clustering. And you can then throw all of these multi-dimensional data together, and in a particular projection you can find clusters and groupings of cells. So for example, if you do flow cytometry and identify B cells by cell surface markers like CD19 or B220, you will find that the B cells that are identified with flow cytometry cluster, by their expression profiles, into this part of this multi-dimensional separation map. Natural killer cells cluster down here. Monocytes cluster over here. Plasmacytoid dendritic cells cluster up here. So indeed, our known markers correlate with this multi-dimensional analysis of different expression profiles. So to play around with that data and to understand it, what we need to do is download the supplementary data that the authors have posted, then get that into R, and then play around with the numbers. And that's basically "how to work with R" 101. This is going to be your daily bread for data analysis: take a data set from somewhere, load it into R, and then start exploring it. And that's what we'll do right now. So what does this look like? As one example, we are going to look at characteristic genes from Figure 3. So this is Figure 3. Or sorry, Figure 4B. Maybe I mistyped it. Let me open this first. So this is the paper. Right. So there's a list of genes here. How do I get the names of these genes into R? Unfortunately, this is an image, so we can't copy and paste it. Images are dead ends. So if you ever publish your work as an image, you have to be aware that nobody can then reasonably work with it. Data that is published as textual material is easy to work with and easy to import. But images are difficult.
So in this case, I ended up cutting out that image here and then uploading it to an optical character recognition tool on the web. And that optical character recognition tool got some of these words right. And then I ended up editing it by hand until I finally came up with this text file. So this is a text file of these gene names. So you can imagine, after downloading the supplementary material and massaging it in some way, you will end up with a text file of gene names of interest. Now the challenge is, how do we get something like that into R? How do we take this text file and assign it to an R vector where each element is one of these names as a string? What would you do? There are several different options. But let's talk about how you would approach this problem. OK, so I copy and paste this. What do I need to do next? So I put it into parentheses and add a c in front. OK. Now my editor tells me there are problems here. All these red Xs mean there's something going on: "expected ',' after expression" and "no symbol named 'CD19' in scope". So that's two errors. And this is actually important: read error messages. People often just say, ah, it didn't work, and then freak out. But no, this is important. So the first line says: expected comma after expression. Well, we know. We didn't actually put a comma there. Now here's a magic trick. If I hold the Alt key, see how the cursor changes when I hold the Alt key? And I can select more than one line. And now anything that I type will go on all of these lines at the same time. I also might want to put two spaces here to line it up more nicely. So the Alt key in the RStudio editor makes for interesting text editing experiences. OK, so now the Xs have gone away. I no longer have a syntax error, but I have a warning sign. And the warning sign says: no symbol named CD19 in scope. What does that mean? It's kind of a techie way of expressing something that's relatively obvious. So what is this here? Is it a number?
Is it a string? What is it? If it were a number, it shouldn't have letters. If it were a logical, it should only say TRUE or FALSE. If it were a string, it would have quotation marks. So it is none of these. So what is it? It looks like a variable name, because it doesn't have quotation marks. So in order to make it not a variable name but a quoted string, I need to add quotation marks. So again, either double-clicking and then just typing a quote, or using the nifty Alt key. In this case, it doesn't work as well. There we go. So now we have a vector that contains these elements. That is the default, pedestrian, slow, not infinitely scalable solution that, nevertheless, we'll end up using in a large number of cases. There are much more automatable ways, which we'll talk about in a moment. But that also works. So the first possible solution for getting these gene names assigned into a vector would be to do something like this. So for example, I could use the text editor to replace every newline here with a comma and a space and then add quotation marks, or just add them by hand if it's not too much. So that would be one possible solution. OK. One of you mentioned Excel. You did? So what would you do? You could transpose it in Excel and then paste it all into one line. OK, Excel. Hang on, not so fast. Copy, paste. Yes. OK. So don't transpose it. Maybe in column B I would add the first quotation mark to B1 and fill the series on the way down. And in column C, I would put the opening bracket. OK, it gets creative. I'm not sure where this is going. You know what I would do? OK. We'll work with Excel a little later. You would need to somehow get these fields together. And that is not a trivial task. But what we can do is use Excel to save these as, no, strike that. If I save them as comma-separated values or tab-separated values, the result looks exactly like my beginning text file. So never mind.
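The difference between a quoted string and an unquoted symbol, as discussed above, can be sketched like this. Only three gene names are used, and their exact spelling and capitalization here are assumptions for illustration.

```r
# Unquoted, CD19 is looked up as a variable name, which does not exist:
res <- try(c(CD19), silent = TRUE)
inherits(res, "try-error")             # TRUE: object 'CD19' not found

# Quoted and comma-separated, the names become elements of a character vector
genes <- c("CD19",
           "CD79B",
           "CD22")

mode(genes)                            # "character"
length(genes)                          # 3
```

This assumes no variable called CD19 has been defined in the session; if one had been, the unquoted version would silently use its value instead of raising an error.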
I think Excel, at this point, doesn't actually help us. Because the best result we can hope for, after copying and pasting, is a file that looks exactly like this. And it doesn't have the quotation marks yet. So let's try something else. Exactly. So that's what I also often do: simply find and replace in the text file. And it's a bit tricky, because I would need regular expressions and we're not ready for regular expressions yet. But in principle, for example in TextEdit on the Mac, and probably in Notepad too, I can use find and replace to replace newline characters with quotation mark, comma, newline, quotation mark. And that basically then changes the text. Here's another way. So one thing we can do is simply take this and define the whole thing as a string constant. There's some weirdness going on here. There is something wrong with the... So I probably ended up with the wrong kind of line separation characters here. I don't know why that's the case. If I look into my script, at the sample solution for reading text, there's a version where we do that all at once. So basically this way. And that doesn't work either. Something is not as I'm expecting it to be. OK, I need to figure out what's going on there. Anyway, in principle, if you have a long text, you can simply enclose the text in quotation marks and assign it to a variable. And then break it apart at every whitespace, or at every comma, or at every newline, and then work with the elements in this way. I need to figure out why this is not working. It must have something to do with copying and pasting the wrong kind of quotation marks. But again, this is a very manual version and we'll have a lot of cleanup to do. R has a large number of functions that are designed to read data files. And the most basic function is the readLines() function. So the readLines() function takes a file name as an argument, and you can give it more options.
But as the result, it will generate a vector. Oh, maybe I just had a stray quotation mark. I had a stray quotation mark somewhere in there and this was messing it up. OK, so the other version would work too. Anyway, so the first two approaches work, but they are very pedestrian. So a more automated way is this readLines() function. readLines() takes a file name as an argument, and then we assign the result to some variable. And now after doing this, every single gene is its own element, because we have one gene name per line. So readLines() automatically treats all of this as strings and automatically separates it line by line. So for simple input like this one here, where we have one element per line, that's the simplest thing we can do. This produces a vector of strings. So 46 elements, all of mode character. Now very often our spreadsheets don't look like that. Typically spreadsheet data is two-dimensional data, so there would be more than one column. And then readLines() would read entire lines of comma-separated or tab-separated values, and that's not what we would need. So in that case, there's a function read.csv(), or the equivalent for tab-separated values. And that's what we most often use to read data, say from Excel spreadsheets. We'll work more with that a little later on. But read.csv() has a few more options. The first argument is the file name. Then we specify whether our file has a header or not. In this case, we just have plain gene names, so header is FALSE. And we'll discuss stringsAsFactors = FALSE a little later on. That's a bit of a quirk in the language. But if we have data in an Excel spreadsheet, then we can save it as comma-separated values, CSV. And that output of an Excel spreadsheet is exactly the input that we can use in the read.csv() function. So if you have large and complicated spreadsheets, you can save them as either CSV or TSV files and then read them in this way. Excel .xls or .xlsx files would not be read.
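A minimal sketch of both reading functions, assuming a file with one gene name per line. To keep the example self-contained it first writes a small temporary file; the three gene names and the variable names are placeholders, not the workshop's 46-gene file.

```r
# Create a small stand-in file with one gene name per line
geneFile <- tempfile(fileext = ".txt")
writeLines(c("CD19", "CD79B", "CD22"), geneFile)

# readLines(): one vector element per line, all of mode character
genes <- readLines(geneFile)
mode(genes)                    # "character"
length(genes)                  # 3

# read.csv(): the same file becomes a data frame with one column
df <- read.csv(geneFile, header = FALSE, stringsAsFactors = FALSE)
class(df)                      # "data.frame"
colnames(df)                   # "V1"
df$V1                          # the gene names as a character vector

unlink(geneFile)               # remove the temporary file
```

With a real file, the only change would be to pass its path instead of geneFile.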
There's a package that reads them. The package that reads them is pretty good, but not 100% perfect. Do you recommend CSV? Right. So what I usually do is export things as CSV. With one added caveat: if there are numbers in your Excel spreadsheet and you save it as a text file, i.e. CSV, Excel will truncate the numeric precision of your numbers. So what you need to do before that is to convert your numbers into text. Then you will get the full precision. OK. Now the read.csv() command for our little file of gene names looks like this. header is FALSE, stringsAsFactors is FALSE. But now our characteristic genes are not a vector, but a data frame. So mode() is "list", typeof() is "list", class() is "data.frame". We'll discuss that in a little more detail in a moment. And this data frame has attributes, for example row names, which are these numbers here. And it has one column, which it called V1. And that column contains these gene names. So a data frame is a somewhat more complex object. But it also allows us to store data in a more versatile way. As opposed to read.csv()? Yeah, that works too. So the sample solutions, four different ways of taking such string data and reading it into R, are in here. Yeah, let me briefly mention this one. It uses unlist(). So once we've assigned our entire text string to a variable s in this way, this is what that looks like. So CD19, backslash n, which is a newline, CD79B, backslash n, a newline, CD22, and so on. So this is just one string. These strings can be very long. There's no problem with having a string with five million characters. So R is very efficient in managing very large data sets. Now we can split the string apart at the newline characters using a function that is called strsplit(), string split. So strsplit() will split a string according to a pattern. So in this case, strsplit() splits my string s according to the pattern newline character. And this means it takes this entire string.
It breaks it into elements by removing each of the newline characters and assigning what comes before and what comes after to separate elements. So the separators are not themselves included in the elements of the result. So I can use this to split apart comma-separated lists or tab-separated lists or any kind of list that I want to work with and define. Now the result of strsplit() is itself a list. And this is because that argument s actually only contains a single element, i.e. one string. But it could contain more. It could contain a string with 15 names, and then one with two names, and then one with 27 names. And each of these would be split separately, and each of these would end up in a separate list element. But for our purposes, we only have one string. So we don't want that to end up in a list. So we use the unlist() function on that. Usually, if you know that you pass a single string into strsplit(), you always wrap the strsplit() expression in unlist(). And that gives us a vector with the 46 individual names as elements. So from a typing point of view, this is a shorter way. We only need to take the entire list, write the assignment and an opening quotation mark at the beginning, close it off with a closing quotation mark at the end, and then run unlist(strsplit(s, "\n")), i.e. split by newline characters, and assign to the characteristic genes. And then we get this vector, 46 elements of mode character. That's what we wanted. Or scroll where? The sample solution? OK. OK. Now let's work a little more with strings while we're thinking about strings. When we do things like multiple sequence alignments or build phylogenetic trees, we often need labels for our sequence names. And we don't really know which of these sequences are homologues or orthologues or paralogues and how they're related. But something we usually do know is what species they come from.
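The string-splitting approach described above can be sketched as follows. The string s stands in for the pasted 46-line text and holds only three names (placeholders); a text pasted across several lines between two quotation marks yields the same kind of string, with "\n" at each line break.

```r
# One single string; "\n" marks each line break
s <- "CD19\nCD79B\nCD22"
length(s)                          # 1: it is still one string

# strsplit() returns a list with one element per input string
parts <- strsplit(s, "\n")
class(parts)                       # "list"
length(parts)                      # 1

# unlist() flattens that into a plain character vector
genes <- unlist(strsplit(s, "\n"))
genes                              # "CD19" "CD79B" "CD22"

# With several input strings, each is split into its own list element
multi <- strsplit(c("a,b,c", "d,e"), ",")
sapply(multi, length)              # 3 2
```

The last two lines show why the result is a list in the first place: the pieces of different input strings can differ in number, so they cannot share one flat vector without losing track of where they came from.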
So it's often very useful to take the species name and derive a short form from it, a little mnemonic code for the species name, and put that label in front of the sequence in the multiple sequence alignment or at a leaf of the phylogenetic tree. I call these short things biCodes. So these biCodes take a binomial scientific name and produce a five-letter label. If you've ever come across old-style Swiss-Prot sequence names, they usually have codes like that. So for example, if you have Arabidopsis thaliana, the biCode should be ARATH: the first three letters here and the first two letters here. So your challenge is: write a function that takes a string, then takes the first two words from that string, and produces such a biCode, as a small utility function. So how do you go about doing that? Always the same question. You have a small goal, and you know that it's possible, not just because I said so, but because it sounds like something that the computer should be able to do. But how? Now in order to write a function, or to write program code that does this, you'll need to break up that task into individual steps. So what's the first thing that we need to do? Very first step. And then what? The first step is to take such a scientific name and define it as a thing. So: assign the input. So then you have one string in a variable. And then what? Remember, you're trying to get letters from the first word in that string and letters from the second word. Separate the words. Did we come across something that separates words? What do we use as a separator? A space. A space, so strsplit() on a space. And then you could do strsplit() again and separate it into individual characters. Or change the type. You can't change it. I can get the individual characters. So I can split the string into individual characters. And we can try that. It's not the canonical solution, but it works perfectly well. strsplit(), quotation mark, quotation mark. The empty string.
So if I strsplit() on an empty string, it breaks the string apart into characters. Define a second variable, biCode, that takes the first three letters of the first part of the vector and the first two letters of the second part of the vector. Right. So this is what we would do with this second strsplit(). We would then get a vector which has the characters of the first word. And then we can take the first three elements. You can't just pull them out like that if you haven't already split the word into separate characters. But you can also pull them out directly with a different function, which is called substr(). Both work. There are different ways to achieve the same goal. Well, maybe we should try a few options. Maybe we'll just type it out. What's our favorite organism for the day? Canada's 150th is coming up. Huh? Castor, whatever it is? Castor? You know what? I thought it was Castor fiber, but you may be right. Is it not Castor fiber? Can somebody fact-check me here? We're not going with alternative facts. That's the Eurasian one. See, that's where I come from. So it's actually Castor canadensis. This is Castor canadensis. No i in canadensis. They're awesome. OK. So that's our first step here: somehow put this into a string variable. And the second step was to string split it. What should I type? Not yet. Let's just work it line by line and then take everything together and define it as a function. So in order to use the string split function, I just need to type strsplit. And that tells me it expects an argument x, which is the input. And then it expects an argument split, which is the pattern that we want to split on. So in our case, x is actually our variable s. And split would be a blank. Does that do something useful? Well, in my case, it produces a list with one element. And that one element is a vector of length 2, with one word and another word. So that's getting there. So let's assign it. Now, to do something with the first word, what would you type?
Access only the first element. And that would be v[1], with square brackets. Is it? What's going wrong? I assigned the result of the strsplit() to v. And then I took the first element, v[1]. And what it gave me was this here. How is that v[1]? In the Values pane, it says that v is just a List of 1. Yeah, it's just a list of 1. Right, so it's still a list. We didn't do this unlist() thingy. And the list has one element, which is one vector of length 2. So that's what we got instead. OK, we can fix that. Now v is a vector of type character, of length 2. Yay. Now, what do we have to do? Right, we can use substr(). But let's have a look at that other option first, using strsplit() again. What would that look like? So I do the same thing, unlist(strsplit()). Now I use only v[1]. And I split on the empty string. So I get this. Very nice. And for example, I could assign that and then get the first characters in this way. We still have to piece them together, but that's yet another step. I don't know whether we'll be discussing this later on. But I'd just like to show you that here we are assigning the result of unlist(strsplit(v[1], "")) to the variable w. And then we are subsetting the first three elements of that vector. But we don't actually need to assign this. We can take this entire expression, which evaluates to a vector, even before we assign it, and then simply attach the selection to it. So this is also legal syntax. It's the same thing as assigning it to an intermediate variable. But it does the whole thing at once. It's a more compact way of writing things. But since everything is now in one line and executed at once, it becomes harder to read and harder to debug. It's just that if you come across something that looks very odd, like a construction like this, remember: if something is a vector, we can attach the selection operators, the bracket operators, directly to it. We don't need an intermediate variable. OK.
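The steps just walked through can be sketched as follows, using the species name from the session. The variable names vList, v and w are chosen for illustration.

```r
s <- "Castor canadensis"

# Without unlist(), the result is a list holding one vector of two words
vList <- strsplit(s, " ")
class(vList)                   # "list"
vList[1]                       # still a list of length 1
vList[[1]]                     # double brackets extract the character vector

# With unlist(), v is a plain character vector of length 2
v <- unlist(strsplit(s, " "))
v[1]                           # "Castor"

# Splitting on the empty string yields single characters
w <- unlist(strsplit(v[1], ""))
w[1:3]                         # "C" "a" "s"

# The same without an intermediate variable: brackets attach directly
unlist(strsplit(v[1], ""))[1:3]
```

The double-bracket form vList[[1]] is an alternative to unlist() that was not used in the session; it is included only to show what the single-bracket v[1] was missing.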
Now, these are individual characters, and we'll see in a moment how we can piece them together. But there was this substring idea floating around. What could that look like? substr(x, start, stop). OK. That looks reasonable. So what's the substr() expression that we should use here on v[1]? What's our x in this case? v[1], start 1, stop 3. Good. What's the difference between the one we had before and this one? It hasn't been split. It hasn't been split, right? So this substr() gets us one string, whereas the split version gives us three individual letters. Now, how do we get the first two characters of the second word? You can do that. The first two characters of the second word, give me a guess. v[2] is the second word. Start at 1, stop at 2. Good. Getting there, little steps. Now we have to piece them together. Any ideas? Well, we can assign them. Let's follow your suggestions. Call them a and b. And now we need to put them together. The first thing I would think of would be something like c(a, b). But obviously, as expected, this gives me only the two elements. They're not actually put together. Is there something that puts them together? toupper()? That's something for later. Are you looking into the solution code? Because I actually did the preparation for R. Oh, was that in the tutorial? Oh, that was in the tutorial. Oh, so you all know it already. I'm sorry if I'm doing repetitive stuff here. OK, here's a hint. There's a function we've already used. And that's sprintf(). Can you use sprintf()? So this is one possible solution, sprintf(). My format string is "%s%s", which simply says: put a string here and put another string here, with nothing, no space or anything, in between. And my first string is a. And my second string is b. So that's what it looks like. But more generally, we would use a different function. And that function is paste(). paste() is used very frequently. So sprintf() and paste() are kind of similar. We can sometimes use them for similar things.
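A minimal sketch of the substr() and sprintf() steps, assuming v already holds the two words of the species name as in the preceding part of the session.

```r
v <- c("Castor", "canadensis")

# substr() returns one string, not a vector of single letters
a <- substr(v[1], 1, 3)        # "Cas"
b <- substr(v[2], 1, 2)        # "ca"

# c() only collects them as two separate elements
c(a, b)                        # "Cas" "ca"

# sprintf() places both strings into one, with nothing in between
sprintf("%s%s", a, b)          # "Casca"
```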
paste() joins these two strings together, by default with a blank space in the middle. But that blank is something we can turn off. How? With an option. And the option is sep, the separator. So for example, if we have a comma as the separator, we get this. If we have a colon as the separator, we get this. We can use many different things. Or we can just use the empty string. What about cat()? I don't think cat() has a return value. So it would just print this, but not return it from the function. I might be wrong, but I don't think it does. Does cat() have a return value? Well, we can see. So when I assign x <- cat(a), it dutifully prints a, but x now is NULL. So cat() doesn't have a return value. We would basically get the effect, but we couldn't do anything useful with it. So paste(a, b, sep = ""), with the empty string as separator, I think that would be pretty much the canonical solution here. sprintf() works as well. Are we done? Anything else to do? Anastasia. Now comes the function that you've mentioned. Which one was it? toupper(). So the specification called for uppercase. And there are two functions here. One is toupper(). One is tolower(). They convert strings into all uppercase or all lowercase. So that has the functionality that we wanted. Now we need to clean up what we just wrote and then make a function of it. So we don't need this here. We don't need this here. I think this is all we need. Yeah? I'm sorry, the sprintf() function, does that also not have a return value? It has. Yeah. So it can be assigned. OK, so I think these are all the basic commands we need. Now we could combine things. So for example, since we're not actually reusing these variables, I would probably tend to replace this a here and replace this b here, which makes it more compact, but somewhat less readable. So I think it's debatable whether that is even good or not. Maybe for now we'll just leave it like this. More explicit, easier to read. OK, now to make a function from that, we wrap the whole thing into a function.
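The behavior of paste(), cat() and toupper() discussed above can be checked with a few lines. The values of a and b are carried over from the earlier steps.

```r
a <- "Cas"
b <- "ca"

paste(a, b)                    # "Cas ca"  (default separator is a blank)
paste(a, b, sep = ",")         # "Cas,ca"
paste(a, b, sep = ":")         # "Cas:ca"
paste(a, b, sep = "")          # "Casca"

# cat() prints to the console but returns NULL
x <- cat(a, "\n")
is.null(x)                     # TRUE

# sprintf() and paste() both return a value that can be assigned
y <- sprintf("%s%s", a, b)

# Case conversion
toupper(paste(a, b, sep = "")) # "CASCA"
tolower("CASCA")               # "casca"
```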
The function name would probably be biCode. And we assign function. And our input variable is s, right? Because this is the name of the variable that appears here. biCode is function(s). And after that definition, we have to wrap all of this into curly braces. Many people will write their functions in principle like this, because by convention, R returns the last expression that's evaluated inside a function block. I prefer to actually write explicit return statements. And another thing I don't like is the indentation here. So there's a function here, Reindent Lines, which makes sure that the indentation style is maintained. So that's our function: biCode, function of s. unlist(strsplit(s, " ")), i.e. split on a blank. We pull out the first three characters from the first element as a substring, and the first two characters from the second element as a substring. We paste them together again. We convert them to uppercase, and we return the result. OK, what do I need to type to test this? biCode(s). biCode(s) sounds good to me. Nice. Now, you could think over lunch of ways in which this can go wrong. Because we're making a number of assumptions about how our input is structured. And these assumptions may or may not be correct. Can you name some of the assumptions in this code? That there's a space. I think it's a good assumption that there is a space. But there might be only one word, with no space. So you're right. OK, so there might be fewer than two words. So the assumption is that there are at least two words. What's another assumption? Right. The first word has at least three letters. The second word has at least two letters. What's another assumption? I don't know, and I'd actually be curious, whether substr() by default handles Unicode correctly. Plain ASCII characters are only one byte wide. Other Unicode characters can be two to four bytes wide in UTF-8. So whether substr() handles that correctly by default, I don't know. But in principle, you could write your biCodes, your species names, with emoji.
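Put together, the function described above might look like the sketch below. It follows the steps from the session (split on a single blank, substr(), paste(), toupper(), explicit return) and deliberately keeps the unchecked assumptions that were just discussed.

```r
biCode <- function(s) {
  # Assumes: s is one string containing at least two words separated
  # by exactly one blank; the first word has at least three letters,
  # the second at least two.
  v <- unlist(strsplit(s, " "))           # split into words
  a <- substr(v[1], 1, 3)                 # first three letters of word 1
  b <- substr(v[2], 1, 2)                 # first two letters of word 2
  code <- toupper(paste(a, b, sep = ""))  # join and convert to uppercase
  return(code)
}

s <- "Castor canadensis"
biCode(s)                                 # "CASCA"
biCode("Arabidopsis thaliana")            # "ARATH"

# What happens when an assumption is violated: a missing second word
# becomes NA, which paste() turns into the text "NA"
biCode("Castor")                          # "CASNA"
```

The last call shows why the assumptions matter: the function does not fail, it quietly returns a plausible-looking but wrong label.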
But other than that, if they are representable characters on the keyboard, they can all be used in this way. So there's no need to compensate for special characters. One important assumption is still in there that we need to be aware of. We're splitting on a single blank. There might be more than one, or it might be a tab. So there ought to be ways to change this to make it more flexible in terms of how the words are separated. We'll make a very gentle mention of regular expressions to perhaps handle that. So even with this simple code, if you know what you are doing at home and you're responsible for the input yourself, you can get away with that. That's fine. If your colleagues are supposed to work with that function, you have to give a little bit of thought to how they can put nonsense into the function, how that will break your analysis pipeline, and how you can safeguard against that. So it's good if functions like that have safeguarding behavior and basically make their assumptions explicit. At the very least, in the comment in your function header, you should write a little bit about what these assumptions are.
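One way such safeguards could look is sketched below. This is not the workshop's sample solution: the name biCodeSafe, the specific checks and the error message are assumptions made for illustration. It splits on any run of whitespace using a regular expression and stops with an error when the input does not meet the stated assumptions.

```r
biCodeSafe <- function(s) {
  # Hypothetical safeguarded variant of biCode().
  # Assumptions, now checked explicitly:
  #  - s is a single, non-missing character string
  #  - it contains at least two words separated by whitespace
  #    (one or more blanks or tabs)
  #  - word 1 has at least three characters, word 2 at least two
  stopifnot(is.character(s), length(s) == 1, !is.na(s))

  # Trim the ends, then split on any run of whitespace characters
  v <- unlist(strsplit(trimws(s), "[[:space:]]+"))

  if (length(v) < 2 || nchar(v[1]) < 3 || nchar(v[2]) < 2) {
    stop("Expected a binomial name such as 'Castor canadensis'.")
  }

  return(toupper(paste(substr(v[1], 1, 3), substr(v[2], 1, 2), sep = "")))
}

biCodeSafe("Castor canadensis")      # "CASCA"
biCodeSafe("Castor   canadensis")    # several blanks: "CASCA"
biCodeSafe("Castor\tcanadensis")     # a tab: "CASCA"

# Nonsense input now fails loudly instead of returning a wrong label
res <- try(biCodeSafe("Castor"), silent = TRUE)
inherits(res, "try-error")           # TRUE
```

Whether to stop with an error or to return NA with a warning is a design choice; stopping is used here because a silently wrong label is harder to notice further down an analysis pipeline.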