We're at the second project, and I presume you've downloaded it. We go through the same steps of downloading that project, installing it, loading it, and typing init(), and then we load the project's R script file. This is a rather long script, and it will take us through the remainder of the day, where we're going to start working more with functions, and especially discussing how to work with data, how to create data objects and so on, and how to start defining our own functions, with a particular application in mind. That particular application is a paper which is ancient now: two years old.

The worst thing I ever did was to decide to teach bioinformatics. It used to be better about five years ago, when things had sort of settled down. In the last five years, once again, our information half-life is about two years. So every two years I throw out at least half of all my slides, of all the bioinformatics teaching. It's amazing what's going on these days.

So this is one of the modern things that's going on. This paper was published in 2014 in Science: "Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types". What does that even mean? Anybody care to interpret this title? So what's RNA-seq? Expression, exactly. We sequence messenger RNA, and from the counts of our individual reads, our individual sequences, we infer the concentration of mRNA molecules.

When we started teaching this workshop, microarray technology was really new and really amazing, and that's not so long ago. Nowadays almost nobody still uses microarrays, even though we'll be working with microarray data tomorrow and the day after. Nowadays almost everything is RNA-seq, because it's much easier, much more convenient.
You have a higher dynamic range, and so on.

Now, one of the amazing things that people are increasingly doing with RNA-seq is single-cell RNA-seq, and that's actually really important. Working with single cells avoids the problem that our expression profiles are otherwise composites, an average of many cells. Well, sometimes that may be what we're interested in, but it can be very problematic if our sample is a tissue, which is actually composed of many different cell types. Then we actually don't want all of these expression profiles to be averaged together; we would like to decompose them, perhaps to identify which distinguishable cell types we have in our tissue in the first place.

Now, in order to do that, given that we can do single-cell analysis, we also need to do it on a large scale. It's not enough to look at one single cell. That wouldn't help us either; well, in some circumstances it would, but we want a distribution of cells. We want a distribution of individual expression profiles to learn more about what actually makes up that tissue, and ideally we can then associate individual expression profiles with different cell types.

And that's what this paper attempts. It takes splenic tissue. It applies RNA-seq to single cells in a massively parallel fashion, i.e. in an experimental setup where you analyze many hundreds of different individual cells. And then the paper attempts to cluster these cells into different cell types, i.e. from first principles, without any markers that we would use, for example, with flow cytometry, or other knowledge. It tries to distinguish: are there
prototypes of cells? Are there similar groups of cells, similar categories of cells, that we can associate with the notion of a particular cell type?

So this is really amazing. Of course, we go through our knowledge of biology by saying: well, we have different cell types in the body, and they have different behavior, and ultimately the different behavior depends on different expression patterns. But that's something we believe. I think this is an early experiment for actually showing that the different expression patterns are sufficient to divide the cells apart and identify the different categories.

So the authors (and I refer to this by the first author, Diego Adhemar Jaitin; I try to pronounce it "Haitin", not "J-tin", I'm sure this is Spanish) write: "Cellular diversity is thereby approached through inference of variable and dynamic pathway activity rather than a fixed preprogrammed cell-type hierarchy." So this is quite exciting.

Approximately, this is how it works. You take spleen, you mush up the tissue, you isolate individual cells, and you wash these individual cells into a microtiter plate such that, on average, you only have one single cell in each well. Maybe none, but hopefully never more than one. Then you lyse them and you barcode them.
So barcoding is a technology where you take messenger RNA, in this case the poly-A tail of messenger RNA, and you complement that with a complementary sequence that carries a particular barcode. So every single well gets a different label, and that different label then allows you to identify, once you have an RNA-seq read, which well that read came from. In this way, once you have the millions of reads that the typical RNA-seq experiment contributes, you are able to follow the track of that particular read back to the well, or to the individual cell, where it came from.

So this is the massively parallel part: you sequence lots of them at the same time, but you're able to do that in a fashion where you can then pull them apart again. After they're barcoded and reverse transcribed, you throw everything into the same pot, because you don't need to keep them separate anymore. The molecules are then distinguished via their barcode, and you throw that into the analytical pipeline, your typical high-throughput sequencer, which gives you millions and millions of reads. You can then associate the reads back to the cells, and hopefully cluster them and identify the cells.

And it really works. So this is one of the key findings here. Once you start comparing cells against cells, you notice distinctly different expression patterns. Some groups of cells are much more similar to each other than to all other groups of cells. This is shown in this block structure of the cell-against-cell comparison. So the cells in this group here are all more similar to each other than they are to cells outside that group. This is the hallmark of being able to usefully cluster, being able to usefully put a continuum of cells into different categories.

And it doesn't have to look that way. Can you imagine, if all the expression profiles were just randomly distributed, or if the cell identities were not discrete but very, very continuous and slowly
varying across cell types? You simply wouldn't have that block structure. You would have a noisy plot of something, but not that very clear-cut block structure. So that block structure identifies cell types.

And of course, once the cell types are identified, you can then say: well, these cells, we arbitrarily label them as population six in our sample here, but what are they? They come from the spleen, but are they B cells? Are they macrophages? Are they natural killer cells? What can we learn about these cells that allows us to identify where they come from?

Subsequently, the authors were then able to take these cell types, project the expression differences in some way, and correlate that with classical cell sorting. And this is where we find the traditional markers, which would traditionally have been used to identify cells. So we define B cells to be cells that are CD19 positive and B220 positive and TCR-beta (T-cell receptor beta) negative, and that's something we can identify with flow cytometry. When we apply that to this suitably displayed array of different blocks, we see that our B-cell markers CD19 and B220 identify something that has clustered, regarding the expression profiles, in this group one. So this group down here is later on identified to correspond to B cells. Or if we have GR1 positive, CD11b positive, these are things we identify as monocytes, and that corresponds to group 6 here, and so on.

So we can find that, indeed, this clustering, which is basically made without any bias or prior assumptions, corresponds well to our traditional knowledge of what the cell types within the spleen should be. So this is a very satisfying link of our prior knowledge with the high-throughput knowledge. Now, of course, the link isn't perfect. There are transitional states here that are not as easily identified. So ideally, in an experiment like that,
you don't just want to confirm what you already know, but you want to find something new. On the other hand, it shouldn't look completely different from your traditional expectations, but at least have some overlap with them. So that's the ideal situation here.

Now, this is published, and the data is on the public record in the supplementary material at Science. So we can download it and start working with it, and that might be a typical scenario of something you might want to do in your lab: download data from a different publication and start working with it, or take data from your collaborator and start working with it. And there are a couple of questions which you can imagine. Now, working with it can be tricky, because we need to download and prepare the data, and there are a number of tools which might need to be integrated here. So we'll start basically by looking at the data and getting to the point where we can actually start making interpretations and doing some simple analysis.

Now, the paper is here, in this link. Since the paper from Science is of course not publicly available, but is under copyright, I have zipped it up with a password. Enter the password, upon which it gets unpacked, and you can then access and read the paper. Same thing with the supplementary material. And I think that's all we have here. Or the Excel file, where is it? Oh yeah, I didn't zip that up. So the table S3 Excel file is simply the Excel data.

Okay, so with that out of the way, let's start thinking about getting data into R. Of course, the simplest way of getting data into R is simply typing it and assigning it to a variable, and presumably you've all done that in your introductory tutorial. What exactly a piece of data contains can be very variable.
So for example, if I assign the string "1" to the variable x, this variable is now an object that contains a character constant. If I look at the value of x, I see that this is character, because it is enclosed in quotation marks. So a quoted "1" is very different from what I get if, say, I overwrite this and simply use the number 1: see, no quotation marks here. So I get a first idea of what something is by simply looking into the environment and looking at the value.

These things can become quite complicated, though, and for many purposes we need to look a little deeper into the various aspects of our objects in order to work with them productively. There are many ways of looking at the properties of an object. So for example, I've defined x as a single number now. One of the R functions to tell me what it is is mode(). The mode of this element, or this vector, is numeric; so this is a numeric vector. This numeric vector has the type double. By default, basically all numbers in R have the type double, i.e. double-precision floating-point numbers, which basically makes all of the internal R calculations very precise. The class is also numeric. There's a combined function for the structure, str(), and this tells me it's a numeric vector of simply one element. And does it have any attributes? Many objects have attributes, but this one does not, because no attribute is defined. An attribute, for example, would be the names of an R object.

So if I take this and combine all of it into a function, then I have a function which gives me basically a digest of information, various aspects of the information that's available about an R object.
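As a minimal sketch of what such a digest function might look like: the name typeInfo follows the session, but the body and the output layout below are assumptions, not the workshop's actual code. It only combines the base R functions named above (str(), mode(), typeof(), class(), attributes()).

```r
# Hypothetical sketch of a typeInfo()-style helper; the layout is an assumption.
typeInfo <- function(x) {
    print(x)                          # the value itself
    cat("str:    ")
    str(x)                            # compact structure summary
    cat("mode:   ", mode(x), "\n")
    cat("typeof: ", typeof(x), "\n")
    cat("class:  ", class(x), "\n")
    if (!is.null(attributes(x))) {    # NULL means no attributes are defined
        cat("attributes:\n")
        print(attributes(x))
    }
    invisible(NULL)                   # called for its printed output only
}

x <- "1"
typeInfo(x)   # a character vector of length one
x <- 1
typeInfo(x)   # a numeric vector of length one, type double
```

Printing only when attributes exist keeps the output short for simple objects like the single number here, while still showing names or classes on richer objects.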
This is very simple, because it's just a single number, but things can become more complex and complicated, and you'll find where that's useful. So even though this is a simple function, I'd like you to save it for future reference. You have a file here which is called functionTemplate.R. It's very similar to the script template that I showed you before. What I'd like you to do is to take your function template and save it under some meaningful name, for example typeInfo.R, and then take the body of this function, whatever you want here, and comments, and put that into typeInfo.R. So at the end you should have in your folder a file called typeInfo.R, which is based on the function template.

Then I would like you to configure your project to load typeInfo.R upon startup. (Sorry, there is a typo here.) We've briefly discussed how to do things that get executed on startup, so you should figure out how to do that. When that is done, exit RStudio, start it up again, and use File, Recent Projects to get back to the intro project. At that time, that function should be loaded and available. So these are rather high-level instructions. This is not step-by-step; you'll have to figure out how to do this.

So I think we're there. This kind of customization is really, really useful. Whenever you start working on a project, you'd like to have shortcuts to the paths where you keep your files. You'd like to load particular packages.
You'd like to load particular functions, and so on. So this kind of work is useful.

Now, .Rprofile in this setting, as a file within your project, is local customization. It doesn't affect the global behavior of RStudio at all. So one thing that you would do, if you want to have a particular customization wherever you are and whatever you're doing, is to define it globally. For that purpose, I would have a file, perhaps called utilities.R, among my resource files, and that would contain all the paths and functions and so on. Then, in the .Rprofile of a particular project, I would simply source that utilities.R, wherever it's located. So then you can have local customization specific to your project, and global customization specific to your overall tastes of how to work with R or RStudio. This really makes your life easier and, once again, keeps your work more reproducible.

Now, let's look a little bit at the data types, and for that we will at first open a particular text file. So where does this text file come from? Here, there are a couple of gene names that are associated characteristically with particular cell types. This is an example of what, unfortunately, you encounter a lot. It's an informational dead end. It's an image. There's no good way to work with it.
This is a pasted image, and I can't select that text or get at that text. I started typing it by hand, then I got tired of that very quickly, and then I took that image and uploaded it to an online OCR service. OCR, optical character recognition, notionally can go through images and extract textual information, and that worked after a fashion, except it was maybe 90% correct, and I had to edit the remaining 10% by hand. I think in the end I would have been faster if I had done it by hand to begin with. So this is not untypical: you have data, and it's in a format where you can't really use it. The take-home message here is: if you want to have your data reproducible and reusable, don't put it into images. Keep it in some format that the computer can work with.

Anyway, so this is where these names come from, and they are in the Figure 3 characteristic genes text file. So our next task is to take this list of characteristic genes and get it into a text vector, i.e. into something that looks perhaps like this: a character vector. This one has five elements with particular gene names. If we look at that in more detail: this is a vector, mode character, type character, and so on, five elements.

So the challenge, the task: take this file and get all of these gene names into a character vector. How would you do this? You have a text file here, somebody's given this to you, and you want a vector like this, where each of these gene names is an element in a character vector, so that we can use them later on. Of course, what you could do is simply type this by hand. But the question is: how can we avoid that pain? Any ideas, any suggestions?

Use a read command: that is one possibility. There are a variety of read commands that open files, read them in, and work with them. Are there essentially manual alternatives?
Say, something else you could do? Yeah, we could copy and paste the whole thing. Well, can we? What happens if we copy that, and where do we paste it to? All right, let's try that.

Okay, so I copy and paste this into an R script. Well, that's not yet very helpful, because if I assign that, nothing good is going to happen, because, of course, it's not yet a string. So we have to make these things strings. So what happens next? Yeah, well, that's the canonical and correct way to do it, but we should be able to do it this way too, first.

Okay, so we are here with quotes. At first you said: put quotes around each one. This already is a valid command. Now I have this value in here, but it's not yet a vector of gene names; it's just one character element. So let's assume I want a second one. By the way, if I select something and then press the quotation mark key, everything that's selected goes into quotation marks. Well, if I do the same assignment again, that's not useful, because that just overwrites it. I somehow need to tie them together: c(). So what is c()? c is shorthand for concatenate. Now I have a character vector with two elements, and I could just continue doing that. Possible. If I just have seven of these, that's probably what I would do. If it's 12, I would already start thinking about something less painful.

Something that's less painful, perhaps, is to use find and replace. Oh, but I don't think I can find and replace line breaks. I would need to find a line break here, and, well, maybe I can; let's see. It should be this, and replace with... Yeah, it doesn't recognize line breaks. If I replaced every single line break with quote, comma, quote, I would only need to add the first and the last quotation mark, and I'd be done. In a text editor I can do that. But again, this doesn't look very appealing.
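The quoting-and-concatenating route just described might look roughly like this. The gene names are placeholders for illustration only (they are not taken from the file), and the variable name genes is likewise an assumption.

```r
# Placeholder names for illustration; not the contents of the actual file.
genes <- "geneA"                  # one quoted string: a character vector of length 1
genes <- c("geneA", "geneB")      # c() ties several strings into one vector
genes <- c(genes, "geneC")        # c() can also extend an existing vector

length(genes)                     # number of elements
genes[2]                          # access one element by position
```

Assigning a second string to the same name would simply overwrite the first, which is why the elements have to go through c().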
There's an alternative that I could think about, and that starts from putting this entire thing in as one quoted string. Now these are individual strings that are separated with a line break. Where does that get me? A little further, perhaps, because if they're all separated with the same thing, then I can split them apart on that. There's a very convenient function called strsplit(): well, "string split", but it's spelled strsplit. So strsplit() requires something to split, which is its argument x, and something to split on. So I can tell it that perhaps my long character string would be separated with blanks, or with colons, or with commas, or something else, and I could enter these, whatever they are, as the thing to split on.

So that looks nicer, except for one thing. What's this here? That's a bit unexpected. That's not what a character vector should look like. So let's see what this actually produces. I take the output and put it as input into my function typeInfo(), and it tells me this is a list. A list is one of the ways that R can store data. We have vectors; we have matrices, which are multi-dimensional vectors; we have data frames; and we have lists. We'll talk a little more about the differences between these later on, but this is a list. It has to be a list, because strsplit() is a vectorized function, i.e.
it doesn't only work on a single element. It also works on a vector of elements, and then it has to produce output for every element of that vector. So the first one would go into this double-bracket one, [[1]], the second one would go into double-bracket two, [[2]], and so on. And since they can all be of different length, we can't predict how long these elements are, and this is why this has to be a list.

But it's a list, and that's not exactly what I wanted. What I wanted is a single vector. So what I should be using here is a command that's insanely useful, because very often the output of functions is a list, and we don't always want that. So you can unlist them: unlist() takes an entire list and flattens everything into one single vector. So in this case, we then finally have our vector.

So that's basically the manual way to do it. And yeah, you know, it's kind of hackish to work this way, but sometimes hackish is fast, and you don't have to think a lot, and you don't have to look up the documentation of how exactly the readLines() or read.csv() function call is formatted, and so on. So this may get you to where you need to be faster. Now, this is not the industry-strength solution, the right way to do it.
That one, however, I've put into here: the sample solution for reading text. So: read the contents of a small text file into R.

The first option we talked about is to enter the data one by one. You could use a text processor to replace each occurrence of a paragraph break with the string quotation mark, comma, space, quotation mark, and then wrap the string in c() and assign it. So this was the first version. You can also enter it by hand, but all at once: place the whole thing into quotation marks, assign it somewhere, and then use unlist() and strsplit() to break it up.

But you can also use the function readLines(), and you can also use the function read.csv(). Now, these are very different functions. One is intended to simply read single lines of text, and the other one is very useful: it reads comma-separated values. That's often the canonical format in which you get data that you want to work with, which comes from somewhere else. Very often it is shared as CSV, comma-separated values, or sometimes TSV, tab-separated values. There's a function in R to read CSV, and there's a function to read lines. In this case, with one element per line, they work the same; but if I had more than one element per line, I would need to use read.csv(), or read the individual lines and post-process them by splitting them as I read them in.

So if I use readLines(), well, this is uneventful, because it's just a character vector. It looks the same as it did before. Obviously, if I use read.csv(), this is slightly different. Let's look at the difference here, because there's an important distinction between the two. readLines() gives me a vector, and every element of that vector is one line from the input text file. But what does read.csv() produce? How would you find out what it produces? Where do we find that information?
?read.csv. So this data input help page refers to a number of functions that have a large number of options, which are important, and they all have a similar purpose, so they're grouped in a single help page. One help page can contain more than one function. In this case, we have read.csv here. It explains the arguments, or parameters, what they are and the differences between them, and then the Value. The value is the output of the function, that is, what the function returns. The function returns a data frame.

So what's a data frame? Anybody remember? Fundamentally, as basic R objects, we have vectors. Technically, even a single element is a vector in R; it's just a vector of length one. But vectors can be longer, and we have matrices and arrays, which are vectors with two or three or more dimensions. The limitation for these is: a vector can be character, it can be floating-point numbers, it can be integers, it can be Booleans, but all elements in that vector have to be of the same type. (Arbitrary R objects, such as functions, can only be collected in a list.) So we can't mix Booleans and characters and numbers; it all has to be of the same type. The same thing holds for matrices, two-dimensional objects: everything in a matrix has to be of the same type.

If we don't want everything to be of the same type, we can use data frames or lists. If we want everything in a single column to be of the same type, a data frame is the convenient thing to use. So a data frame in R is very, very similar to the notion of a spreadsheet that we would have, say, in Excel. We have different columns of logically related information, and different rows of entities that hold the instances of this information. So for example, a data frame can have strings as the first column and numeric values as all the other columns. The strings might be gene names, and the numeric values might be
expression values, and we'll come to exactly that example a little later on. However, everything has to have the same type within a column; we can't mix types in a column. And secondly, all the columns have to be of the same length. If that's not the case, we have to use lists. Lists are completely free. We can mix and match everything we want in there, as long as we can define how to address it and how to identify the substructure within the list. So the list is the most flexible, but perhaps also the most idiosyncratic to work with.

Probably, for general work, most of what you do in the real world is going to be with data frames. Most of what we do in here is with vectors, because we simplify things a lot. But basically, let's consider the data frame to be the paradigm of how we work with data in R.

Now, read.csv() produces a data frame, and we can see that if we compare the output of our function typeInfo(). So this is the characteristic genes as read by readLines(). It gives me the values; it tells me it has 46 elements, and they're all mode character and type character, i.e. a character vector.

This is very different output. So what read.csv() returns is a data frame. The data frame has one column, which is V1. I didn't specify a column name, so it created a default column name for me. Moreover, the rows have individual row numbers, or row identifiers. It's a data frame: 46 observations of one variable, and this one variable V1 is a set of character elements. It has mode list, type list, and class data.frame. So this class data.frame is crucial for R to recognize what it is, and this class signature allows functions downstream to access these objects in the right way.
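The three routes discussed so far can be compared on a small scale. This is only a sketch under stated assumptions: the file is a temporary one with three placeholder names rather than the 46-gene file from the session, and header = FALSE and stringsAsFactors = FALSE are passed explicitly so the result does not depend on the R version's defaults.

```r
# Write three placeholder names to a temporary file, one per line.
tmp <- tempfile(fileext = ".txt")
writeLines(c("geneA", "geneB", "geneC"), tmp)

# Route 1: one long quoted string, split on line breaks, flattened from a list.
manual <- unlist(strsplit("geneA\ngeneB\ngeneC", "\n"))

# Route 2: readLines() returns a character vector, one element per line.
asLines <- readLines(tmp)

# Route 3: read.csv() returns a data frame, here with a single column.
asCsv <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE)

class(asLines); typeof(asLines); length(asLines)
class(asCsv);   typeof(asCsv);   dim(asCsv)
names(asCsv)    # the default column name
asCsv$V1        # the single column is itself a character vector

unlink(tmp)     # remove the temporary file
```

The data frame wraps the same character vector in a list-like object with a class attribute, which is what the typeInfo() comparison above shows.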
So we don't have to worry about the types. If somebody writes a function that in principle takes data frames as input, it can recognize that this object is indeed a valid data frame and then just use it. So that's very convenient. There's an attribute here, $names, which is the column name, and there's an attribute row.names, which are just the individual row numbers. We can change these to other values, and we'll do that a little later on.

Okay, so these are some thoughts and observations on how to get simple text data into R. Now, we've worked a little bit with text. We've created a text vector, we've read things into a text vector, and we've used strsplit() to break a longer string apart. Let's consolidate that a little bit.

Specifically, when you are working, say, with phylogenetic trees, in order to label genes that you compare, say you got them from a database, it's often crucially important to keep track of what organism they came from, and to have concise labels that allow you to plot some information on a larger-scale plot. It's often very useful to take the binomial name, which can be very long, and condense that into a few characteristic letters, as a shorthand mnemonic code of where this particular gene came from. So I'd like you to write a function, based on what you currently know, that converts a binomial scientific name into a five-letter label. If the binomial name is Homo sapiens, the label should be HOMSA. If it's Drosophila melanogaster, it should be DROME. So this simply generates...

(Are you projecting what we do in here out to the public? They're getting very excited about this. Okay.)

Okay, so first I'd like you to figure out how to do this in principle. So perhaps you start off with simply defining some string s, and manipulating that, and working with it, and trying to figure out: how do we access the individual elements, perhaps using strsplit(), and how do we extract
individual characters from these elements? Part of this is: I'm not telling you how to do it, because if I just tell you everything you need to do, this is not going to help you, because your real-world problems are going to be different anyway. So part of the challenge here is: how do you figure out how to achieve something like that? It's pretty easy to imagine there should be a way to get characters from a word, say the first three characters or the last five characters. But how: this is for you to figure out, and thus develop your solution strategies. If you despair, there's a sample solution, the sample solution for biCode, but leave that alone for now. Try to solve it on your own. This will take us into lunch.

This seems like a good opportunity to revisit what we've been doing here, and why and how it actually worked. The ultimate goal of this little exercise is to help you learn to structure a problem into individual steps, and to find out how to implement these individual steps in code. And we're going to try to do that again and again and again. This is really key. That's the beautiful thing about working with software and working with programming: in order to have any software, any program, work, at first you need to structure your ideas, and usually they get a lot better when you do that.

Now, well, let's have a look. I'm going to open a new script file to solve this problem. What should I do? Okay, so the first thing is: do I write a script file, or do I write a function, or do I simply write a series of commands, or what should I do? Help me out here. A function. And what would the function do? Yes, but what would the function do? Right, but what is the function of the function? How would I use that function? Not what does it do; how would I use it? In a script, or on the command line, once it's defined. But what do I feed the function? What should the function return to me? That's my idea of "how would I use it".
Okay, so my input would be a character string, yeah, and my output would be a new character string. Okay, so let's make a function. Give me a name. What do you call it? Just a little s?

Usually I like to have my function names a little more explicit than that. I use single-letter lowercase names only for variables that I don't care very much about and that I simply use as intermediate placeholders. If that's not the case, spend some time to give your functions and variables names that you can immediately recognize. It may mean more typing as you go along, but the guy who profits from that little bit more typing is you, half a year from now, when you're going to try to read and understand your code, and you look at s and think to yourself: what was s again? So make it explicit. There's a very true rule: code is read very much more often than it is written. So make it easy to read.

So I could call this... well, since we're trying to make some binomial codes, I've called this biCode in previous incarnations of this. Okay, so I need a function, and that function takes some input, some variable input, and that could be s: just whatever goes in. s is nice here; it's just shorthand for a string.

What should my function be doing? Now, if you read this little specification carefully, you may notice that this is uppercase and this is lowercase. It doesn't really matter, but for what I had in mind here, I would like to convert to uppercase, because we can do that. Now, if I think about it, I'd probably want to convert it to uppercase before I split it. Once I've split it into words, I'd like to retrieve the first three characters from the first word, and then I'd like to retrieve the first two characters from the second word. And then I look at this and say: I don't like that. What I don't like about it is this "first three" and "first two". These are, you know, magic numbers. Where does that even come from? What does that mean?
So I prefer that. After that's done, paste them together or assemble them somehow, and return the result. So this is how I break down my task into individual steps. Now all I need to do is go through my little list of individual steps and execute them in code, one by one, and that should give me my function.

So the first thing is called toupper(). This simply takes the string s that comes in here and converts it to uppercase. How do I check that this actually works? When I define some input like that, I could use the debugger and step through my function step by step and look at all the intermediate results. But what I usually end up doing in practice is that I define a variable of the expected type with the same name here, and then I can just go through the components of my function and test them. So I'll make a little s. What's your favorite organism that we haven't used so far? Something new. Oh, that's such a nice one. Okay, so this is my s now. Good. So does this work? Test it here. Yeah, this gives me the expected result. You see, s is "Gorilla gorilla", and if I execute the whole line, my s has changed and is now all uppercase.

Good. Next step: split the input into words. What should I be writing? strsplit(). Split what? Split s, and split on the blank character. Okay, and that, as I see, gives me a list. So what I'd also like is to unlist() this. This now gives me a vector with two elements, and I can access the individual elements in the usual way, with square bracket notation. So for example, I can just invent a new variable t, to which I assign this, and then I can say t[1] and t[2]. Well, that's the same thing. Did we do something wrong? No, it's correct; it just happens to be twice the same word. So t[1] and t[2] in this case are identical.

Now, I don't actually have to use an intermediate variable here. The result of this expression here is a vector, even though it hasn't been assigned and it doesn't have a name.
I can still access components of that vector with the square brackets. So if I'm desperate to write very concise code and save characters and lines, I might be doing something like this. I think you'll agree, though, that even though this is perfectly valid, it is also slightly opaque. So whenever I find myself doing stuff like that, I usually say: well, maybe it's time for an intermediate assignment, assigning it to something that I actually know. Because after all, if I ever want to debug this and everything happens in this one expression, in this one line, it becomes very hard to isolate the components where things might have gone wrong. But valid syntax it is: I can take the output of a function and simply work with it by extracting subsets, substrings or whatever.

Okay. Now our t contains two elements of a character vector, "GORILLA" and "GORILLA", and I want the nFirst first characters from the first word. So let's define what nFirst and nSecond are. Obviously you could just write the actual numbers somewhere, but again, those are magic numbers, and this just makes these two values explicit. And of course, if I define it that way and these numbers appear several times in the code, or if it's ever used for something else, it becomes a lot easier to change things later on.

Okay, but how do I get three, or whatever nFirst is, characters from the first word? strtrim(). Okay, I haven't used this; it seems to work. What do I get here, and why? First of all, t contains two elements, and apparently strtrim() operates on both elements, first one and then the other. Secondly, strtrim() apparently takes the first three letters, and that's implied; I don't actually see it stated here that it's not the middle three letters or the last three letters, but the first three letters. But other than that, it seems to be perfectly usable.
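The exploratory steps described so far can be retraced at the console. This is a minimal sketch, not the script shown in the session: the variable name words is an assumption (the session used t, which also happens to be the name of R's transpose function), and nFirst and nSecond are spelled as they are spoken.

```r
# Test input of the expected type, as used in the session
s <- "Gorilla gorilla"

s <- toupper(s)              # "GORILLA GORILLA"
words <- strsplit(s, " ")    # a list holding one character vector
words <- unlist(words)       # flattened to a plain character vector
words[1]                     # "GORILLA"
words[2]                     # "GORILLA" (same word twice for this input)

# The result of an expression can be indexed without naming it first.
# Valid syntax, but harder to read and to debug:
unlist(strsplit(toupper("Gorilla gorilla"), " "))[2]

# Named lengths instead of magic numbers
nFirst  <- 3
nSecond <- 2

# strtrim() is vectorized: every element is trimmed to the given width
strtrim(words, nFirst)       # "GOR" "GOR"
strtrim(words[1], nFirst)    # "GOR", first word only
```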
First of all, I need to make sure that it applies only to the first word. Secondly, I should be using nFirst, of course making sure that nFirst and nSecond are actually defined. So that seems to give me the expected result.

Now, the canonical way, the one I would have used, is substring. But there's an important caveat here if you use Chinese, Japanese or Korean characters, or actually any other characters that are encoded in Unicode: a character may occupy more than one byte and more than one column of display width. A typical Chinese character takes three bytes in UTF-8 and displays at double width, and functions that count positions can get confused about where the actual character boundaries are if the encoding isn't handled properly. strtrim() works in terms of display width, so apparently it takes care of that, and if there are double-width characters it may be preferable. Nevertheless, I think substring is the canonical solution: substr(x, start, stop). So, substring of the first element, starting at one and stopping at the third letter, and in a similar way for the second element.

Now, in order to use these further, we could assign them, or we could also paste them together directly. There are essentially two ways, two and a half ways, to paste. The first two are to use the function paste() or paste0(). The difference between these two is that paste() by default pastes everything with one blank character of separation, whatever you give it, and it can collapse things together. So paste() is especially useful, actually pretty much indispensable, if you want to paste the entire contents of a vector or matrix and concatenate all that. For example, if you have a vector and you want to make a comma-separated list of values, you can use paste() on that vector and define a comma as the collapse character. paste0() means: just join everything together without any intervening space. Probably the most versatile similar function is this one here: sprintf().
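The extraction and pasting steps can be put side by side and then assembled into the function the exercise asks for. This is a sketch under assumptions: the names biCode, nFirst and nSecond follow the spoken description, and the exact code shown in the session or in the sample solution may differ.

```r
words   <- c("GORILLA", "GORILLA")
nFirst  <- 3
nSecond <- 2

first  <- substring(words[1], 1, nFirst)    # "GOR"
second <- substring(words[2], 1, nSecond)   # "GO"

paste(first, second)                        # "GOR GO" (default separator is a blank)
paste(first, second, sep = "")              # "GORGO"
paste0(first, second)                       # "GORGO"
sprintf("%s%s", first, second)              # "GORGO"
paste(c("a", "b", "c"), collapse = ",")     # "a,b,c" (joins the elements of one vector)

# The individual steps assembled into one function
biCode <- function(s) {
  nFirst  <- 3                              # characters taken from the first word
  nSecond <- 2                              # characters taken from the second word

  s <- toupper(s)                           # convert to uppercase before splitting
  words <- unlist(strsplit(s, " "))         # split on the blank character

  first  <- substring(words[1], 1, nFirst)
  second <- substring(words[2], 1, nSecond)

  code <- paste(first, second, sep = "")
  return(code)                              # explicit return
}

biCode("Gorilla gorilla")                   # "GORGO"
biCode("Homo sapiens")                      # "HOMSA"
```

The sketch does not guard against input without a blank or with more than two words; those cases are left open here on purpose.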
This is a function that is basically inherited from a C function, printf in the programming language C, which is a formatted way to print. It has its own inner syntax and logic, with which you can get very, very precisely formatted output of letters and numbers and strings. Just to show this to you: sprintf() takes a format string for its output, and I can put anything I want there, and then with a percent sign I can specify the type of contents that should be placed within that string. So %s is for a string. I have two of these, two %s, and now I need to supply two values that correspond to them. So I could put the result of this here, and the result of that, and I get this here. The first %s has been substituted with this substring expression, the second %s with that substring expression. Especially if you want formatted output with numbers, where you want to control how many digits or how many significant digits are printed, plus some explanatory text on what you're printing and some comments, this is usually the most convenient way to do it. This %s or %d or %1.5f, or whatever these codes are, is a little bit of syntax to learn, but after that you have the greatest flexibility.

Now, for our very simple purpose here, paste() or paste0() should probably be the simplest way to do this. paste0() just pastes these two together. paste() is essentially the same thing, but by default it adds a blank, so if I don't want the blank I need to turn it off and define the separation character to be the empty string. Either of these works for our purposes. I didn't actually know about paste0() until I overheard Lauren using it this morning. It makes sense. Usually, you know, I have a very limited memory, so I try to do as much as possible with as few functions as I need to remember. That doesn't always lead to the most efficient code, but, yeah,
it's a certain kind of economy. So for me it's easy to remember paste(), and that I can turn off or specify the separating character, and generate a second intermediate variable and then return it.

Now, returning values from functions explicitly is not actually required. If there is no explicit return statement, the function will simply return the result of the last evaluated expression. I never do that, though. I believe very much in making code explicit. When I read my function, I would like it to be glaringly obvious what it is that's being returned, not just as a side effect of something being evaluated inside the function. I want it there explicitly. In my mind this makes for more readable code, and that's why I write things this way.

So essentially this, I think, probably works without an error. Let's see if it works for another organism. Any favorite organisms there? What is everybody working on? Everybody works on Homo sapiens? Okay, so that seems to work as expected.

Now, something we didn't do here, and that you can add, is to think about what happens if there's no blank. Will it just crash and burn, or what happens then? And if that's a possibility, how can you prevent it? What happens if there are more than two words, like Homo sapiens neanderthalensis? What do you get then? Is that what you want? Can you change it? And so on. Making sure that your functions don't make overly strong assumptions about what they are working with, and that they are able to handle commonly expected special cases, is just part of the game. Yeah, so in my sample solution, which I also posted here,
I actually didn't use unlist(); I accessed the list elements directly. So this is the syntax for accessing the first list element, which is a vector, and from that vector the first vector element, and here the same thing.

Okay, let's move on. Lists: let me skip over this list section. Of course, you're always welcome to revisit this code at some point. If you find that you're revisiting something and whatever you find in this script file makes no sense to you whatsoever, and may be wrong and misleading, then by all means do email me and I'll try to fix it.

I'd like to move on to data frames now. Data frames really are our workhorse. They are like a matrix, a set of data, and this can be a list of vectors and/or factors of the same length that are related. So data in the same position across the columns, in the same row, comes from the same experimental unit: a subject, an animal, a cell line, whatever.

So here's an example of generating a data frame. Data frames are created with the command data.frame(), and I can specify the column names and the column contents as I define it. In this case I have three genes with three expression values, and one of these genes is induced and two of the genes are not induced. So I put that into my data frame. Now, if I simply type the name of my data frame, this is how a data frame is typically printed: the column names are listed, and these are the individual rows. As you see if we use our function typeInfo(), the columns actually have different types. The first one is a factor with three levels, one for each gene name. The second column is of type numeric, and the third column is of type logical. So what is it with factors?
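A data frame like the one described can be sketched as follows. The gene names and expression values are made up for illustration, and str() stands in for the course's own typeInfo() helper, which is not reproduced here. One thing worth knowing when running this today: since R 4.0.0 the default is stringsAsFactors = FALSE, so the factor column seen in the session only appears if it is requested explicitly.

```r
# Hypothetical gene names and values, for illustration only
geneData <- data.frame(genes   = c("geneA", "geneB", "geneC"),
                       expr    = c(168.2, 872.3, 0.87),
                       induced = c(TRUE, FALSE, FALSE),
                       stringsAsFactors = TRUE)   # was the default before R 4.0.0

geneData                 # prints column names, then one row per gene
str(geneData)            # genes: factor, expr: numeric, induced: logical
levels(geneData$genes)   # one level per distinct gene name

# The same data with gene names kept as plain character strings
geneData2 <- data.frame(genes   = c("geneA", "geneB", "geneC"),
                        expr    = c(168.2, 872.3, 0.87),
                        induced = c(TRUE, FALSE, FALSE),
                        stringsAsFactors = FALSE)
str(geneData2)           # genes is now a character column
```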
We don't very often run into factors these days, except inadvertently. By far the most common appearance of factors is when they come up in data frames, where they are generated by default from character data. So if you read in a large file with character data, then all of these character data are converted into factors. Now, normally factors are things like sample and control, or male and female, or adult and juvenile, these kinds of things: categorical variables that you can apply to your data. Factors are really useful when you want to, for example, do regression analysis on categorical variables. So you have a multi-dimensional data set and you want to say: is adult versus juvenile a good predictor for disease onset? Then you can use these factors and run a regression on them and find the predictive value.

But in this case this is not what we had intended. What we had intended was that these are genes and not factors. It doesn't make sense to specify them as factors. In fact, if our gene names are unique, we'll have exactly as many factor levels as we have gene names. And there's a lot of complication in that. The most important complication is that internally these are then stored as numbers, where the numbers point into a list of names. So if you want to get them back out again, this can be very difficult. I've put a discussion of factors elsewhere in the script, but otherwise I'll skip over it.

What I'll say here is that by and large we don't want factors when we mean strings, and therefore it's important to add this little invocation here, stringsAsFactors = FALSE, every time you create a data frame. This is one of the inconveniences of R, and I actually do it this way. There's an alternative of turning it off globally, but then again, there's no guarantee that some package author somewhere won't actually be using factors and relying on the assumption that the factors are there by default, and then
I've turned them off globally and things will break in unpredictable ways. So again, trying to be explicit here, I simply type stringsAsFactors = FALSE whenever I create a data frame, or add to data frames, or convert things to data frames. Now if we do that, we have the same data frame, but this is now a character column and not a column of factors. So remember: whenever you read in things as a data frame, you should in almost all cases be turning stringsAsFactors off, unless you actually mean factors.

One of the most common ways that confusing factors appear, incidentally, is when you take an Excel spreadsheet that contains only numbers, and then you look at your typeInfo() and you find that one of the columns has turned into factors. But you expected only numbers, and they really should all be numeric. So what happened? Well, what most likely happened is that somebody entered, say, a decimal number with a European comma instead of a decimal point, or that there was a missing value and somebody typed n/a into that cell. All of these are strings. Unless you are in a European locale, a number with a comma is a string and not a number. And n/a does not mean "not available": R has NA for not available, and that has the specific meaning that there's missing data, but n/a is simply a string.

Now, I've told you that in data frames, everything in a column needs to be of the same type. So when the read.csv() function reads in a column of numbers and then encounters a string, it has to do something. Can it convert the string into a number to keep the column consistent? No. Can it convert all of the other numbers into strings to keep it consistent?
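The effect of a single stray string on a numeric column can be reproduced with a few lines of inline CSV text. This is a minimal sketch with made-up sample names and values; class() stands in for the course's typeInfo() helper. The na.strings argument shown at the end is one way to declare such a marker as missing data, not something taken from the session.

```r
# Made-up CSV content: one value was typed as "n/a" instead of a number
csvText <- "sample,value
s1,1.5
s2,n/a
s3,2.25"

d <- read.csv(text = csvText, stringsAsFactors = FALSE)
class(d$value)     # "character": the one string forced the whole column to strings

# Declaring "n/a" as a missing-value marker keeps the column numeric
d2 <- read.csv(text = csvText, na.strings = c("NA", "n/a"),
               stringsAsFactors = FALSE)
class(d2$value)    # "numeric"
is.na(d2$value)    # FALSE TRUE FALSE
```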
Yes, that it can do. So then, because of this one comma, it takes all of the numbers in the column and converts them into strings. And then, if you didn't switch stringsAsFactors off, everything is converted into factors. So the bottom line is: it's really important and worthwhile not to rely blindly on the conversion of data as it comes in, but to use a function like typeInfo() and explicitly check that you're getting what you think you should be having.

Now let's look at some of the expression values from the Jaitin et al. paper. We'll just read in supplementary table 3 from the Science website, which is an Excel spreadsheet. You'll very often get data in Excel spreadsheets. And I believe this is, actually, no, it's not zipped. It's there in plain version, and it's table S3, an .xls file. Now, I don't even know what would happen if you click on that. Should I try it? Is it going to crash my computer? No, it's smarter than that. It recognizes that it doesn't internally know what to do with XLS files, but it recognizes the extension as an Excel spreadsheet, so it loads this into Excel.

Now, do you want to work with Excel to look at your data? Maybe. Excel is an excellent spreadsheet program. It's a terrible statistics program: many of Excel's statistics functions are actually wrong, and it makes horrible and ugly graphics.
So Excel plots are usually, yeah, not very nice. So we'd like to read this data into R to actually analyze it, and a convenient way to do that is to save this file as, here, comma separated values. If you save an Excel spreadsheet, or each sheet if it has several sheets in one workbook, as comma separated values, the result is a text file, and a text file is something that you can read into R.

Now, R also has specialized functions that read Excel spreadsheets more directly, but I would caution against that. First of all, the Excel format is not an open format to begin with, and all these programs kind of work on reverse engineering and inference. I have no idea how robust this actually is. Whereas if I save something as a CSV, it's very easy to then go through the CSV file and ascertain that all the numbers in one column actually are numbers, that you have what you think you have, that you actually get the values and not the formulas behind them, and that nothing is affected by formatting, conditional formatting, bold and italic and so on. So unless you're actually desperate to work with some of the embedded functionality of the Excel spreadsheet, put it through a simple purifying step of text and comma separated values. So simply open this and save it as a CSV file. If for some reason you don't have Excel on your computer, there is a sample solution of the CSV file, which looks like this.
This is now comma separated values. Now, if we go back to this here, we might notice that there are some problems that prevent us from reading this properly to begin with. Of course we have our gene names, and we have expression values, and we have some explanations of what these expression values mean. But we also have "Table S3", and we have a table caption here, and so on. So in order to be able to read this into a data frame, we'll need to get rid of some of this information.

So our next set of tasks is: load the data of this Excel spreadsheet into Excel and save it as a comma separated values file. Examine the file with a text editor, like R: R is actually not a bad text editor at all. And then read the table into R and assign it to a variable.

Now, when I do that, there's usually additional work that I need to do with the table, like taking care of the headers and adding some different column names, and so on. So I usually just call my first read rawDat, a data frame with the raw data, and then after I process it, I assign it to a different name. The advantage is that while I work with it and fix all the problems that it potentially has, I usually make mistakes, and then I need to revert to the original version. Rather than read the original from file every time, which can take long because sometimes the files are very large, I keep it in memory until I'm happy with the analysis.

Now read this in with the function read.csv(), assign it to something, and then use the function head() to look at the contents of the first set of rows. You might then want to remove any unneeded header rows.
So skip headers that aren't needed. You should give the columns different names that actually reflect the cell type. For example, this is cell type one, and these are expression values for untreated cells or for two hours after stimulation with lipopolysaccharide. Lipopolysaccharide is a cell surface antigen of bacteria that stimulates the innate immune response. So these cells, being immune cells, react differentially to the presence of lipopolysaccharide, and we'd expect some of the genes that are important and implicated in that response to be upregulated and stimulated when you add LPS.

Now, this just says one, two, three and four, which is not very useful. So perhaps go back to your paper: in this figure here, these are the Roman numerals of the cell types that this refers to. You should be able to identify which ones are B cells, which ones are macrophages, which are natural killer cells, and perhaps use meaningful column names.

So once again, the task is: open this Excel file, save it as a CSV, read that CSV into an R object, and then give your R object, your data frame in R, meaningful column names that reflect the cell type and the stimulus status. And analyze what you got.
This is the great challenge of getting data into R. I want to keep all the columns, but as you'll see, when you just read them in and skip over the header information, they'll be all mangled up, or just be V1, V2, V3, V4. Right, so you'll need to figure out how to assign column names and how to skip lines that you don't want to read into your object. The best way is trial and error: just, you know, code something, write it there, watch it break, and then figure out how to fix the break.

Now, regardless of where you're at at the moment, I'd like to briefly take you through my sample solution. I hope this clarifies any remaining questions, and then you can just continue writing it up on your own. In my sample solution here, it's actually called differently. The first step is that I read this into a variable called rawDat, with read.csv(). This needs the file name; I tell it that I don't want to actually use a header, but simply read everything in there; and I specify that stringsAsFactors is FALSE.

So now this rawDat appears here. A convenient way to look at it is to click on this spreadsheet icon here, and this opens the spreadsheet. But of course another way to look at it is the head() function, which by default gives me the first six lines, but I can tell it to give me, say, the first 20 lines with an extra parameter and look into my data more explicitly.

Now, if I look into this, I see a number of problems. Since I skipped the header, none of this information is actually used as header information; R substitutes V1, V2, V3 and so on for that. So ultimately I want to replace this with meaningful column names. The next problem is that rows one to six do not contain data, but various kinds of table headings and explanations. So I'd like to remove rows one to six. Actually, I have to remove rows one to six if I want to treat the data below as numerical, because if there's any non-numerical data left in a column, the entire column cannot ever
become numerical. So if there's even one character string, the whole thing will remain character, and I can't coerce it to be numerical, which I want to do later on because I'd like to, you know, subtract and compare numerical values.

There's not a single row here that I can use for column names, so I need to invent column names somehow. If this row six were unique in every cell, then I could simply take that and use it as column names. Often that's the case, and then that may be the most convenient. Here I'll actually need to make up column names. And if I look at the structure of this, I see that all of the columns are character columns because of the way it was read in, and I'll need to fix that. So I need to convert the numeric columns into numeric type.

Okay, so the first thing is that I come up with a name, which I'll call LPSdat, for lipopolysaccharide data, and I'll just drop the first six rows. You might remember from the introductory tutorial that if we access vectors we can put indices into square brackets, and if we access data frames or matrices, two-dimensional data objects, the rows and columns are separated with a comma. If the space after the comma is empty, if there's nothing there, then it means: just take everything. Other programming languages, for example Perl, would use a star there as a wildcard; R simply uses the omission. Moreover, if I specify a vector of negative numbers, then this means: remove, or drop, whatever indices these numbers are. So this expression here is minus one, minus two, up to minus six, and this means not row one, not row two, not row three, and so on. So here we go. Okay, now this looks a little more manageable.
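The reading and row-dropping steps can be tried on a small stand-in file. This is a sketch under assumptions: the file contents, the gene names and the number of non-data rows (two here, six in the session) are made up, and a temporary file replaces the exported spreadsheet so that the example runs on its own.

```r
# A made-up stand-in for the exported CSV: two non-data rows, then data
csvFile <- tempfile(fileext = ".csv")
writeLines(c("Table caption,,",
             "gene,cell type 1,cell type 2",
             "geneA,1.5,2.5",
             "geneB,0.2,3.1",
             "geneC,4.0,0.7"),
           csvFile)

# Read everything as-is: no header, strings stay strings
rawDat <- read.csv(csvFile, header = FALSE, stringsAsFactors = FALSE)
head(rawDat, 3)          # columns are named V1, V2, V3
sapply(rawDat, class)    # all "character", because of the text rows

# Negative indices drop rows; the empty slot after the comma keeps all columns
dat <- rawDat[-(1:2), ]
dat                      # row names still say 3, 4, 5
```

Keeping rawDat untouched and working on a copy means a mistake can be undone without reading the file again.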
I've got rid of all the cruft, and it now starts with the first data row. The next thing is that I need column names. So I'll take the cell types from Figure 4 of Jaitin et al. and concatenate them into a string vector. The first column is genes, i.e. the gene names, then B control, B LPS, macrophage control, macrophage LPS, and so on, and the last column is the cluster to which they're assigned, which for the first ones is one, but it gets to be different clusters later. So if I do a head() again, I now have genes, B control, B LPS and so on. Right, so I assign column names by taking colnames() of LPSdat and assigning a vector of the correct length to that.

Similarly for row names. Now, what's with the row names? Well, you see, these are not actually row numbers; these are row names. If they were row numbers, we would start with one, two, three, four, five, because these are the rows of our data frame. But they've inherited the row names from whatever they used to be previously, and there we dropped rows one to six, so this starts with seven. Now, this can be confusing, because if I want to find a particular row and they're all offset by six, I'm more likely to get it wrong. So I would like to replace the current row names with simply a sequence of numbers that starts at one, to fix this row name problem. Now, how many rows are there? The command nrow() gives me the number of rows. So we have one thousand three hundred and forty-one gene expression profiles here, and this range operator simply gives me the numbers from one to 1341, which I can then assign into the row names.

So let's look at that result again. Now it starts with rows one to six, and just to make sure, we'll use the cognate of head() to look at the back of the file, which would probably be called tail(). Right: 1340, 1341, and there it ends. And I can also look at the entire typeInfo(). Here we go.
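The renaming steps can be sketched on a small stand-in table. The values and column names below are made up (the real column names come from the cell types in the paper's figure), and the conversion to numeric is written out explicitly with as.numeric() as one way to do the conversion step mentioned earlier.

```r
# Made-up stand-in for the table after the header rows were dropped
dat <- data.frame(V1 = c("geneA", "geneB", "geneC"),
                  V2 = c("1.5", "0.2", "4.0"),
                  V3 = c("2.5", "3.1", "0.7"),
                  stringsAsFactors = FALSE)
rownames(dat) <- c("7", "8", "9")   # leftover names from the original row positions

# Meaningful column names: a vector of the same length as the number of columns
colnames(dat) <- c("genes", "B.ctrl", "B.LPS")   # hypothetical names

# Replace the inherited row names with a sequence starting at 1
rownames(dat) <- 1:nrow(dat)

# Convert the expression columns from character to numeric
for (i in 2:ncol(dat)) {
  dat[ , i] <- as.numeric(dat[ , i])
}

head(dat, 2)    # first rows
tail(dat, 2)    # last rows
str(dat)        # genes: character; B.ctrl, B.LPS: numeric
```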
Now, why are these no longer character, but correctly converted into numeric? Essentially, this became possible once we dropped the first six rows: with the header text gone, the columns contain only numbers and can be converted. Okay, and there are no factors. We have characters in the gene column, and we have the expression values.

Now look at that sample solution. It's within the files. Try to compare it with what you've been doing. If there's anything that you think you have been doing more elegantly or differently, and you're confused about why it's different and why I'm doing it in a particular way, let me know, and let's