All the material in the workshop is organized as RStudio projects, and this is why we were so adamant about actually getting the pre-work done: you'll need to download an RStudio project to get started. The project we'll initially work with is here, so https, github.com, Hogan, and RIntro. To load such a project, as you know, you go to File, New Project, choose Version Control, then Git, type this address, and put the project folder into your course folder. In your pre-work, you've already done the very same thing with an introductory project. This one, RIntro, is the one we need for this workshop and course. When you are done with that, kindly put a blue post-it on your laptop. If it didn't work, wave your hand or put a red post-it on your laptop. I see some blue post-its.
Go to RStudio. Don't go to that site in Chrome. You need the address for opening the project in RStudio. All the pre-work was done in RStudio. Do you know how to find RStudio on Windows? Most of us use Mac computers. Lauren is our resident Windows expert. RStudio scares me.
Once the project has started, the first thing I'd like you to do, as the startup screen probably said when you launched the project, is to type init(). I've typed this in this pane up here, the top left pane, and I've hit Return, and of course nothing happened. If I want to execute this function, what do I need to do? Either Control-Enter, which executes things from the script pane, or type it into the lower left pane, which is the console. If I type it into the console, I just hit Return. If I type it up here, I hit Control-Enter or, on the Mac, Command-Enter. This will run a little script which will initialize two functions and probably create files that are called my intro notes, my sequence analysis and my data integration.
First of all, let's check whether everything is set up correctly. Your project folder contains all the files that you see on the right-hand side, on the lower right, in the Files pane. But R also has a concept of a working directory: by default, all the files that you want to load or work with are looked for in this working directory. Let's check whether the files that you see in the Files pane on the lower right-hand side are actually the files in your working directory. Do you remember how to get the working directory? There's an R command, getwd(), and this displays a string that says where on my computer the current working directory is. This also happens to be the directory where I know the project is installed, but I can make sure that this is correct by looking at the actual directory listing, and there's a command to list files. Did we cover that command? Is it list files, get files? I can't remember. List? OK, so I type "list", and once I've typed three characters, RStudio pops up a choice of different commands that start with this. I do indeed need list.files(), and I can either press Tab or click on it, and it brings up the command.
Now, list.files() is a function. All R functions are followed by parentheses, and I can put things into the parentheses: these are the parameters of the function, what I want the function to work on. A parameter that a function like list.files() could expect is, for example, a directory in which I want files listed. But if I don't put in parameters, functions usually have default behavior, and the default behavior here is to list the files in the working directory. So when I execute this, I get a listing of the files that is approximately equivalent to the listing that I get in the Files pane. There's one difference. Do you notice the difference?
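The checks described above can be sketched in a few lines. The variable names are made up for illustration; getwd(), list.files() and its arguments path, all.files and no.. are part of base R. The last lines are one possible way to investigate the difference between the two listings.

```r
# Where is the current working directory?
wd <- getwd()
print(wd)

# List the files there. With no parameters, list.files() uses the
# working directory by default.
filesDefault <- list.files()
print(filesDefault)

# The same call with the directory given explicitly as a parameter.
filesExplicit <- list.files(path = wd)
identical(filesDefault, filesExplicit)

# One way to investigate the difference to the Files pane: ask for all
# files, and see which names the default listing left out.
# (no.. = TRUE drops the "." and ".." directory entries.)
filesAll <- list.files(all.files = TRUE, no.. = TRUE)
leftOut <- setdiff(filesAll, filesDefault)
print(leftOut)
```

What is printed depends on the contents of your own working directory, so no expected output is given here.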
Yeah, some files are hidden. list.files(), like the normal directory functions in your operating system, Windows or Mac, doesn't show the so-called hidden files. Hidden files are files whose names start with a period, a dot at the beginning. The Files pane graciously shows us all the files that exist, and that's really important. But list.files() by default omits the files that start with a dot. And these files have some special behavior. For example, .Rprofile in your working directory, if it's a project directory, is a file that is automatically loaded when R starts. Here, .Rprofile is this little file that defines a function and then prints out some strings: a welcome, and "type init() to set up the session".
OK, now the way we're going to proceed in this workshop is mostly to practice principles of development. In the preceding five minutes I've mentioned five or six different R functions and commands, and there are hundreds by default. There are thousands if you include all the special commands that get loaded from packages, and there are literally thousands of packages that extend R's functionality. It's impossible to keep track of everything that happens in the R world. So that's not the way you do it. The way you work with R is not to try to memorize all the different functions. The way you work with R is to adopt sound principles of solving problems: take a problem, break it down systematically, step by step, and then go through the steps individually and figure out, well, how do I express this in computer code? How do I express this specifically in R code? What I teach is actually something of a subset of R. It doesn't use all of the R functions that we could use to make maximally efficient and maximally concise R code; it's a somewhat pedestrian way of doing things, for several reasons.
One of the reasons is this: if, rather than using functions that do everything in one expression, we do things step by step, it's much easier to troubleshoot and to verify that what we're doing is actually correct. And that's hugely important. The worst-case scenario is that you think you're doing something right but you're not, and this leads to a nasty letter from the editor of a journal to your supervisor that says this paper has to be retracted because the data was wrong. You never, ever want to be in that situation. So validating things and making sure that what you're doing is correct is the most important thing you have to do. The complexities of actually getting things right, especially in bioinformatics, are huge, as you know and as we'll encounter in this workshop. Taking things slowly and working through them step by step allows you to validate at every single step whether what you're doing is what you think you ought to be doing, or what you think ought to be happening. And that's really important.
The other thing is, maybe 10 years ago we would still have discussed what the best language to learn for bioinformatics is, or what the best language is to do this. Nowadays we don't do that anymore. We realize that sometimes we need to program and compile code in C++. It just has to fly, so we need a fast compiled language, C++. Or we have to contribute to a large enterprise-scale project, so unfortunately we'll have to write some Java code. Or we're collaborating with people who are working in Python, so we start writing some Python code. Or we're doing the fun data science analysis and machine learning and mangling of data, and that's where R really shines. So there are different languages that are good for doing different things, and we kind of have to be able to read all of them and understand them. Usually we all have a favorite computer language that we go to for our everyday tasks.
But really, we should be agnostic to the particularities of the language. Now that means that if we write very, very idiomatic code in a language, our code is locked in. People who have very little R experience but come from Python might have problems understanding our code. These people might be your summer students, who are learners and can't read your code. It's like when I go to an international conference and say something like "this came out of left field", or any other of the many, many North American baseball idioms: I'm likely not to be understood. Idiomatic use of language is nice and elegant if you want to write a novel, but it's not very good for sharing information. So the subset that we use here is actually very similar to the code that you would write in Python, for example. There are some syntax differences, but it's probably really easy to take it command by command, expression by expression, and simply translate from R to Python. So this is one of the goals here: taking things slowly, making things explicit, not being too idiomatic, not using all of the possible shortcuts that R provides. That's actually a supremely important part of software development. If you want it to work, make things explicit. Translate implicit knowledge into explicit knowledge.
So the way we'll proceed here is that we'll define a couple of tasks, like mini projects. I'm not even going to tell you how to solve them. You solve them on your own. I'm sorry. That's the way life works. Nobody is ever going to tell you how to do things. Well, at least not in a sustainable fashion. If I just show you code that works, you'll maybe learn a little bit. But once you step out the door, it's going to be very difficult to adapt this to the kind of code that you actually need to run at home. So we'll focus on problem-solving strategies, on how to find answers and how to implement them. There are four little mini projects here.
Not all of them are in the GitHub project yet. But in principle, we'll be working on simple sequence analysis first. We'll go through a larger project of data integration. We'll work with numeric data, in particular protein structure data, and we'll be using some of the Bioconductor tools to round it off. These tasks are all contained in various R scripts. And we may update them, or actually I'm sure we will update them, during the workshop. And that's actually very easy. This is a great paradigm for sharing work. The R project lives on GitHub, which is a publicly accessible site. GitHub also offers private spaces if you need to share confidential work with collaborators, but you need to pay for these. The public sites are free and accessible to everybody. So you download projects from GitHub. If you own the project, you make your changes locally, you commit them to Git, and then you upload the changes to your master copy. Once your collaborators or the people in your class then download this, all of the new updates are merged into their copy of the project. So: I make changes, I upload them to GitHub, you pull the new version of the project from GitHub, and you have the updated files.
Now there's a caveat to that. If you edit R scripts that I edit too, your edits will be overwritten. And that's not good, because what you really have to do during this workshop is take lots and lots of notes and write things: comments, code experiments, all of that. And if these were overwritten every time I update the project, you would rightly be very unhappy with me. So I've written the init() function in such a way that, when you run it, it takes the R files that we will be working with and makes a local copy for you. These files are called "my" something. For example, my data integration, my intro notes, my sequence analysis. They are derived from data integration or sequence analysis dot R. Those are the project files that I've created.
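The workshop's actual init() function is not shown in this session, so the following is only a sketch of the general idea: copy a script to a personal "my" version once, and never overwrite that personal copy afterwards. The helper makeLocalCopy(), the naming scheme and the file name exampleUnit.R are all invented for illustration; file.exists() and file.copy() are base R.

```r
# Hypothetical helper: copy a script to a personal "my..." version,
# but never overwrite a personal copy that already exists.
makeLocalCopy <- function(sourceFile, dir = ".") {
  from <- file.path(dir, sourceFile)
  to <- file.path(dir, paste0("my", sourceFile))  # placeholder naming scheme
  if (!file.exists(to)) {
    file.copy(from, to)
  }
  return(invisible(to))
}

# Demonstration in a temporary directory with a made-up script name.
demoDir <- file.path(tempdir(), "localCopyDemo")
dir.create(demoDir, showWarnings = FALSE)
writeLines("# original task script", file.path(demoDir, "exampleUnit.R"))

myFile <- makeLocalCopy("exampleUnit.R", dir = demoDir)

# Simulate taking notes in the personal copy ...
cat("# my own notes\n", file = myFile, append = TRUE)

# ... then an update of the original, followed by a second initialization.
writeLines("# updated task script", file.path(demoDir, "exampleUnit.R"))
makeLocalCopy("exampleUnit.R", dir = demoDir)

# The personal notes are still there.
print(readLines(myFile))
```

The design choice in this sketch is that the personal copy is created only when it is missing, so running the initialization again after an update leaves your notes untouched.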
But these "my" whatever files are the files that you actually work with. You edit them, you put your code experiments into them. They're not going to be overwritten, because they don't actually exist on GitHub. They only exist locally. So save these files, but don't commit them to version control, because that will create a conflict.
Now, keeping a journal is supremely important for this. The important things that you take home are not the scripts. The important things that you take home are concepts, ideas, attitudes, little tricks that we've talked about that are not obvious and that you won't easily find with a Google search. And these are things you write down as notes. So write as much as you can. You have notepads. Maybe some of you have really gone out and got yourselves a nice journal for writing your important workshop diary. Code examples and task annotations go into these "my" whatever files. But concepts are much better paraphrased and then handwritten in your journal. And you will find that this dramatically improves your focus and your understanding. When I go to a lecture, I'm always writing as the lecture goes along. At the end of the lecture, I may never look at these notes again. But if I don't do that, a day after I walk out the door, I've forgotten what the lecture was about. If I write it down, that doesn't happen. I don't know. Maybe this is just me. Maybe it's your experience too. If you've never tried it, try it in this workshop, and maybe you'll discover something new and very, very valuable.
So let's have a brief look at what's in the box here, what we have in the Files pane. These top files here are files that work internally. .Rprofile is the first file that gets loaded when the program starts. init.r is a file that actually initializes your session. There's a reason why I don't initialize through .Rprofile, which is a bit technical. There's a folder for old materials.
The workshop that I'm teaching this time is completely redesigned from everything I've ever done before. In particular, in the past I worked a lot with just going through code scripts and explaining them. I'm not sure that's the best way to learn. There's a place for that. But I think what we need in this introductory tutorial is more things that you do yourself. So I've completely changed this.
There's a folder with assets, which actually just contains a PDF that we might discuss later. There's a folder with data, which has some data files that I'm providing. Some of these overlap with files that I actually want you to download, so we can use the ones in this folder as a backup. But for some of the files, I'd rather like you to discover where they live and how to get them from the web. These two files are the source files for the R units we'll be working with. There's a file called function template. If you write your own functions, here's a template for how you can do this. We'll go through that later; I think we'll have a task of putting a function into this function template. It includes headers that describe what the function does, what the parameters are, what the resulting value is, and so on. Similarly, there's a file called script template, which looks very similar but is a template that can get you started on working through code in your own projects in a systematic fashion. There's a folder called R. Typically, a folder simply called R contains R functions that you've written for your particular project. There's a folder called sample solutions. Don't touch that. It has my sample solutions, but you're not going to need them, because you will write these solutions on your own. You don't need my stuff here. This is just for me to check back if I get confused and don't know what I'm doing.
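The contents of the function template are not reproduced in this session. As a rough idea of what a function file with such headers (purpose, parameters, resulting value) might look like, here is a minimal sketch; the header layout, the file name and the function countCharacter() are invented for illustration and are not the workshop's template.

```r
# exampleFunction.R
#
# Purpose:     Count how often one character occurs in a character vector.
# Parameters:  x       character vector with one character per element
#              target  the single character to count
# Value:       integer, the number of elements of x equal to target
# Author:      (your name)
# Date:        (today's date)

countCharacter <- function(x, target) {
  # Guard against input that the function is not designed for.
  stopifnot(is.character(x), is.character(target), length(target) == 1)

  n <- sum(x == target)
  return(n)
}

# Example call
countCharacter(c("A", "C", "G", "G"), "G")
```

Keeping the description next to the code means that whoever opens the file later can see what goes in and what comes out without reading the body.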
And there's a folder called tests; we'll discuss a little later how to test your code for correctness.
So our first step is going to be simple sequence analysis. What I'd like you to do is open the file my sequence analysis. This will allow us to recapitulate ideas that we encountered in the preparatory workshop and extend them a little bit. In particular, we will download 100,000 nucleotides from human chromosome 20. If I do genome annotation practice work, I like to work with chromosome 20. It's a plain vanilla chromosome. It's not one of the big ones; it has about 500 genes, so it's kind of manageable. So let's consider these 100,000 nucleotides from the human chromosome, figure out the dinucleotide frequencies, and consider whether the dinucleotide frequencies are random. At the end, we want to come up with a plot that looks kind of like this. If you source this command here, this is the plot we'll have at the end. It's a dinucleotide frequency plot. We have the dinucleotides here, sorted by frequency. We have the frequencies computed. We have observed and we have expected frequencies. That's what we will end up with after this little tutorial.
Of course, if we want to do something like that, we start with reading the data. So this is our first task: find a source to download the nucleotides starting at position 58,815,001 from the hg38 assembly of chromosome 20. So what's hg38? Who uses this identifier? Do you know where it comes from? Not really. This is from UCSC. UCSC names its assemblies like hg38. Do you know what the previous assembly was at UCSC? 19, hg19. So why did it go from hg19 to hg38? The thing is, there's actually a human genome reference consortium that updates the human genome assembly. It's a product that's still in flux, and the reference assemblies have gone through many, many different stages. The prior version used to be GRCh37.
I will need to erase this. We'll need to recreate it, but I need the space. Sorry. So the previous assembly was GRCh37. And UCSC, who have done fabulously important work in integrating genomes and making the data available and visible, have also gone through different iterations of their internal data, and the corresponding one is hg19. People always got confused: why is one 19, why is one 37? So for the next iteration, UCSC aligned the numbers: the current human genome version is GRCh38, and this corresponds to hg38. Is there a better marker?
So we would like hg38, chromosome 20. We want a source for particular nucleotides. We want to download them and have them on our computer. So how do we do that? Any suggestions? Where do we find nucleotide data? On the web. Not with R. We're not even using R yet, except to take notes. NCBI is one source. UCSC is another source. Ensembl is a third source. So which one should we use? It may matter. They all have the same data, but it may be easier or more difficult to actually get the data from there, and to get the data in a programmatic way. So why don't we spend a minute: randomly choose either UCSC, or NCBI, or Ensembl at the EBI, and try for one and a half minutes to get the requested data. See how far you get.
OK, was anybody actually able to access the data? It's amazing. You know, if you have a room full of smart people, lots of things can happen. So tell me, what did you do? I went to UCSC. OK, I never remember the URL. What's the UCSC genome browser URL again? genome.ucsc? Yay, OK. So you clicked on Genome Browser. And by default, we have GRCh38, hg38, and something like this. At the top, under View. Under View? DNA. View, DNA. OK. And Get DNA. OK. And now we need to download it, right? I guess, like, for what I do, you need to copy and paste. But so far, this name is not.
This probably would work with copy and paste, but we'll try Save As. File, Save As. It says Page Source or Web Archive. Ooh. This is almost what we want, except it's in an HTML page, and we would need to edit it. If you do this only once, you're done: you can edit this and then save it the way we want. But isn't there a way to download it as a text-only file, without tags, where we don't actually have to edit the file? Editing might be somewhat onerous if you're downloading a whole 100-megabyte chromosome. So that's the point where I said, maybe UCSC is not the best solution for that. Any luck with something different? I could not select the, I don't know how to select the position. Right. There are ways to do this in Bioconductor, which automatically connects with the databases.
But I'm kind of partial to Ensembl. I think the Ensembl tools are generally, of all the tools that are out there, the ones most built with the philosophy of supporting programming, APIs and programmatic interfaces. If I just have to look at something on the web, NCBI is fine, UCSC is fine. UCSC is actually very good, very rich. But if I need to solve something with a script, if I have to do anything with programming, I usually get where I need to be faster with Ensembl. So let's try ensembl.org. ensembl.org takes me to a local mirror, useast.ensembl.org. We can search human for this region. That all looks very similar so far. In the top window, we have the genomic region overview. In the lower window, we have the details of the 100,000 nucleotides we requested. And we can click on Export data. The output should be a FASTA sequence. The location is given here; we can refine it if we need to. We can request 5' and 3' flanking sequence if we want any. We want genomic, unmasked. We don't want repeats or anything masked.
We actually want the 100,000 nucleotides that we're asking for. Now we can choose whether we want HTML or text, and we can even get compressed text if we're looking for a very large file. And now we have that. If we download that one now, this is the file. This is an actual text file. So let's download this. We want to name the file chromosome 20, 100 kilobase pairs, and save it as a FASTA file. And it appears here. And that is the actual nucleotide file. So far, so good.
What format is this? Sequences come in many different formats. This is a FASTA file. FASTA is kind of the workhorse of sequence files. It's a human-readable format, which is good if you want to check and perhaps even edit things. How is it defined? You look at it and you intuitively get an understanding of how the file is structured, but how is FASTA defined? That's really important. If we want to write code that reads and perhaps produces FASTA files, we have to understand not just what types of files we commonly encounter, but what the actual specification for the format is. So if I ask a question like "how is this defined?" and you don't know, where can you find the information? Google for it. Where do you find information on FASTA specifications? Google tells me there's a FASTA format Wikipedia page, which explains the format, has examples, and so on. So make this a habit. If I say something that you're not familiar with, raise your hand and ask what it's about, or Google for it, or find the information. You have to be active information finders, especially in this workshop.
Now, there are in fact historically different versions of FASTA. The canonical way of using FASTA files today specifies that there's one and only one header line. This header line is the first line of the file. It begins with a greater-than sign, or closing angle bracket, and it contains text of any kind.
Probably assuming ASCII text, not Unicode text, because that's formatted differently. OK? Right. So we went to the Ensembl page of chromosome 20 with the requested coordinates. OK, we'll go through that again. We go to Ensembl. On Ensembl, there's a search field. We select human as the species we want to search, and below that we paste the coordinates that we're looking for. They kindly give us a formatting example: this is chromosome 5 in rat. Ours is human, chromosome 20, 58,815,001 to whatever. That's something I can type into my notes: go to ensembl.org, select human, paste the coordinates we need, and click on Go. That takes me to the Ensembl browser for the region. Now, if I click on Export data, I get all of this verified again: the location to export is on GRCh38, the output is a FASTA sequence, the location is this, and so on. And we want genomic, unmasked. Once we're there, we click on Next and choose the output format. In this case, Text. Right, but then you have to uncompress it locally. Right, because then it's a compressed archive. So you click on Text and you get the region, and then I use my browser's Save As function. I download it to my project directory and call it chr20-100kbp.fasta. Typically, your browser will complain at that point, mimimi. Do you want to call it FASTA? It has to be a text, or mimimi. It looks like it's an HTML page, so call it HTML. I don't know why they do that. If I tell it to name it FASTA, please, then just do that. You might have to rename the file if it doesn't arrive in the directory with that name. So the end result has to be that you have a file called chr20-100kbp.fasta in your local project directory containing these nucleotides.
This is a small portion of it. Chromosome 20 is 10 megabases or something like that, so it's just a very small portion. Oh, actually, no. We can see where we are here. Right? So this is the entirety of chromosome 20, and we're picking out this little section here, which is 100 kbp.
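Since Ensembl was chosen here for its programmatic interfaces, the same region could presumably also be requested from a script instead of through the browser. The sketch below is not part of the workshop material. It assumes the Ensembl REST service at rest.ensembl.org with its sequence/region endpoint and the text/x-fasta content type, and it assumes that this server reports the current GRCh38 assembly. The helper names regionURL() and fetchRegion() and the output file name are invented, and the end coordinate is computed from the start position and the requested length rather than quoted from the task.

```r
# Build the request URL for a region of a chromosome.
# Assumptions: the Ensembl REST endpoint "sequence/region" and the
# "text/x-fasta" content type; the end coordinate is computed from the
# start and the requested length, it is not quoted from the workshop.
regionURL <- function(chr, start, len, species = "human") {
  end <- start + len - 1
  region <- sprintf("%s:%d..%d:1", chr, start, end)   # ":1" = forward strand
  paste0("https://rest.ensembl.org/sequence/region/",
         species, "/", region,
         "?content-type=text/x-fasta")
}

myURL <- regionURL("20", 58815001, 100000)
print(myURL)

# Download only when explicitly asked to: this needs a network connection,
# and the output file name is a placeholder.
fetchRegion <- function(url, destfile) {
  download.file(url, destfile = destfile, mode = "w")
  file.exists(destfile)
}
# fetchRegion(myURL, "chr20-region.fasta")
```

Whichever route is used, the result should be checked against the file obtained through the browser before relying on it.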
So I think we have data. Usually that's the most troubling step. Once the data is on the computer, we're safe. Getting the data from the web into your project directory is often much more involved than you'd like it to be. So we're going through a few examples here just to practice that step.
Now, we said this is FASTA format. I briefly mentioned how this is specified and defined. There's a first line, which is a header line with header information, and after that we have sequence in one-letter code. This is nucleotide sequence. We know because it looks like nucleotide sequence: this is not alanine, glycine, threonine, alanine and so on, and cysteine. But sometimes nucleotide sequence and protein sequence can be ambiguous. That's one of the downsides of the FASTA format. It doesn't require semantics: it doesn't require people to specify whether this is nucleotide sequence or protein sequence. Typically we recognize our four bases, but there are also ambiguity codes. For example, we can have R for purine bases, or W for bases that form two hydrogen bonds, and so on. You see that in this case the sequence is uppercase. FASTA format also allows lowercase. And there's in principle one special character that is also allowed and often used in FASTA format, which is the hyphen, which simply denotes a gap. That's really important if you store aligned sequences in a FASTA file. And when you store aligned sequences, you often put more than one sequence into the same file. This is something that we call a multi-FASTA file, and you'll encounter it quite frequently. So the file can contain more than one sequence, and in that case it would have more than one header line: each sequence would start with its own header line.
So now that we have that, the obvious next thing is to read this data into R. Right? So how do we do this? It's a text file, and we want to read this information into R. Any suggestions?
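To make the format description above concrete, here is a small invented multi-FASTA example, held directly as a character vector so that nothing needs to be read from disk yet. The headers and sequences are made up; they only illustrate header lines, upper and lower case, the ambiguity codes R and W, and the gap character.

```r
# A made-up multi-FASTA example, one line of the file per element.
fastaLines <- c(">seq1 an invented header line",
                "ACGTTGCARW",
                "acgt--acgt",
                ">seq2 a second invented record",
                "GATTACA")

# Header lines are the ones that begin with ">".
isHeader <- grepl("^>", fastaLines)
print(fastaLines[isHeader])

# More than one header line means this is a multi-FASTA file.
nRecords <- sum(isHeader)
print(nRecords)

# Everything else is sequence. Check which characters occur in it,
# ignoring the difference between upper and lower case.
seqChars <- unlist(strsplit(fastaLines[!isHeader], ""))
print(table(toupper(seqChars)))

# Characters other than A, C, G, T: ambiguity codes and the gap character.
other <- setdiff(unique(toupper(seqChars)), c("A", "C", "G", "T"))
print(other)
```

A check like the last one is a cheap way to find out whether a file contains only the four bases before any counting is done on it.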
So on your console, just type "read", and you will notice that there are many, many different versions of read. Sorry, I should really unload some packages here. There's read.csv(), there's read.delim(), there's read.fwf(), readBin() for binary, readChar() for characters. So which one would you use? How would you find out which one to use? Right: R's help function. For example, we could type ??read. This gives us information on many, many different help pages that all contain something about reading, in particular information from other packages that are loaded and available for R. So with something as generic as "read", this may not be so useful. Can you just hover over the reads, check the information for each one, and select the one that fits best for you? Right. We can do that. read.csv() reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file. So read.csv() takes a file name parameter and then uses commas as the separator for elements.
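Besides the autocomplete pop-up and the help pages, the same search can be done from code. This is a small sketch using base R functions only: apropos() lists objects whose names match a pattern, args() shows a function's parameters, and formals() gives access to their default values. The variable name readFunctions is made up.

```r
# All objects on the search path whose names start with "read".
# The exact list depends on which packages are currently loaded.
readFunctions <- apropos("^read")
print(readFunctions)

# Inspect the parameters of one candidate without leaving the console.
args(read.csv)

# The default separator of read.csv() is a comma.
formals(read.csv)$sep

# Help pages: ?name for one function, ??term to search across packages.
# (Commented out because they open the help viewer.)
# ?read.csv
# ??read
```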
This is a format that we use very frequently: CSV stands for comma-separated values. Or, for example, read.delim(): it's basically the same description, but here the separator is "\t". What's "\t"? That's a special character: it's a tab. So these are files that are tab-separated; often we store them as .tsv files, for tab-separated values. CSV files are files that we encounter often, but we don't have separated values here. We just have text, lines of text. So, as Pascaline said, we scroll down and look. readBin()? No, this is not binary data. readChar()? Well, no. Actually, we might be able to make readChar() work, but it's really for reading single characters from a text connection, something like reading input. Ignore readFASTA(), because that's a function I defined. readLines(): read some or all text lines from a connection. That kind of sounds like what we want to do: read all the text in a file. So we define a file or a text connection and then continue from there. Let's see if Google agrees: "read text file into R". read.csv, read text, this is a different package, data import, and so on. Anyway, if we browse along, we will find a number of different solutions.
Now let's use readLines(). Our file name is chr20-100kbp. Sure. So when you're looking for a function, RStudio will automatically complete it for you. But when you go to put in your file name, can't you set it up so that it automatically gives you the options of your files? I wish. I've never, maybe there's a way, I don't know. Does it do that? I think if you do readLines, file equals, and then the quote. readLines, I think in this case it's equals. You see it? OK, I didn't know that. That's so useful. It basically also works in any kind of text string.
Now, if we simply execute that command, all this will do is take the file and dump it onto the console. That's not what we want. We actually want to assign it to something. So let's assign it to something. Let's
just call it tmp and assign the result of readLines() to it, and let's have a look at what we get. OK, now we've assigned this to a variable. What's the next thing we need to do? Right: the next thing we always need to do after we read something is to check whether what we have is what we expect. So what would you check? Right, but what would you want to know about it? Whether the sequence is there, and whether it's complete. Right. So if we look into the Environment pane, this variable tmp has been created. It's a character vector. It has 1,670 elements. The pane kind of shows me what the first element looks like, but let's look at that in a little more detail. So far that looks correct: we have a vector, we have 1,670 elements, and each element in that vector is going to be a piece of text.
Important commands to use are head() and tail(). head() gives us by default the first six elements of a vector, or the first six rows of a data frame or a matrix, and it shows that the first line, as expected, is the header line, and then we have text going on. What was that other command that I just mentioned, to get the end? tail(). So tail() shows me that, you might have noticed, the last line did not extend to the end of the screen, so that's good. There are two extra empty lines here, which don't bother us, but once we actually do something with the file, we'll need to be aware that there might be empty lines. So from there, that all looks good.
So this is a vector. Let's recapitulate a little bit what we know about vectors. Values, right. So there are a number of different categories in the Environment pane. There's Data, which are tables like data frames, or lists, or more complex objects. There are Values: these are assigned variables, so basically vectors, or scalars, single variables. And there's a section of Functions which have been locally defined. I still have a lot of stuff in my workspace; I'll clear that out during the coffee break. But you don't have
any of that. So under the Values category you find the variable name tmp and the information about tmp.

Let's say we want to see only the first three lines of tmp. head gives us six, we only want the first three. How can we do that? That's one possibility: head takes extra arguments, so if we give it the number of lines that we want to see, here for example three, as in head(tmp, 3), we get the first three. But more generically, this is a vector. How do we get the first three elements from a vector? We can do this, subsetting with square brackets, for example tmp[1:3], and get the three. Now this is very flexible. Remember there are different ways to access information from a vector. One is by explicitly specifying numbers, like a range, like one, two, three, or one, three, five, seven, or one, two, three but in reverse order, and so on. So we can access this in many different ways as a vector.

Now is this a convenient way to keep the data, or what's the best way to keep the data? Now when I ask about what's the best way to do something, that's a loaded question, because the answer is always: best for what? What do we actually want to do with it? When we define what we want to do with it, it's often clear how to store something or how to operate with it. Before you do that, it's really an ill-posed question. So without specifying what we want to do with the data, this format of keeping it in memory is as good as any other, because all the data is there. It's not particularly inefficient, it's just not easy to use for particular tasks.

So what tasks could we be interested in doing with a sequence file, in principle? Finding short strings of sequence, for example, finding substrings. What else could we want to do? Calculating GC content. What else? Variants. Finding particular positions, for example finding dinucleotides. All of these have slightly different pros and cons for different formats. In principle there are two competing ideas. One idea is to put everything into a single string, one word of 100,000 characters, and that's useful, for example, to find substrings
because we can apply so-called regular expressions to that. Regular expressions are a very powerful and versatile way of finding patterns in sequences. We'll encounter regular expressions later on. Be forewarned, they're a bit salty, but extremely useful, so this is why we can't avoid regular expressions. Now if we look for a pattern like GAATTC, a format like this is really not good, because our GAATTC might break across a line, and if it does that we wouldn't find it in any of the lines of characters. So this is why, in that case, keeping everything in a single word is much preferable.

But if we have everything in a single word, how do we calculate GC content? How do we count the number of G's in a string like this? Not easy. In that case it is better to break the string apart, break the sequence apart into a vector that has one character per element. It's a vector of length 100,000 elements, keeping every element separately, and then it becomes relatively easy to count single nucleotides and so on. So that's what we'll do, and for that we will write a function to read FASTA files: it takes the file name as an argument and returns the sequence as a vector with one character per element.

And a good thing to do here would be to open the function template and edit that. So we'll call this readFASTA.R, and we will "save as" so we have a local copy. Okay, so do that with the function template: build yourself a copy, save it as readFASTA.R, and then we'll work on that.

So the general description is: read FASTA sequence file. The author is you. I'm a little bit obsessive about keeping header files intact and actually adding the information. You might think initially: I'm never going to use that, and I'm never going to share this with anybody, or it just doesn't matter, I can just dump the code into a file and it'll work just as well. Well, in principle you may be
right, but things like that have a habit of progressing and accumulating more code and accumulating more intelligent things, especially if you use them in a project and reuse them. So if you start yourself out doing things with a clear header and a clear definition and clear comments everywhere, you'll do yourself a favor, and you'll do that person who's very dear to your heart a favor, which is you yourself half a year from now, when you need to revisit that code and you'll just throw up your hands in disgust, thinking at that time: I have no idea what this code is and does. You'll need to rewrite it because it wasn't commented, or you don't know what the value is, or whatever. So keep this thing structured. It's a really, really useful habit, it'll save you a lot of time in the end, and it also helps you think somewhat systematically about the code, which is also really important.

So the date: today is the ... that's the way I write dates, and that's the only way I write dates. Why? There's a beautiful cartoon, you know, a boy and a girl standing together, and the girl asks the boy, "What's your idea of a good date?" and he answers, "Year, month, day. Everything else confuses me." So why this way?
Universal, yes, that's always good, especially in these times. But there's actually a functional reason for writing it this way, right? If you sort things according to dates in this format, if you sort them by text, they will be sorted chronologically. So this is why this format is useful.

Okay, now the function would be readFASTA. Function parameters: we've said we'll use only one, the file name. Purpose, to describe it: read FASTA file, return a vector of the sequence letters. We might refine that, we might not allow every letter, but right now that's all we'll do. The parameters are: fn, vector constant, which is the file name of the input file.

The value is what the function returns. R functions return at most one value. They don't have to return any value. Some R functions are invoked for so-called side effects, something like printing something to screen, or making a plot, or changing something in the global environment. These are side effects. But pure R functions return a value and do not touch anything else anywhere. R is what is called a functional language, and functional languages work really, really well if you write your functions so they don't have side effects. In particular, we never change values outside of the function. If you ever do find yourself doing that, don't. It's really, really bad. It can happen, you could, but you shouldn't. So change only values locally, and anything that comes out as a result is returned as a single value.

Now of course we have vectors. Are these single values? Yes, a vector is a single value. But what happens if we want a vector and an annotation and maybe another function, three very, very different things? Well, in that case you can just combine everything that the function ought to return together in a list and return one list. So in principle R functions return only one object, but the object can be arbitrarily complex. In this case the one value we return is a character vector of single-letter sequence, and in the details I describe some limitations, for example we discard the
header, which we might not want to do later on; we might want to store the header somewhere with our vector. Then we add the code and we return the result, whatever the result is.

Now here's the thing: R functions by default return the last expression, or the result of the last expression that was evaluated in the function before it arrives at the closing curly bracket. We can write it that way. I think this is very poor practice. R functions that I write always return their results explicitly. It simply makes it easier to read the function and figure out what's going on. You don't just fall through the end and then mysteriously the value arrives; you explicitly return the value. Other programming languages also require you to explicitly return values, so I don't see any benefit in doing it the idiomatic way in R. On the contrary, it becomes harder to read, is possibly prone to misunderstandings, possibly harder to maintain.

There are two other sections that I have in my function template. One is examples and one is tests, and we'll get to these a little later on in more detail. Now these are in conditional blocks, and the block is: if FALSE, then do something. So what does this conditional block do? What does a block like that do? Let's not think about why we do that, just tell me now: what does it do if I execute this block? "It does, like, reading the FASTA file?" No. So this is a conditional statement. Conditional statements work like this: they start with an if, then there's a condition, and then there's a block of expressions. If the condition is true, the block of expressions is evaluated. If the condition is false, the block of expressions is not evaluated. So what's this condition here? It's unconditionally false, it's always false. So if FALSE does nothing: this block skips everything that's within the curly brackets.

Why in the world would we do that? Well, the reason is, if we have a function template like that, we can simply source the entire thing and thereby load the function. If we also want
some additional information, for example information on how to use it, or tests that we don't want to execute every time we source the function, we put these in a separate block that is not actually executed. So when we source this piece of code we define the function, but we don't actually do anything in here. So here is where we then write our example code. If we manually go into the file, we can always select them and execute the commands in here to experiment with our examples, or run the tests, change this to TRUE, whatever. But the way we just write this function and then later on keep it in our directory to be automatically executed, we don't actually want anything to happen. We want to be able to source the thing and skip all of that. This is why we have a conditional block here that does nothing.

Alright, so the next thing to do is to write the code. Or is it? No, we never start writing code. What we do instead, when we develop any kind of function, is we break up what we want to do step by step and simply write it as comments into our code. So let's try to do that. We want a function that reads a file, breaks it apart, assigns it to a vector, and then returns the vector. So the first step is: read the file. Second step: remember it's a FASTA file, separate the header. Let's discard the header. Discard the header. Then, so in our example here we start with a vector of 1,670 lines. Now after discarding the header we have 1,669 lines. So what do we do with these? Once they are joined, it's one word of 100,000-something characters. Break it apart. Did we miss something? I don't know, this is a real question, this is not rhetorical. Did we miss something? Greg, did we miss something? You'd do it that way? Okay.

So, time for a coffee break. After the coffee break we'll actually implement this, but in principle this is the template of our workflow to develop a function. We first of all vaguely define what it is supposed to do. Then we think about how to break this down step by single step, and write our steps as something we often call pseudo
code, like simple line-by-line instructions. And then we implement the single instructions, and then we verify that what we did is correct, and then we test and write tests and so on. So there's more stuff to it, but this is the principle, and we'll follow that religiously for every single example here. This is the most important thing that I hope you'll take home from this workshop: if you want to solve something, don't start by writing R code. Writing code is the last step. First be absolutely clear on what you want your code to accomplish.
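
To make the workflow concrete, here is a minimal sketch of how the commented steps above might be fleshed out in R. It is an illustration under stated assumptions, not the implementation developed in the workshop after the break: the header layout, the local names (lines, s, v, tmpFile) and the tiny example sequence are all hypothetical, and the FASTA header is assumed to be exactly the first line of the file.

```r
# readFASTA.R
#
# Purpose:     Read a FASTA sequence file.
# Date:        (today, written year-month-day)
# Author:      (you)
#
# Parameters:  fn   file name of the input file
# Value:       character vector, one sequence letter per element
# Details:     The header is discarded (assumption: it is the first line).
#              Empty lines contribute nothing to the result.

readFASTA <- function(fn) {
  # Step 1: read the file; one vector element per line
  lines <- readLines(fn)

  # Step 2: discard the header line
  lines <- lines[-1]

  # Step 3: join the remaining lines into one long word
  s <- paste(lines, collapse = "")

  # Step 4: break the word apart, one character per element
  v <- unlist(strsplit(s, ""))

  # Return the result explicitly, local values only, no side effects
  return(v)
}

if (FALSE) {
  # Examples: skipped when the file is sourced; select and run by hand.
  # The file contents below are made up for illustration.
  tmpFile <- tempfile(fileext = ".fa")
  writeLines(c(">hypothetical header", "GAAT", "TCGG", "", ""), tmpFile)

  x <- readFASTA(tmpFile)
  head(x, 3)                               # first three letters
  x[1:3]                                   # same, by subsetting
  sum(x %in% c("G", "C")) / length(x)      # GC content from the vector
  grepl("GAATTC", paste(x, collapse = "")) # pattern search in a single word
}
```

The example file deliberately splits GAATTC across two lines: searching line by line would miss it, while the joined single word finds it, and the one-letter-per-element vector makes counting G's and C's straightforward.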