Okay. So the first thing I'd like to talk about is environment and user interface. Now, this workshop is in a different format; I've never actually done it in exactly this way before. I've been bundling all the materials into RStudio projects, and we will make extensive use of them. That has the great advantage that you can easily download the files from where I post them on GitHub. Everything is in one place, everything is included, hopefully, and everything should just work out of the box. But let's spend a moment to figure out where to put the project files to keep things organized.
So I'd like to talk a little bit about organizing your hard drive and organizing project directories. I've put some slides on Google Slides and there's a link here; if you want to follow along on your own computer, that link should work for you. There are many ways to organize the large hard drives on your computer. What works really well for me is to organize things not by categories like post-doc work or specific places I've worked in, and not by dates, but by themes. If I organize by dates, the project usually takes significantly longer than the date I anticipated, and then my folder name isn't really relevant anymore and it's much harder to find. With themes, I find over the years that, as remembering things gets more difficult, I am still able to keep up with the things that I do need to remember. So for example, you might have very top-level themes like current work that you're working on, administrative work, of which we all have too much to do, research work, which we'd actually like to do, some things we need to take care of at home, resources of a more general nature, and so on. And if we are particular about keeping these in a particular order in our directory listing, I like to prefix the names with numbers so that they're listed in a good way.
One of the advantages of the numbers is that at the terminal I don't usually need to type out the top-level directory names: I know what the number is, so I type the first two digits and press tab to complete the name. So it's fast. This folder "current", or something like it, is something I increasingly use: I just put aliases to the work that I'm actually doing at the moment in there. My main files stay in the hierarchy where they belong, but I also have them immediately accessible through a shortcut. And if you have something like research, then computational biology, then training, that might be a folder where your workshop materials can go. Thematically, you might distinguish between training, your actual projects, and again resources of a general nature.
Now I think you probably all have some kind of training directory, a CBW directory or whatever you've called it. If you haven't, it would be really useful to make one now, in a location where you can easily find it, because we'll be using this directory as the staging ground for all of our RStudio projects. At some point you'll have a home directory, the first top-level directory that your computer uses for your account. On Macs, this is basically the top-level home directory. On Windows computers, I'm always confused. I have to apologize; I know very, very little about Windows computers. There's probably a C colon slash and your username, or whatever a Windows computer usually considers your home directory. Anyway, you can figure it out; we'll find out how in a short moment. Now, a training directory could collect workshops, tutorials and other training resources. When we download a project and create a new project in RStudio, this is the folder or directory in which the project is going to be located.
And then everything that belongs to the project is going to be put into its own folder within that directory, where it remains accessible. So let's give that a try and install the first project. I'd like you to open RStudio; we'll be working only with RStudio in this workshop. The functionality is 99.9% the same as R itself. RStudio is essentially a graphical user interface for R, but it's also more than that: it's what we call an IDE, an integrated development environment, because it supports writing code and working with code. So open RStudio; it should look something like this. Now select File, New Project. I don't want to save. There are several ways to create a new project. The way we work with projects during this workshop is to use version control and download files and assets from GitHub. So we click on Version Control and select "Clone a project from a Git repository". Then we need the repository URL; you can copy and paste this. This is the repository URL for the first steps here. Next we need to find the directory that I previously called the training directory; we browse for it and click on Open to select it. The project is going to be installed as a subdirectory of this one. The middle field, the project directory name, will auto-fill if you click into it or use your arrow keys to go back into it. I don't know why it doesn't auto-fill as soon as the URL is pasted into the field. So now you should have all three fields populated, and then just click on Create Project. And that's what it should look like.
Being able to work with version control on your computer is really, really essential for your own work. One of the big problems that the age of large data has brought upon us is the question of reproducibility.
You publish something, then your data changes, and how are you ever going to reproduce what you did previously? One of the strategies to solve that is to work extensively with R scripts and R projects that keep everything you're currently working on bundled together: running everything through a script, not using the command line, not using point-and-click interfaces, but making sure that the entire analysis is reproducible from the first line of the script to the last. Not least because you're going to submit your manuscript, the reviewers are going to want some change, and then you're going to scratch your head and spend a lot of time trying to figure out how the hell you actually made that plot. If it's all bundled in the script and you can just rerun it, you'll save a lot of time. More importantly, you'll also be able to reproduce things.
However, of course, if you change your script and then something breaks, what are you going to do then? This is where version control comes in. If your script, as in these projects, is under version control and you commit your changes regularly, you can go back in time to a previous working version and recover, or to the version that you published and reproduce the results you got at that time. So version control is really essential in the digital age. We need to be accountable for our research and we need to keep essentially a paper trail of what we're doing. This is why we're working this way. Other than that, of course, working through GitHub in particular is a very, very convenient way to share code and materials with everybody on your team or in your lab. And this is what I hope to introduce you to today: how convenient it can be to have everything packaged so that you can just download it from GitHub.
Of course, only if it actually works, so we'll work out how to get our computers to do what they ought to do. Now, once you open RStudio with this project, you'll see three or perhaps four open panes. These are panes as in window pane, not pain as in frustration. Within these panes you have different tabs, so I will be using the words panes and tabs to refer to them. This is the console pane, where we have an interactive console into which we can type our commands and have them executed. This is the environment pane, and there are several tabs here that let you look at what I've done previously or what functions and data sets are currently loaded. And this is a supporting pane that, for example, lists our files, shows plots if any are generated, lists the available and loaded packages within R, and so on. Oh, and importantly, also the help functions.
Now, here's the bundle of files that we've loaded to support this. .gitignore is something you don't usually need to touch. It's simply a file that tells the program Git which files should not be under version control. Things you don't want placed under version control are large binary files. Version control works incrementally, but that only works well on text files. You have a text file, you change something, and Git remembers the change and doesn't simply store the entire file again. This makes it efficient even for large files, but it can't do that for large binary files. So if you have your original reads in BAM format, you change something there in some way and you upload that, the entire file will be stored over and over again. By editing .gitignore, we can tell Git which things should not be under version control. Things I don't place under version control are the R history file, the R project file and some other things. But this is done automatically and already supplied. Now when the project starts up, the first file that gets executed is a file called .Rprofile. And that's important.
This file can contain all of your configuration information. So for example, if you want a function loaded at startup, you could put it into .Rprofile. Now I needed to circumvent a little quirk of RStudio in order to load the R script which we are using throughout this part of the workshop. Typically, I would simply put a file.edit() command with the file that I need into my .Rprofile, and then when R or RStudio gets loaded, it should open that file. However, that doesn't actually work, because RStudio uses its own editor, so it won't recognize file.edit() that early on; at that point it's an unknown function for RStudio. So there's a detour here. Within .Rprofile, I've defined a function, init(), which simply sources an R script. So when R starts up, .Rprofile is sourced and you are prompted to execute that function. This is why the console says welcome and asks you to type init() to begin. So we type init(), the function is run, and it loads my script file, rintroenvironment.r. This is the script that we'll be working with. Maybe a later version of RStudio will at some point not require this workaround, but for now the workaround helps us open this file.
Okay, now these script files are meant for you to read along and type along as we go through this. This is your script. Put your notes in there. Put comments in there. You don't have to comment things out; you can simply type into the file. Since you are not sourcing or executing those lines, they won't generate errors. But put copies, notes, questions to us and all the information that you need into these files. If anything changes in that file, the file name in its tab turns red, and at some point you can just save it in the normal way. Okay, so let's talk briefly about the graphical user interface or the integrated development environment. I've told you about the four panes.
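Returning to the .Rprofile workaround for a moment, here is a minimal sketch of what such a startup file might contain. The function name init() matches what the console prompts for; the `scriptFile` default, the `opener` argument and the welcome text are assumptions made up for illustration, not the workshop's actual file.

```r
# Hypothetical .Rprofile contents (a sketch, not the workshop's actual file).
# 'scriptFile' and 'opener' are illustrative names; "myScript.R" is a placeholder.
init <- function(scriptFile = "myScript.R", opener = utils::file.edit) {
  if (!file.exists(scriptFile)) {
    message("Script not found: ", scriptFile)
    return(invisible(FALSE))
  }
  opener(scriptFile)   # open the script in the editor
  invisible(TRUE)
}

# Prompt the user at startup instead of opening the file right away.
message("Welcome: type 'init()' to begin.")
```

Deferring the call to the user means the editor is only asked to open the file once the session is fully up, which is the point of the detour.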
One thing that can be very useful and helps with quick and correct coding is command completion. So what does command completion mean? It means you can type part of a command and RStudio will give you a number of choices for how whatever you typed could be continued. So if I want getwd, for get working directory, I get all the choices that begin with "get", or I constrain them further if I type four characters. Then I can simply click on one to complete the command. This is especially useful if you're not entirely sure whether the R command is written in camel case, in dot format, or even in pothole case. Camel case means words written as strings of lowercase characters with an uppercase character in between to separate the words. The dot version has dots in between, and pothole case has underscores in between. Unfortunately, R is really, really inconsistent about which one is used. Is read lines in camel case, with dots, or in pothole case? I don't know offhand, but when I need it, I see lots of versions with dots and lots of versions with camel case, and the readLines I was thinking of is the one in camel case. So with completion it's comparatively easy to find the correct version. If you're typing something partially and it doesn't pop up here, it's unknown to the system, so you've probably mistyped something.
Another great advantage is that you immediately get a synopsis of the command syntax, essentially the parameters that you need to add. getwd(), for example, does not have parameters; it's a function that simply gets the current working directory. However, setwd() has one parameter, dir, for the directory. Is this all too small for the people in the back? Okay, so did you all type getwd()? What does it say? Is it the top-level directory of where you actually placed the folder?
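The synopsis that completion shows can also be checked from the console. A small sketch: the `candidates` vector is just an illustrative list of spellings to try, not anything from the workshop script.

```r
# Show the argument synopsis that command completion displays.
args(setwd)                 # prints the function header with its one parameter
names(formals(setwd))       # "dir"
length(formals(getwd))      # 0: getwd() takes no parameters

# Check which spelling of a function name is actually known to R.
candidates <- c("readLines", "read.lines", "read_lines")
found <- vapply(candidates, exists, logical(1))
found                       # readLines is the camel-case base R function
```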
Okay, this is amazing, because something that really seems to work is this: when I write and save the project, of course it's all relative to my own hard drive. So how does it know about the structure of your hard drive? I've defined the working directory to be set to the project directory, and apparently it remembers that well enough to make that stable. Okay, so the working directory is automatically set to the project directory. Are you aware of the difference between home directory and working directory? No? The home directory is the one you are in when you log in. The working directory is the current directory that R uses. If it's set to the project directory, it's the directory that contains all of these files here.
So here's a little task. Type Sys in the console to get a list of system-related functions. Have a brief look at what system-related functions there are. Find Sys.time, use tab to auto-complete, and execute it. Andrew just mentioned something that's important for you to know: a complete function call in R requires the opening and closing parentheses, even if there is no required function argument. So if we say getwd(), this has the expected result. If we leave out the parentheses, R interprets the string as a request for the contents of that R object. R functions are R objects, and if I simply give the name, the contents of that object are displayed. For getwd, I get the information that this is a function, but I can't actually get the contents of that function, because it's a compiled function for which I don't have the source code available. If I use the init function that I had before, like this, I get the actual code that the init function is defined with. So if you actually want to execute a function, don't forget the parentheses. And if you get something funny, then you did forget the parentheses; just add them. Okay, so what does Sys.time() do? It gives us the current time.
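A short sketch of the two points just made: home versus working directory, and calling a function versus naming it. Note that what R reports as the home directory depends on the platform (on Windows it may be a documents folder rather than the user folder), so treat the first line as R's view of "home" rather than a universal answer.

```r
# Home directory versus working directory.
home <- path.expand("~")   # what R considers the home directory
wd   <- getwd()            # the directory R currently works in
home
wd

# With parentheses the function is called ...
now <- Sys.time()
class(now)                 # "POSIXct" "POSIXt"

# ... without them, R shows the function object itself.
is.function(Sys.time)      # TRUE
Sys.time                   # prints the function definition, does not run it
```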
So if you ever need to change your working directory, the syntax is setwd() with the path to your directory. Note that this requires the path as a string. If your directory name contains blank spaces, a quoted string in R takes them as they are; it is on the Unix command line that you have to escape them, i.e. put a backslash in front of the space. If the directory is correctly set, then this command, list.files(), should list the files that are in our working directory. Do you notice a discrepancy between this list and this one? Do I get the same files? Am I in the right directory? Some files are omitted. Yeah, what's omitted? Everything that starts with a dot. In Unix environments, everything that starts with a dot is considered a hidden file, so it doesn't normally appear in directory listings. You can tweak your computer, and I'm sure you can do this on Windows computers too, to always include these files; if you're programming, you often need the so-called dot files. But even if they're not listed, you can always access them. For example, I can say file.edit(".Rprofile"). Even though .Rprofile is not in this list here, I can access it, open it, and edit its contents if required. So if you know the name of a file, you don't necessarily need to see it in a directory listing.
Now, if you define paths in Windows, normally you would use backslashes. However, at the Unix level on which R and RStudio are built, the backslash has a special significance: it is a so-called escape character. So a backslash t doesn't mean a backslash and then a t; it means a tab character. A backslash n means a newline character, and so on. So paths written with backslashes, as you normally would in Windows, don't work in exactly the same way. However, conveniently, R will translate this for you.
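The hidden-file behaviour and the escape characters can be tried out without touching the project itself. This is a small sketch that uses a throwaway directory; `demoDir` and the two file names in it are made up for the demonstration.

```r
# Build a throwaway directory so the demo does not depend on your project.
demoDir <- file.path(tempdir(), "dotfile-demo")
dir.create(demoDir, showWarnings = FALSE)
file.create(file.path(demoDir, c(".Rprofile", "script.R")))

# By default, files whose names start with a dot are omitted.
visible <- list.files(demoDir)
visible

# all.files = TRUE includes them; no.. = TRUE drops the "." and ".." entries.
everything <- list.files(demoDir, all.files = TRUE, no.. = TRUE)
everything

# A hidden file is still accessible by name.
file.exists(file.path(demoDir, ".Rprofile"))   # TRUE

# Escape sequences: "\t" is one tab character, not a backslash plus a t.
nchar("a\tb")                                  # 3
```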
If it recognizes that you really mean a path, then you can simply use the forward slash and it will translate this into a backslash as it sends out the system command. So even though you're used to backslash syntax on Windows computers, for our purposes you can always use forward slashes. In my script files, I usually write setwd() with the appropriate path as the first command, just to make sure that I'm in the right directory, which otherwise I often am not.
Now, small things that make your life easier when typing in the console and the editor. When you type a quotation mark, RStudio automatically complements it with a second one and places the cursor in the middle. If I use single quotes, same thing. Similarly, parentheses, square brackets and curly braces are all auto-completed with the closing character, with the cursor in the middle. This is really convenient, because when you start typing a function and all your commands have to go inside, without this you usually forget a curly brace at the end, and then you're in for a session of debugging to find where you forgot the brace, which might not be obvious. One problem, though, is that if I type a command, or execute a multi-line command, that I haven't closed, say like this here, and then I press return, R expects me to continue and it doesn't stop. This happens especially often if I copy some code from somewhere and forget to include all the balancing braces, brackets or quotation marks. In this case, I simply press Escape, and that gets me out of this particular sand trap again. And one more thing that is useful when working with the console and the editor is the history function.
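Going back to forward slashes and backslashes in paths, here is a brief sketch. The directory names and the Windows-style path are invented examples, not paths from the workshop.

```r
# file.path() joins path components with forward slashes on every platform.
p <- file.path("training", "myProject", "data.txt")
p                             # "training/myProject/data.txt"

# A literal backslash has to be written as two backslashes in an R string,
# because a single backslash starts an escape sequence.
winStyle <- "C:\\Users\\me"
nchar(winStyle)               # 11: each doubled backslash is one character
cat(winStyle, "\n")           # prints C:\Users\me

# Converting backslashes to forward slashes.
gsub("\\\\", "/", winStyle)   # "C:/Users/me"
```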
Within the console we can use the up-arrow and down-arrow keys to go through previous commands, but the entire history of a session is also recorded in the History tab of the environment pane. This is useful because it's not just there for reading; I can actually access these commands and execute them, but it's up to you to figure out how. Just give that a try. I presume at some point you've typed Sys.time(), so go to the History tab and execute it again. Double-click on it, this loads it into the console, and then I simply press return to execute it again. Any other options? Right. I've never actually done this; I see this for the first time. I always learn something new, but apparently I can click "To Console". If I use shift, I can select more than one command.
Now, you can save your history, and by default R would save your history and reload it the next time you open the project. I always turn this off, but I'd like to show you where it's turned on and off. I turn it off because I believe that the script itself should be the history. Everything you do, everything that's meaningful, should be contained in the script, and I don't rely on a second file, which can go out of sync with the script, to contain information. So I never load environments on startup. I like to start with a blank environment, no functions defined; all the functions should be in the script. And I don't load history; the script itself should be the history. If you ever want to change that behavior, which I've turned off for all the projects that we use in the workshop, go to Project Options. There it says "Restore .RData into workspace", where you can say yes or default, then "Save workspace to .RData on exit", "Always save history (even if not saving .RData)", and so on.
So basically .RData is a compressed format that contains all of the current environment: your currently defined variables and the data sets that you've loaded into memory would be stored with your session. But as I said, I've turned these off.
Okay, using scripts. Now, you can simply type R commands on the console, and this is convenient because you can work through the history, but it's much better to actually use scripts. I strongly believe all your R work should always go into scripts, and ideally all your R work should also go into projects. So, oops, this is wrong. It's not in the assets folder; I've deleted this. This project contains a sample script, script template dot R. This is a template that I usually use to write my own scripts. It's a bit wordy to begin with, but if I discipline myself and actually fill in the information there, I find life really rewards me for that: there's just less stuff I have to remember, and it's easier for me to figure things out. Especially if I do some development here and some development there, things can become very confusing, and this makes it much, much easier to keep track. So there's a little digest of what this is for; a note on the version, which is actually quite important, because you're going to develop things and refine them, and if you keep track of the different versions it's easier to make sense of what state of development you were in when you were working with this; date and author, especially author if you want to share your script with someone, so they know whom to blame for errors; and input, output and dependencies.
So, input is what you need to even run this; output is what you hope to get out of it; dependencies are things like packages that you need to load or specific resources that you need somewhere on your computer. You might make a note of things that you still want to develop later in a to-do section. You certainly should make notes of things that don't work as expected, or as they should, and put them in there. As the first command, usually set the working directory to your project directory, especially if you're not using an RStudio project.
Then I split things into parameters, packages and functions. Parameters are all the numbers, file locations and other information that you need to run this in the first place. These parameters should go here, and each should have a little comment. It's very, very confusing if you have magic numbers. So in your code below, maybe the number 7.1 appears, and three weeks down the line you will not remember what that 7.1 was about and whether it's even the right number. So you put something like that up here and say scale is 7.1, just to make something up. In this way you keep a record of what you're doing and why. If you don't put information like that into a common and convenient spot, nobody is really going to be able to understand your code when you pass it on. And the worst case is always when the person you pass it on to, who can't understand it, is yourself half a year down the line. So embarrassing; it happens all the time. So try to work against that. And really, to re-run the script with slightly different parameters, this section should be the only one that you ever need to edit.
Now, packages. There's a very, very large amount of additional functionality for R available through packages, and I think you've encountered packages in the introductory tutorial. We'll be casually loading packages and working with them as we go along. The paradigm which I usually use to code for packages in the script is
the following: if the package is not installed, obviously I want to install it; after it's installed, I still need to load it with the library command. So loading a package has minimally one but possibly two components. The minimal part is to issue the library() command to actually make the package available. But the package may not yet reside on your computer, so you might have to install it first with install.packages(). Note a frequent source of confusion: the package name here, in install.packages(), is a string, i.e. it's in quotation marks, but here, in library(), it is simply the label of an R object and therefore requires no quotation marks. So install.packages() and library() for the package is something that usually goes into our scripts.
Now, if I run the entire script all at once, I don't want to go back and reinstall the packages every time. So I use this little construction here. There is a command that's very similar to library(), and that command is require(). But require() has a return value, and that value is a logical TRUE or FALSE. If loading the library is successful, require() will give me a logical TRUE; if it's not successful, it will give me a logical FALSE. Now why would it not be successful? Well, obviously the most common reason is that the package was never installed on your computer. So what this says is: if the package is loaded successfully, the output of this command is TRUE. And this is the logical negation operator. So if not TRUE, go through installing the package; however, if it is TRUE that require() has worked, skip this whole section, because the package is already installed. The argument quietly means don't give me any feedback about what you're doing, but this only applies when I'm actually sourcing the file. So quietly will spare you from going through lots of messages in your output when you source the file or run the entire script all at once. When you run this interactively, it doesn't work in quite the same way: you will still get this
function to complain that the package wasn't installed in the first place. So if I simply run this here, RUnit apparently isn't installed, so it installs it and loads it, and then it gives me the warning message that there is no package called RUnit. Well, that's actually no longer true at that point, and there's nothing you have to do about it. The package was installed and it was loaded, but after that was done, R basically went back up the stack and said: what was I doing previously? I was trying this require() function, and that gave me a warning, and now I'm going to print out the warning because I have time to do so. So by the time this warning comes in, it's no longer current. If I use this particular paradigm for loading packages, be aware that you may get a warning which doesn't mean it wasn't successful.
Can I go through that again? So basically, the first task R needs to do is to execute this require() function, i.e. it tries to load the library. If it notices that loading fails, it goes into the body of this if statement, installs the package and then explicitly loads the library. By that time the library is loaded, so it goes back out to the containing command, and the containing command then says: well, the last thing we were doing created a warning, so let's put out the warning now. So it deferred issuing the warning rather than printing it immediately when it was raised, and that deferred warning then appears. However, by the time it appears it's no longer current, because not only is there such a package, we've installed it and loaded it by that time. Now, quietly is supposed to suppress this, but it only does so when I run this automatically in a script. When I run it interactively, quietly doesn't work in the same way, so I still get the warning printed. You see this and you might think something went wrong, but in fact everything is okay: I can go to Packages and see that RUnit is installed and loaded. In the next
section of my script template I define functions. In the next unit I've created a separate file for a function template, but in this section you would source external files or define functions. Nothing actually happens here except that you go through code that makes functions available, and we'll be working more with functions as we go along. A typical function, again, is commented: its purpose, its parameters and the resulting values. And then comes the actual process, the step-by-step process of your project, where you go through your commands one by one and run them, load data, transform data, create plots, write output and so on. There's a little section for function tests. Testing is really important, but that's all I'll say about testing for the purposes of this workshop. If you're interested in more on how to test and why to test, there's a very, very good supporting series on computing and programming literacy called Software Carpentry. Software Carpentry has versions that are taught in R and versions that are taught in Python, and they talk a lot about version control, why version control, how to do version control, testing, and so on: all the little things that you need to ensure that the work you're doing doesn't just run but is hopefully also correct and reproducible.
Now, instead of typing things in the console, I type everything into my script. That way I have a record of what I've been doing, and I can easily go back, refine things, change things and rerun them. But of course, if I typed things here and then had to copy and paste them into the console, that would be very awkward and hard to work with. Fortunately, there's a different way: we can select things and then execute them. If you place your cursor in a line and then press command-return on the Mac, or control-return on Windows computers, and I hope this works on Windows computers, as I said, that particular line is executed. And if I select a whole line, again, everything that's selected is executed.
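To pull the pieces of the script template together, here is a compact sketch of how such a script might be laid out, including the require() construction for packages. The section names, the `scaleFactor` parameter and the `rescale()` function are invented for illustration and are not the contents of the actual template file. The `tools` package stands in for whatever package you need because it ships with R, so running the sketch installs nothing; with a package such as RUnit the install branch would run the first time.

```r
# scriptTemplate-sketch.R  (hypothetical layout, for illustration only)
#
# Purpose:      ...
# Version:      ...
# Date:         ...
# Author:       ...
# Input:        ...
# Output:       ...
# Dependencies: ...
# ToDo:         ...
# Notes:        ...

# ==== PARAMETERS ====
# Name every "magic number" here and say what it is for.
scaleFactor <- 7.1   # made-up value, as in the example above

# ==== PACKAGES ====
# Load the package; install it first only if loading fails.
# The name is unquoted in require() and library(), quoted in install.packages().
if (!require(tools, quietly = TRUE)) {
  install.packages("tools")
  library(tools)
}

# ==== FUNCTIONS ====
rescale <- function(x, by = scaleFactor) {
  # Purpose:    multiply a numeric vector by a scale factor
  # Parameters: x   numeric vector
  #             by  numeric scale factor
  # Value:      numeric vector of the same length as x
  x * by
}

# ==== PROCESS ====
result <- rescale(c(1, 2, 3))

# ==== TESTS ====
stopifnot(isTRUE(all.equal(result, c(7.1, 14.2, 21.3))))
```

Keeping the parameter at the top means a rerun with a different scale factor only touches one line.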
Now this is really useful, because R is a functional language, and often we take the results of one function directly as inputs for another function. The dir() function in this case gives me a directory listing of all files with dot capital R in their names, and I pass that directory listing to the length() function. If I want to unravel this and look at the nested structure, I can simply select the inner part and press command-enter; then only the inner part gets executed. In this case this produces a vector of three strings, i.e. the three filenames that match this pattern. Note that the pattern also matches the R project file; it's not only R files proper that are found here. The length() function then tells me that this resulting vector is three elements long, so I have three files which match that pattern. If you want to select more than one line, you can select an entire block. You've actually seen this previously when I used the if statement to load a package. So this entire for loop here is then executed in one go.
Okay, now I'd like you simply to be able to work with files: make a copy of the script template in your R resources directory. If you don't have an R resources directory, it might be a good time to create one, a place where you keep things that you often access, like a script template, a function template, or this file here, a PDF which is a short reference to R commands; things which thematically belong together as supporting your work with R.
One of the great things about R is that it has an enormous scope of possibilities and of functions that you can work with. This is also one of the downsides of R. Essentially, the problem is that there's so much to learn that it's pretty much impossible to remember everything about R that there is to remember. By the time you've studied, say, all the programs on Bioconductor that have something to do with large-scale genome analysis, so many new files have been
added and so many new options have been added that things become very difficult. So there's a few ways to work around that. The most basic way to get help is simply typing a question mark and the name of the command that you want help for. This is equivalent to typing help and then the command in quotation marks, but this is a shorthand for the same thing. This opens the help file in this pane here, and it tells you the proper command name and its correct parameters. This is then explained here: all the arguments or parameters are defined, and most importantly, it also tells you the return value, i.e. what that function produces. Now, you'll learn very quickly when you work with these that they're correct, but they're not very helpful. Often the text of these help files is very terse and very technical. I always see a little bit of a paradoxical disconnect there: they're not written for the people who actually need the help. This is programmer's documentation. This one is a relatively good one, but many of them... Here: "On a POSIX file system, recursive listings will follow symbolic links to directories." This is probably not what you were wondering about when you opened this command. So this may be the first and quickest way to get some idea, or simply to fill in some blanks, but it's often not that useful. First of all, it's specific to the exact spelling, and secondly, it's very technical. If you're not quite sure about how that command is named, you can search for entries with that string, and the search results then give you a number of options: the package where this particular function appears, and what it actually does. There's also this option here, the apropos function, which finds all the functions that start with dir. So you can put patterns here. So for example, these are things that start with dir. If I omit this, I get things that include dir somewhere, so for example this here, dir in the middle. If I put a dollar sign at the end, I get functions that R knows about that end with
dir. So this can be helpful if you say: was there something with CSV files somewhere in R, comma-separated values, isn't there something that reads them? Oh, there is, right. So there's functions both to read and to write CSV files. A rather large amount of information is available through the sos package. Essentially, what this does is it finds keywords in all R packages, not just the ones that are available on your computer, and it opens them in a tabular view in your browser. So simply try that at some point. I've put a number of other help functions and help options at the bottom of this page here, useful resources. Oh, I should refine that so it actually links to that. So there's a number of things that are really important. If you have a technical question about R, the R-help mailing list is very, very responsive. Many people are following that mailing list, and usually, if you're not quite sure how to code something or how to solve a problem, you will get excellent answers in very short time. Whenever you request help on a mailing list, however, if this is going to be a pleasant experience for you and for those who answer, there's a little bit of homework that you need to do. And that little bit of homework is: you need to structure your question and ask it in the right way. Try to imagine that somebody is asking you that question, and ask: do I actually have all the information I need to answer that question? Ideally, you would construct what is called an MWE, a minimal working example. So don't upload your BAM files, but make something very small, with maybe 10 lines, that produces the problem that you're encountering and that people can reproduce on their own computer. If they're able to go through 6 or 7 lines of code and reproduce the problem, you will find they go into puzzle mode, and they will drop whatever they're doing currently and have a beautiful half an hour of procrastination and find the most elegant way to solve your problem for you, with everybody else in the community
to see, so the peers can then congratulate them and say it's a really nice solution and well done. And incidentally, this is also a great way to learn and to keep current with issues in R. So this is the R-help mailing list, and Stack Overflow works in a similar way. The key to getting good answers is a minimal working example, and spending some time asking yourself whether you've described your problem well enough. So if you're simply looking for things, I find these days I really don't need any other tools but Google. I just Google R and then whatever, and Google seems to have learned from my browsing habits that when I say R, I mean the R statistical workbench and programming language. And it's so rare these days that I need to go to the second page of Google results before I find what I need, it's amazing. If you have a legitimate question, chances are 99.9% that somebody else will have had that question before you and somebody will have answered that question. There's a specialized search engine for R topics, and that's especially useful if by chance the term that you're searching for brings up large numbers of irrelevant hits, which is possible if the function is named something that also has a colloquial meaning. Somehow you can get tons of irrelevant hits. Then you can go to RSeek, which is a specialized search engine only for R topics, at rseek.org. Most packages that we use are either downloaded from CRAN, the Comprehensive R Archive Network, at that link, or from Bioconductor. Bioconductor is a large collaborative project that writes and maintains computational resources, R-related computational resources, for molecular biology, especially large-scale genomics and proteomics. And one thing that's worth browsing over is the Bioconductor task views or the CRAN task view collection, where the large amount of functionality that R and Bioconductor provide is made available in a thematic arrangement. So if you want to know something about hypothesis testing or non-linear regression, you
are possibly going to be able to find good resources there. So these are just some links. By default, just Google for your answers, and that will usually help. So with that, I'd like to leave you for your coffee.
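For reference, the help lookups mentioned in this section could be typed into a script roughly as follows. This is only a sketch: the variable names are made up, the exact commands shown on screen are not recorded in the transcript, and the sos call is left commented out because it assumes the sos package is installed and an internet connection is available.

```r
# Open the help page for one command; the two forms are equivalent.
?dir
help("dir")

# apropos() lists the names of objects R knows about that match a pattern.
startsWithDir <- apropos("^dir")   # names that start with "dir"
containsDir   <- apropos("dir")    # names that contain "dir" anywhere
endsWithDir   <- apropos("dir$")   # names that end with "dir"
csvFunctions  <- apropos("csv")    # anything with "csv" in its name

print(startsWithDir)
print(endsWithDir)
print(csvFunctions)

# Keyword search across all R packages, not just the installed ones.
# Assumption: the sos package is installed; results open in the browser.
# library(sos)
# findFn("csv")
```

Leaving out the "^" and "$" anchors is what widens the search from names that start or end with the string to names that contain it anywhere.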