Francis wanted me to share these notes with you, first of all to thank the bioinformatics workshop, but also to say that all the lectures are going to be made available. It's going to be open access: you can share them and everything, but you've got to make sure that you follow these rules if you do so.

So today, well, today and tomorrow, the workshop is going to be about exploratory data analysis, and especially doing that using R. You're going to learn a little bit about R and also a little bit about statistics, and in particular something that I really like and that I think is very important in data analysis, which is what I call exploratory data analysis.

I would say right away that you're going to learn a lot in this workshop, but I don't want to fool you, and you shouldn't fool yourself: after a couple of days you're not going to be an expert in statistics, nor are you going to be an expert in R. But hopefully you will be aware of statistics: you will know what it is, you will understand the main concepts, you will understand that it is very important, and you will be able to use R and maybe learn R in a better way. So at the end of this workshop, what you will have learned is that R is a very good tool that is not that difficult to learn, and that statistics is very useful and not that difficult either. Hopefully this is going to push you to take more statistics courses or learn R by yourself.
As I say, the goal will be, first of all, to display statistical information properly, in particular using exploratory data analysis. Then to understand the main basic concepts of statistics, such as: what is a p-value? Should I do a two-sample t-test or a paired t-test? Do we need multiple testing, and what is multiple testing? We're going to touch on all of these issues; we'll probably go over all of this today, actually. And you're also going to get a first exposure to the R statistical language. R is a really nice language: it's free, it's open source (I'm going to say a little bit more about that), and it's not very difficult to use even if you don't really know how to program or you've never used a programming language before.

So here's the outline for today. First we'll start with the basics of R, and we'll go on to some very toy-like examples where you will use R, and we'll see how we can deal with a couple of things. All along the workshop we'll be using R, but then we're going to use R to do more statistical analysis, first using the graphics methods of R for exploratory data analysis. Then we'll look at one- and two-sample t-tests, multiple testing, and the bootstrap. Tomorrow morning we'll talk about multivariate exploratory data analysis using the singular value decomposition, or principal component analysis, which is the same thing.

OK, so before starting I wanted to tell you a little bit about statistics and why it is important. If you look at the New York Times over the past six months or so, you will see that there were a couple, actually three, really nice articles that show that statistics is very important. The last one to get published was about the Netflix contest. If you don't know about that contest, basically what Netflix wanted to do was to improve the user ratings and the categorization of new films so that it can make better
recommendations for people renting movies. So what they asked people to do was to come up with statistical models or statistical methods to improve these recommendations and ratings. I can't remember exactly, but I think you had to improve on the current ratings and recommendations by at least 10 percent, and there was a one-million-dollar prize for the team that could achieve that 10 percent improvement. Just a week ago the prize was announced, and there was a team that actually won. That same team tried a couple of years ago, and they couldn't improve it by 10 percent; they could only improve it by some percentage, and they won some kind of prize, but this year they actually managed to do it. And guess what: this team was made up of several statisticians. One of them is Chris Volinsky; he was a PhD student at the University of Washington and worked with Adrian Raftery, who was also my advisor at the University of Washington. So there are really many real-life examples where you can see that statistics is very important; it can really help you a lot in doing things that people probably couldn't do without it.

Here's another article; I don't know if you saw this one. I can't remember the date, but I would say it was published maybe three months ago or so. It was about a graduate student in archaeology, and when she finished she took a job at Google. A lot of people think that in archaeology you just go to some places and look for bones and things, but in fact you deal with a lot of data, and a big part of the job is to do data analysis. So the article was saying that statistics is going to be very important; it's going to play a key role in many fields in the next decade, not only biology and bioinformatics but lots of places
where people are trying to gather lots of data, and it's very important to be able to make sense of these data. It also tells you, I think it says somewhere in the article, that if you've got a PhD in statistics you can make a hundred and twenty-five thousand dollars a year if you work at Google. So maybe I should work at Google.

Here's another interesting article, this one about R, again published in the New York Times. It was basically saying that R is a great tool: it's open source, it's open access, it's free, you can download it. I'm going to give you a little bit of history on R, but it started almost as a fun project, like a toy project. This is Robert Gentleman and Ross Ihaka; they were at the University of Auckland in New Zealand, and they started to work on that project just kind of for fun. They wanted to develop a tool that their students could use, and it has become so important, and so many people use it, that they probably never imagined that many people would use R. Many companies now use it, such as Google, Pfizer, Merck and so forth, and there was a nice article summarizing that in the New York Times. If you've never read these articles, I really encourage you to do so; they're very interesting. So these are three examples just to show you that R and statistics are very important in today's life.

OK, so let's see a little bit of history, because a lot of you have heard of R but maybe you don't know too much about where it comes from, how long it's been around, and so forth. This is a nice sentence that probably doesn't tell you very much: R is the son of S. Well, that's great, but I don't know what S is. S is a statistical programming language developed by John Chambers at Bell Labs, and this is a sentence from John that says the goal of S was to turn ideas into software quickly and faithfully. So the idea was to
have a programming language that would be easy enough for people to use without knowing too much about programming, and with which you could deal with data and data analysis very efficiently. S was created in 1976, but at that time it could only run on a specific operating system, it wasn't very friendly, and you couldn't do very much with it. As you can imagine, it was created at a research lab and mainly for their own research; it wasn't really meant to be mainstream software, but it turned out to be very important, especially for statisticians. In 1988 a new version of the S language arrived, and it introduced many changes compared to the original S language: you had functions, and you could use it on many operating systems, such as Unix servers and so forth. There was also the famous Blue Book, the blue S book, that people use a lot; it's a good reference with a lot of good things about S.

Version 4 was introduced in 1998. Again, that probably doesn't tell you too much, but R is based on that version of S, so if you write something in S version 4 it will work in R; you can probably just copy and paste it and it will work. Version 4 also introduced a formal class and method model. I'll tell you a little bit more about classes and methods as we go along, though it's a little bit out of scope for this workshop.

Now the bad news about S is that in 1993 a company called StatSci, the makers of S-PLUS, acquired an exclusive license to S. This means that after 1993, if you wanted to use S, you had to pay for a license. Of course, it was a company, so they made a nice interface, you had a GUI, and you had full customer support. The company, which was actually created before 1993 and which was called Insightful, was created by a statistics professor at the University of Washington called Douglas Martin. And now I think S-PLUS was bought by another company, the one making Spotfire, just a couple of
years ago.

OK, so what about R? R was created by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand, and the goal of R was to create a statistical language that, first of all, would be free and open access, so you wouldn't have to buy a license, and that would be easy to use for their students when they taught statistics. There are a lot of stories about Robert Gentleman. Let me see if I have the date when he actually started. Yes, it started in 1991. At that time Robert Gentleman was already coming to Canada quite a bit, and he was talking about R and saying to people, "Oh yeah, you know, I've started this new project; we're going to just rewrite S and start the R language." And people would say, "Are you crazy or what? Why are you doing that? There's the S language and S-PLUS. This is such a waste of time, trying to rewrite a language from scratch." So people didn't really believe in it, but it turned out to be a great idea, because the fact that it was free and open access made it much better than S-PLUS, and I'm going to tell you why.

It first appeared in 1996 as open-source software. At that time it was still a little bit rough; you could use it, it was nice, but there were a lot of things that could be improved. The fact that it was free and open source made it highly customizable via packages; that is, people could just write packages and contribute to the actual software. So the power of R is that it is based on a community: you can collaborate with people, you can write code that people use, and you can bundle your code into packages that are freely available. There are websites where you can download these packages, and everything is free. You also always have the state-of-the-art statistical methods, because the people who contribute to R are actually researchers in statistics
or in the field, so you're always going to get the best statistical methods you can get. Of course, there also exist commercial variants that are built on R, but these are just companies that package it in a nice way and sell you some customer support with it; it doesn't actually change R very much.

Alongside R, there's another project that you have probably heard of, called Bioconductor. Bioconductor was started by Robert Gentleman. Even though Robert Gentleman started R, he's not really involved in the core development of R nowadays; there's a good team that works on making R even better and making all these changes. Robert Gentleman has switched focus: he was working on Bioconductor until very recently, and he's still working on it a little bit. Bioconductor is based at the Fred Hutchinson Cancer Research Center in Seattle. It's a nice collection of packages for the analysis and comprehension of genomic data, going from microarray data to high-throughput sequencing, flow cytometry, high-throughput content, and so forth. Because R is free, so is Bioconductor; it's open source, and of course it's open to outside contributors, which means it's open to you: if you want to contribute something, you can. The difference from R is that there are more standards. When you submit a package to Bioconductor, people will check that the package works, that it installs properly, and that there's nice documentation. So there are a few more things you have to go through when you want to submit a package to Bioconductor.

I just want to say one more thing: even though Robert Gentleman started Bioconductor at the Fred Hutchinson Cancer Research Center, he has since left the Hutch. He's now at Genentech in San Francisco, where he started a computational biology
research group there. But he's still very much involved in Bioconductor, which is still up and running. There are a lot of people at the Hutch who work on it, and the person in charge of the project now is Martin Morgan.

OK, so what is R, and why is it a nice platform for statistical analysis? First, it's easy to handle and store data in R. It's nice for doing calculations, and by that I mean mathematical calculations using matrices and so forth, just like any other mathematical software package. Of course it's better than that, because it's geared towards statisticians, so you have a lot of built-in statistical functions that you can use for data analysis, and we're going to see some of those. You also have great graphical facilities for data analysis, and you can either display graphics on the screen or make PDFs or other formats that you can use for research papers, reports, whatever. So I really encourage you, once you're done with this workshop and you're able to graph a few things in R, to use R for graphical display and as a tool to show the results that you have. Hopefully after this workshop you'll know that R is a lot better than Excel and that you can make very pretty graphics with R, even better than with Excel. And I think the key point of R is that it's a simple and rather effective programming language; that is, even if you don't know too much about programming, you're going to be able to use it.

OK, so that's it for the history. I would like to give you a few references as well. This is a very nice book, an introduction to statistics; I've been using it quite a bit, and I've copied a couple of its examples into the lecture today. So if you want to know more about R and you want a good book, this is a good reference, and it's not very expensive either. Sometimes, you know,
you're using R and you forget some of the commands, and you would like a reference card. There is a reference card available from the R website, so you can download it and keep it beside your desk, and every time you're working you can look up the commands that you don't remember. This is a nice tutorial that you should have looked at already; there are a couple of nice examples in it, and it was more for you to play with. And of course there are a lot more resources on the R project website and Bioconductor.
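
As a minimal sketch of the R features listed earlier in the talk (storing data, matrix calculations, built-in statistical functions, and graphics written to a PDF), here is a short example. It is not taken from the lecture: the numbers are made up, and the names `control`, `treatment`, `dat`, and `out_file` are hypothetical, chosen only for illustration.

```r
# Hypothetical toy data: made-up measurements for two groups.
control   <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3)
treatment <- c(6.2, 6.8, 5.9, 7.1, 6.5, 6.9)

# Storing data: a data frame keeps the values and group labels together.
dat <- data.frame(
  value = c(control, treatment),
  group = rep(c("control", "treatment"), each = 6)
)

# Matrix calculation: build a 12 x 2 matrix and compute X'X (2 x 2).
X   <- cbind(1, dat$value)
XtX <- t(X) %*% X

# Built-in statistical functions: group means and a two-sample t-test
# (t.test uses the Welch version by default).
group_means <- tapply(dat$value, dat$group, mean)
tt <- t.test(value ~ group, data = dat)

# Graphics: send a boxplot to a PDF file instead of the screen.
out_file <- file.path(tempdir(), "boxplot.pdf")
pdf(out_file)
boxplot(value ~ group, data = dat, ylab = "value")
dev.off()
```

Typing `group_means` or `tt` at the prompt prints the results; dropping the `pdf()` and `dev.off()` lines shows the plot on the screen instead.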