So hello everybody. I've started the recording, FYI. We have the usual CC license at the beginning of the talk; in addition, I encourage you to copy, share, take pictures, blog, and tweet, as long as you acknowledge where you got it from. This is Module 2A, about cloud computing and Amazon, with the same type of disclaimer John had, plus my contact information and the tags for those of you who use Twitter, and the learning objectives for today's lecture. To introduce you to cloud computing we're going to use the wiki for the workshop, as Michelle just mentioned: how to log into the cloud, a review of databases used in bioinformatics, a look at some cancer genome data, and a bit of time on UCSC and IGV. How many of you, before you registered for this workshop, had never logged into a command-line prompt? Don't be afraid. Just a few hands, okay, this is good; you'll become pros after this. Now, a bit of background on why we're using the cloud, which John alluded to. These growth curves were presented in a paper by Lincoln Stein, who is the director of informatics at OICR and a long-time bioinformatics guru, although he once predicted the term bioinformatics would disappear by about this year, so some gurus are wrong sometimes. This is the growth curve of data before next-gen sequencing, this is the growth of hard-disk storage capacity relative to cost, and this is the curve since next-gen. So in the years to come it is going to become more expensive to store a nucleotide than to actually sequence one, and that's before you count the compute and the people; that's the challenge.
Of course, one reason you still want to store the data, even though that's more expensive than just re-sequencing, is that many samples, especially the ones we deal with in cancer genomics, are limited. We don't have unlimited amounts of DNA, so once a sample is sequenced we want to keep the data around to analyze for a long time. We talk about the $1,000 genome, and soon maybe the $100 genome, but that's just reagents; it doesn't count the people doing the work. Compare the doubling time of sequencing with the doubling time of CPUs, and, as I mentioned, the cost of sequencing with the cost of storage. So in general the challenges we face are lots of data and poor IT infrastructure in many labs. Places like OICR or the BC Cancer Agency have dedicated small armies of people maintaining infrastructure, but most labs don't. The cloud is a solution maintained elsewhere, by a commercial entity or as a private cloud, and it's not a question of if we'll need that kind of infrastructure but when. So where do we go? We can write more grants, buy bigger hardware, or look to the cloud. There was a company, which unfortunately doesn't exist anymore because it was bought by another company, whose business model was to take DNA, sequence it, ship the drives to Amazon, and then let people look at and compute on their data at Amazon. So there are real business models here, and very secure ways of doing it: encrypting what's in the truck, encrypting what's stored at Amazon, and two levels of security with key fobs and so forth.
I would argue that Amazon is probably even more secure than some of your institutional IT infrastructure. Many people are already using the cloud without knowing it: Google Docs, Dropbox, Netflix, and Twitter are all web services that run on the cloud. The way Amazon can serve this kind of activity is by having a very large number of football-field-sized data centers all over the world: three or four in the US, plus Ireland, Australia, Singapore, and a few others I forget. They're not single buildings but multiple football fields, and capacity arrives as container units with all the HPC hardware, air conditioning, and power connections already inside; they just bring one in, plug it in, and add a thousand cores to the data center, and they do that all the time. Of course it's not all easy. There's the challenge of getting files to and from there, because you're still limited by the slowest network link between where you stand and where the web service is located, so it's not necessarily the best solution for everybody. There's also a lack of standardization in cloud infrastructure: Amazon has standardized the way it does things, but Google and Microsoft do things differently, with different tools. And there's the issue John mentioned about personal health information and security concerns: in the US there's the Patriot Act, which allows the government to look at anything without a warrant if it suspects terrorist activity. So if you're sequencing the tumor of a terrorist's cousin, they may come and want to look at your DNA.
I don't know that this has ever happened, and I'm not sure it's something we should actually worry about, but it's a concern that gets brought up every time we talk about this and every time we try to open things up. Personal health information is a real concern, and we've arranged things, circumvented isn't quite the right word, so that it isn't a concern this week: we're using data sets that are fully consented, so they're essentially public, and although they're from human cell lines they look like real data; and, as John mentioned, we're working with slices of it, smaller parts. One of the advantages: for this workshop we actually got a grant from Amazon, so we're getting the web service this week for free, free in the sense that we wrote a grant and received Amazon credits so we can run this class. But it's very smart of Amazon; it's like giving crack to babies. It'll get you hooked, and when you go home they'll ask for your credit card, and then they'll be making money, so keep that in mind before you get too excited.
There used to be, I think, a five-gigabyte file-size limit at Amazon, and they've changed that; they've become much more bioinformatics-aware, and there are some very senior people at Amazon who understand next-gen sequencing technologies and the kinds of things we need to do. And again, to lure you in: they used to charge for the transfer alone, never mind doing any computing on it, but now they don't charge for uploading, so you can bring all your data up to Amazon for free; it can just take a while. Keep in mind they will still charge you for storing it, before you do any computing. The other thing they say is that they're really good at shipping, and Amazon does know how to ship things around the world, so if you have really large, terabyte-scale data sets you can ship your drives to Amazon and they'll hook up your data the same way that other company did. There are also data sets already on Amazon, like the 1000 Genomes data set, which they host for free; they saw it as an international resource of interest to a lot of people, so they store and maintain it at no charge, and you can go compute on the raw 1000 Genomes data that's available there.
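Free upload still isn't fast upload: as noted above, you're bound by the slowest network link between you and the data center. A back-of-envelope sketch of what "it can take a while" means; the data size and link speed here are illustrative numbers, not anything Amazon quotes:

```shell
# Rough upload time: 1 TB (decimal) over a sustained 100 Mbit/s link.
SIZE_GB=1000       # data to move, in GB
LINK_MBIT=100      # sustained link speed, in Mbit/s
# 1 GB = 8000 Mbit, so seconds = GB * 8000 / (Mbit per second)
SECS=$(( SIZE_GB * 8000 / LINK_MBIT ))
echo "$SECS seconds, or roughly $(( SECS / 3600 )) hours"
```

At those numbers it's about 80,000 seconds, close to a full day per terabyte, which is why shipping physical drives can win for large data sets.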
They also make what we're going to be using: an Amazon Machine Image, or AMI. Some of these images are pre-built; for example, there's a CloudBioLinux image you can look up and launch that comes with, from the command line, all the next-gen sequencing tools you'd want, BLAST, Perl libraries, and a bunch of other things, everything you'd want available if you were going to set up a bioinformatics computer. The CloudBioLinux group maintains that AMI on Amazon for people to use. There's also a Galaxy image; I don't know if any of you have used Galaxy. We won't use Galaxy this week, but, thanks Michelle, there's another workshop, our two-day high-throughput bioinformatics workshop, where we do Galaxy, and there we bring up both the same AMI we're using here and the Galaxy one. In fact we have a CBW AMI into which we've put all the tools, and we maintain it at Amazon, so at the end of the workshop, with your own credit card, you can look up the CBW AMI and launch the same image with the same tools you used this week; keeping an image at Amazon is not too expensive, so we're happy to keep it there year-round for the class. And while we keep mentioning Amazon Web Services, there are many other flavors of cloud, private clouds, commercial clouds, and so forth. By far, I'd say the Amazon one is the best maintained; it's got the best tools, the most user-friendly interface, the best documentation, the best support, and so forth, but it comes at a
price. So in this workshop you'll have some tools on your computer, some tools on the web, and some things on the cloud, and you'll traverse all these spaces quite easily, becoming efficient at figuring out what's the best tool for each job. There are different ways of using the cloud: from the command line, basically your very own Unix box, which is what we'll do in this workshop; or through a web browser interface like Galaxy, which we won't do this week but which is also possible. You can run Galaxy on the public server in Pennsylvania, on your own server in Toronto, or on the cloud; there are different versions. So here's what we did: we loaded data onto an S3 bucket, brought up an Ubuntu Linux instance, loaded a whole bunch of software for next-gen sequence analysis, and then cloned it into separate instances for everybody in the class. When you log in you'll have your own virtual machine to yourself, and you won't see other people's data. That said, you're all using the same key, so it's not very secure from that point of view, and it's not the way you'd normally do it: normally you'd have your own key, private to you, and only you would know where it is on your computer. So don't go home and say, yes, let's all share a key; that's not the way you do business, because your key is effectively your credit card number, and you have to guard it the same way you'd guard financial information. This week we'll all have the same key file name, with some exceptions we'll talk about; normally it would be more secure. So why are we concerned about security? We touched on this, and John talked about it too. Take the example of
the ICGC, the International Cancer Genome Consortium. We have a lot of somatic mutation data, which is in itself not identifiable; unless you have that person's DNA, that person's tumor, to match against, a somatic mutation identifies nobody. It's not random, it's a set of marks in the DNA that cause the tumor, but unlike germline variants it isn't identifying. Germline variant data is considered identifiable, meaning you can identify the individual, and to get access to it, at an NIH site like dbGaP, at the EBI, or at the ICGC, you normally have to ask permission: you have to prove you're a real scientist, that you'll do good things with the data, that you have compute infrastructure that will prevent people from stealing it, and, if you screw up, you also need the signature of somebody who can fire you. The DACO office, or dbGaP, knows that person's name and knows whom to contact if you need to get fired. So if you do something bad, like trying to re-identify a sample or giving the data to people you're not supposed to, they can come down on you with that kind of enforcement. This is not a concern for us this week, because the data we're looking at is all public. But in general, when you want to look at sensitive data, data with germline variants or clinical information that can be used to identify the individual, you usually need special permission. On the ICGC side there's all of this, the open data, a lot of which we're actually going to work with today, and all of this, the controlled-access data, for which you have to get special permission; I'm going to talk a bit more about it this
afternoon, after lunch. So this is the website; maybe you should all log in now if you haven't already. The second link is our workshop page, which has oodles and oodles of information: how to set up your laptop, the tutorial you've all done, the readings you've all done, and the Day 1 information. I'm a Mac person, and a Mac is really just a nice Unix box with a nice graphical user interface; looking across the class it's about half and half, so this slide is Mac-specific, and I have a few Windows slides that somebody made for me. We're going to use the Terminal. The Terminal app is hidden from most people, and there are alternatives you can use, but it lives in the Utilities folder inside the Applications folder: Terminal.app. Once you run it I usually click Keep in Dock, so it's permanently on my desktop; double-click it and you get a window with your command-line prompt. This is actually an old slide from another machine: this machine is beagle8 and the prompt says beagle7. All my laptops are named beagle, because of Darwin; the Beagle traveled, and I've had lots of beagles. Anyway: do an ls, then make a directory for the workshop, mkdir cbw, cd cbw, then ls -l and ls -la, a long listing showing all the hidden files. Basically this is an empty directory, and that's my starting point. Does everybody know how to do this, and has everybody done it? So far so good. Okay, switch to the wiki page; there's a heading, Logging on to the Amazon cloud, a crash course on
the cloud. There's the cloud lecture, which is this lecture right now, and you can download it and keep it. We have, I think, 40 or 50 people, and here is where the world divides between the Mac and the PC people, so I'll do the Mac side first. Anybody here on a Linux box? Yes, right, Mac and Linux go together. On the Mac you control-click the link, the pop-up comes up, and you save the file; after you've done this you should have CBW.pem, or on the PC the file is called .ppk. So on the Mac, CBW.pem. One instruction we tell you to do, and the command is detailed on the wiki, is to change the permissions: the command is chmod 600, or you can do 400 as well, then the file name. If you do ls again the file is still there, but if you look at the permissions they've changed: they used to be read-write, read, read, and now the file is read-write by the file owner only. If you look at what this file is, cat followed by the file name, it's a text file, basically a long password. We all use this file to log in; you have to have it to log into the cloud, and we've told the cloud to expect it, so everybody is getting the same file, though normally you'd each add your own. A quick crash course for those of you who don't know about permissions: a long listing shows rwx rwx rwx, three triplets, giving the read, write, and execute permissions for the owner of the file, for the group the owner belongs to, and for what I refer to as the world, everybody else on the
network. It's not really the whole world, that's old Unix terminology; it's everybody else on the network, not necessarily people in your group. If you assign integers, read is 4, write is 2, and execute is 1, and you add them up; whatever sum you get, you know which integers were added to reach it. Four plus two is six, and the only way of getting six is read plus write; four plus one is five, and the only way of getting five is read plus execute, and so forth. So when I say chmod 600, I'm saying read and write, four plus two is six, for the owner, and nothing for the group and nothing for the world; as I put on the slide, the permissions go from 644 to 600. Everybody got that, everybody understands? Sorry, I'll repeat: after you run the chmod command, a long listing, ls -la, should look like this one here, read and write by the owner only, not by the group and not by the world. This is how it was before, at the top, and this is how it looks after the chmod, rw and nothing else. If you don't have that, it will fail when you try to log in, so if it fails I'll know you didn't do it; I'm asking people to do this chmod as we go along. You want me to pause? Okay, I'm happy to pause and take a sip of my coffee. Good point, yes: I made a directory off my home directory, a cbw directory, and that's where I'm putting everything today. It doesn't have to be that way, you can do it whichever way you like, but it's an easy way. You go to the command line, go to your home directory, and create a directory called cbw: mkdir, make
directory, space, cbw. In Unix, case matters and space matters, so don't forget these things: upper versus lower case is very important, and so are the spaces; there's actually a typo in one of my slides where a space is missing. Windows people, you can look ahead a few pages in the notes; actually, let me skip to those pages right away. Windows people, you were supposed to download and install a program called PuTTY ahead of time. These are the configurations for PuTTY, from last year, so this is the wrong file name: where it says 2012 it should be 2013. You don't have the .pem file; you have CBW_2013.ppk, or whatever it's called. If you go back to the wiki, there are two certificates, one after the other (they might not line up, depending on how wide your screen is): the first one is for Mac and Linux, the second one is for Windows. It's important to get this right, because if you don't, it's going to be a long wait. So far I've told you to do two things: make a cbw directory, and download this certificate. It is a text file; yes, you have to rename it so it has the correct .ppk extension. Your PC is trying to be too smart, trying to outsmart you by hiding the extension; yes, they both try to outsmart you. Does everybody have it? No? Okay, that's why you have the very
experienced TA there with you; you'll see, we'll be doing great things once we're past this first hurdle. ("It was easy on Windows." It was easy on the Mac too; to each their own, whatever you like.) For the PuTTY configuration, on the first screen, the host name uses the number behind your badge: you're not putting cbw01, that's for this instructor; you put your own number. So it's cbw, your number, .ssh01.com; that's the name of the instance you're going to log into, and the connection type is SSH. Everybody uses the same pattern. You save the session and call it cbw. The username will be ubuntu; we're all calling ourselves ubuntu. The next screens are probably the default settings, everything else left blank; then you browse and select the .ppk file with the right number, the one you downloaded from the wiki, again with default settings. What does it say? Login as: ubuntu. (I wish I'd practiced more on the Windows box, guilty. Is it looking good? Okay.) Now, Mac people, if you want to try logging in, go to this page; from the command prompt you type ssh, space, -i, space, then the key, the CBW
key.pem, space, then ubuntu, which is your username, we're all called ubuntu, at cbw, your number, the one behind your badge, .ssh01.com. That cbw-number.ssh01.com part is the machine, so the whole thing is user at machine. We set up 40 machines, labeled by number, and if your number is one through nine, don't forget to put the zero before the digit; it's on your badge. (Don't copy anything else yet, don't worry about that.) Okay, good, who does not have it yet? I think we have one left, and a Windows person is helping; okay, very good, sorry about that, Linux and Windows people. So if you do this, you should be able to log in; let me actually do it myself. I'll escape my PowerPoint and switch over. This is my prompt: I type ssh -i CBW.pem ubuntu@cbw27.ssh01.com, 27 being my number, and there we go: I am now ubuntu at this IP address; this is me. If you have something like that, you have success: you are now on Amazon. Aren't you excited? I'm very excited. So let's try ls -la, whoa, let me take that over: clear the screen first, then just ls -l. There we go. That's a long listing of the files and directories in my home directory, because when you log in you start in your home directory. I have a bin directory, I have course_data, and look at the first column; is everybody listening? The first column tells you if an entry is a file, a directory, or a symbolic link: bin is a directory, l marks a link
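The key-permission and login steps we just walked through can be sketched end-to-end as below. The key file name, the zero-padded badge number, and the cbwNN.ssh01.com hostname pattern are this workshop's conventions, so substitute your own values; the final ssh command is echoed rather than executed, so you can inspect it before actually connecting.

```shell
# Working directory for the week
mkdir -p cbw && cd cbw

# Stand-in for the key file downloaded from the wiki
touch CBW.pem

# Owner read+write only: 6 = 4 (read) + 2 (write); group and world get 0.
# ssh will refuse a key that anyone else can read.
chmod 600 CBW.pem
ls -l CBW.pem                       # shows -rw-------

# Build the login command from the number behind your badge (zero-padded)
NUM=27
HOST=$(printf 'cbw%02d.ssh01.com' "$NUM")
echo ssh -i CBW.pem ubuntu@"$HOST"  # prints: ssh -i CBW.pem ubuntu@cbw27.ssh01.com
```

If the chmod step is skipped, ssh fails with an "UNPROTECTED PRIVATE KEY FILE" warning, which is exactly the failure mode mentioned above.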
and you can see it's a link because the listing shows course_data with an arrow pointing somewhere else; it's a symbolic link to that directory. We made it easier for you: if you want to see the course data, just go to course_data; you don't have to type /media/tjv_data/course_data. It's just a way of making things easier. How much is there? 120 gigs, so it's not that much. As I mentioned, at OICR we have almost four petabytes, three and a half, so the gigabytes add up, and it sometimes takes days to copy a file from one place to another. Let me go back to my slides. Now, if you wanted to copy files from an instance, here's one way of doing it (it's not very clear on the screen): scp, secure copy, a secure way of copying files. You use -i with the same key file, then the login and host you want to copy from, a colon, and the path of the file you want, and then a period, which means here: you copy from there to here. Don't do it now; I'm just telling you how you would do it. The other very useful thing we set up, which you should try right now: go to your browser and type http://cbw, your number, .ssh01.com, and you'll get a browser view of all your files. It's very convenient: if you want to copy files from your Amazon account to your laptop, just go there, click on a file, and save. Again, we opened up all the ports to make this easy, so the whole world could be looking at it; there's not much interest in them doing so, and they can't put anything there, but they can look, so don't put any personal information there. But
basically, this is not the way you would normally set things up on Amazon; it's the way we set it up for this week, to make it easy for you to transfer files and so on. It's all very well documented on Amazon, and if you want more documentation or more help, you can pay for it; they're very happy to provide that service for a fee. There's the Windows version of the same thing. At this point your laptop is actually ready for the workshop; if it's not, you know where to get the information you need, or you just don't get any lunch. You know where the wiki is, and the wiki will be used by every faculty member this week: the worked examples, the questions, the answers, everything will be on the wiki, so make sure you're set up. On a Mac, and I'm sure you can do this on Windows as well, I use multiple desktops: one desktop per task, switching between them and copying and pasting from one to the other. It's very convenient, especially now that I have less screen resolution. So have the wiki set up, and know where all the lectures are: they're in your book, but they're also on the wiki electronically, so you can download and save everything. I've also added pre-lecture reading material that you should read over lunch, no, just kidding: I've added some more papers to the list, all open access, so you can download them, keep them, and use them; I'll be referring to them, and you can read them later or ask me questions about them later. And you now know how to log into AWS. Any questions? You won't need your laptop for this next part, so you can flip your laptops down unless you need them, as Michelle mentioned. What I'm going to talk to you about now, for the next 45 minutes until
lunchtime, and Michelle is going to wave at me when I have to stop, is databases and visualization tools, and mostly the importance of databases, not only in cancer genomics but in all of bioinformatics, as we do it worldwide, in all activities and not just cancer. Usual disclaimer, same learning objectives. I'm going to take a little step back, because I think it's important to set the right context: the inferences we make when we interpret biological data and do bioinformatics are always in the context of evolution, and if it weren't for evolution, none of this would make sense. There's evolution at the species level, there's evolution within a species, and of course there's also evolution within a tumor: the mutations a tumor acquires give it growth advantages that let it grow and outdo the cells around it. You have to keep that in mind; I've adapted the famous line as "nothing in bioinformatics makes sense except in the light of evolution." So why do we have bioinformatics? The reason is open data. If we didn't have GenBank, for example, things like BLAST would never have been invented: BLAST exists because we needed a way of searching through the growing availability of open DNA sequence data in the early 80s and 90s, to find things that were similar to the things we were looking for. The main reason something like BLAST is used is to find similarity between sequences, from which we infer function; it's the structure-function relationship we have been
doing in bioinformatics for 20 or 30 years now, and actually longer than that, more like 50 years, since protein family alignments were already being done in the early '60s.

So how do we define bioinformatics? I'm going to ask you to pair up with the person next to you and write down, in 140 characters or less, your definition of bioinformatics. Okay, well, here's mine; it's actually more than 140 characters: it's about integrating biological data with the help of computer tools and biological databases, and gaining new knowledge about the systems under study. That's what you said? Yes, you must have read my book.

As for what a bioinformatician is: different people have different definitions. Are they tool developers, tool users, website users? I'm very much a generalist, and I include all of those people as being involved in computational biology or bioinformatics. Even the difference between "bioinformatics" and "computational biology": some people have gone to war over those two terms, and I don't go there; I try to use them interchangeably. Yes, that is a very slippery slope you're entering, my friend. Are you writing up a CV for a job application or something? There are institutions that will specifically seek out a "computational biologist," and lots of people have drawn distinctions between technology work, scientific endeavor, and algorithm development; that said, I've seen it go the other way as well. I prefer to be inclusive and bring everybody into the fold. That's definitely the
way I look at things.

So one of the things about doing a bioinformatics experiment is considering the database as one of your reagents. It's a way of thinking about a computational experiment as having reagents: you do things, you record them, and so forth. A database, as one of those reagents, is an organized collection of information: a place where you put things in and, if all is well, can get them back out again. There are lots of databases where people load data and then can't find it again; that's not a very good database. A good database is also a resource for other databases and tools to use: if it has an application programming interface that lets other tools, and other people through the web, get things out, that's a useful database. And the bonus is when it allows you to make discoveries, to find associations between things you didn't know existed beforehand; that's a very well-designed database.

What's always important when you're using a database (or building one, but mostly when using one) is to understand how it organizes its information. What do the identifiers mean? If an identifier changes, what does that mean? What is the organismal scope of the database? Those are all important things.

So, a bioinformatics experiment: if you do a BLAST search, for example, you have to know your reagents, meaning the query sequence you're using and the database you're searching against. You have to know your tools and your methods: are you doing BLASTP, protein against protein, or TBLASTX, which translates your nucleotide query and searches it against a translation of
the database? Understanding the implications of using one type or another is very important. And at the end, with the alignment, comes interpretation: the similarity, the hypothesis testing, and so forth.

So it's important to know your reagents, to know your methods, and also to do controls. What kind of control can you do with a BLAST search? One type of control is: do I find the sequence I'm expecting, the one I know is there? If I don't find it, then maybe the database I'm searching against is not the right one; maybe the parameters I'm using (say, the defaults) can't find what I'm looking for. So understand the parameters. You shouldn't treat these tools (I'm using BLAST as an example, but it could be any tool you'll use this week) as black boxes, just sticking stuff in and looking at what comes out the other end; that's really bad news. It's critical in the bioinformatics experiments we'll do this week to understand what you're putting in, what's coming out the other end, and why things are behaving the way they are. And sometimes, if they're not behaving the way you expect, something is wrong: maybe Amazon's not happy, maybe you started with the wrong file or the wrong file format, maybe you're missing parameters or have the wrong ones. Those are all really critical things.

Another thing about databases: I remember (this is less prevalent now) that in the '90s and around 2000 people used to complain all the time, "Oh, GenBank is full of garbage," and so on. They would say it at meetings, or they'd say it in papers, but they wouldn't tell the GenBank folks.
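Coming back to knowing your reagents and running controls: that bookkeeping can be sketched in a few lines. This is a minimal sketch; the function names and record fields are my own invention for illustration, not any standard API, and the IDs are made up:

```python
def record_experiment(query_id, database, program, parameters):
    """Record the 'reagents' of a sequence-search experiment:
    the query, the database searched, the tool, and its parameters."""
    return {
        "query": query_id,          # the query sequence you used
        "database": database,       # the database you searched against
        "program": program,         # e.g. "blastp" or "tblastx"
        "parameters": parameters,   # record even the defaults you relied on
    }

def positive_control_passed(hit_ids, expected_id):
    """The simplest control: did the search find a sequence
    we already know is in the database?"""
    return expected_id in hit_ids

exp = record_experiment("my_query", "nr", "blastp", {"evalue": 1e-5})
# If the known positive is missing, suspect the database choice or
# the parameters before trusting any other result of the search.
print(positive_control_passed(["hit_A", "my_known_positive"], "my_known_positive"))  # → True
```

The point is not the code itself but the habit: every search you run this week should leave behind enough of a record that someone else could repeat it.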
It's really important to remember that GenBank, and all these public databases at NCBI or the EBI, are resources that are there for us to use. So it's critical that if we do find a mistake, we report it and tell them, because if you don't report it, somebody else is going to find the same mistake, or somebody else is going to misinterpret something. It's our responsibility as citizens of the worldwide bioinformatics and computational biology community to report these problems when we come across them. And sometimes it's not actually a problem; you just don't know how to use the tool, and if you report it, the database people will be more than happy to set you straight.

So, a quick overview of databases and the various layers we can think about them in. There's the data itself: a GenBank flat file, a COSMIC record, a protein-protein interaction record, the titles of books, the books themselves. There's the storage system: it could be a box, it could be Oracle or MySQL, a PC binary file, a Unix text file, or a bookshelf. There's the query system: a list, a card catalog, index files, Structured Query Language, or grep, the Unix tool that lets you search through text files quite rapidly. And on top there's the information system; think of the Library of Congress in the U.S.
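Staying on the query-system layer for a moment: grep, the humblest query tool in that list, essentially does the following. The three-line "flat file" here is invented for illustration:

```python
import re

def grep(pattern, lines):
    """Return the lines matching a regular expression,
    roughly what Unix grep does when scanning a text flat file."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

# A made-up, GenBank-flavored flat file (not a real record).
flatfile = [
    "LOCUS       EXAMPLE01   1234 bp",
    "DEFINITION  An invented record for illustration.",
    "ACCESSION   EXAMPLE01",
]
print(grep(r"^ACCESSION", flatfile))  # → ['ACCESSION   EXAMPLE01']
```

Plain text plus grep really is a legitimate query system; it's what a lot of early bioinformatics ran on, and it still works when nothing fancier is available.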
Google is an information system; NCBI, Ensembl, and the UCSC Genome Browser are information systems. These things are all complicated and highly structured, and to get the most out of each of them you have to go deep into them and understand what's happening.

The databases have been growing. This slide is from a few years ago already, and it's quite a lot of stuff to look at. If you compare with 12 years earlier, you can see that for nucleotide records there's been a 32-fold increase: from 4 million records in 1999 to 144 million records in 2011, and by now quite a bit more. I used to work at NCBI; I was there until '97, I think, and I remember we had a party when we hit 1 million records. When I started there in '93, six years before that 1999 figure, there were 300,000 records in GenBank. That gives you a sense of the scale of change over the years.

The tools have changed too. This page shows the number of records in every Entrez database, and here's an under-the-hood secret for getting it: you type the query all[filter], and you get back the record count for each of these databases. You should try that later.

Formats are very important. I'm not going to talk about the VCF format, because that comes later in the course, but VCF, the variant file format we'll look at, is based on older formats, older conventions that are really crucial to understand. Who here has not seen a GenBank flat file? Okay, good. It's basically the unit record in GenBank, with information about the whole record.
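As an aside, that all[filter] counting trick can be scripted against NCBI's E-utilities. The esearch endpoint and its db, term, and rettype=count parameters are real, documented E-utilities features; the little helper wrapped around them is just my sketch, and it only builds the URL rather than fetching it:

```python
from urllib.parse import urlencode

# NCBI E-utilities esearch endpoint (a real, documented service).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def count_query_url(db):
    """Build the URL that asks an Entrez database how many records
    it holds, using the all[filter] trick from the lecture.
    rettype=count asks for just the total, not the record IDs."""
    params = {"db": db, "term": "all[filter]", "rettype": "count"}
    return EUTILS + "?" + urlencode(params)

print(count_query_url("nucleotide"))
```

Fetch that URL (or paste it into a browser) and you get back the total record count for the database, which is exactly what the web page I showed you is doing under the hood.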
It will have a publication, it will have the organism, and it will have specific information about features in the record: say, a protein coding sequence, which is a segment within a messenger RNA; and then it has the sequence itself. So the header has the title, the taxonomy, the citation, the features, the amino acid sequence, and then the DNA sequence.

GenBank is organism-agnostic: it's not just human, of course, it's everything. There are even a few synthetic sequences that have snuck in over the years, things like constructed clones; it's been useful to have them in GenBank, and there's a separate division for those records, but they all have the same features.

The GenBank flat file we're looking at here is considered a human-readable format, yet many bioinformaticians worldwide have parsed, or tried to work off of, this file format, which was never meant to be parsable. It was never meant to let a computer identify all the various fields; it was always meant for humans to look at. But that doesn't stop bioinformaticians, and lots and lots of people have parsed these files. There are other, much better file formats, but they've never caught on; this remains the most popular one.

Even more popular than the GenBank flat file, though, is the FASTA file format, which has become the default sequence format. It's the same format for nucleotides and proteins, so there's no way of knowing which one you have until you look at the file. All FASTA files have a greater-than sign on the first line, followed by an identifier string, and then the sequence on the lines after. That's about the whole definition: a file can be a valid FASTA file with nothing else in it at all.
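The FASTA conventions just described (a greater-than sign, an identifier, then sequence lines) fit in a tiny parser. The second helper splits an NCBI-style gi|number|sp|accession|name defline; the record below is fabricated in that style for illustration, not a real database entry:

```python
def parse_fasta(text):
    """Parse a FASTA string into (defline, sequence) pairs."""
    records, defline, seq = [], None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if defline is not None:
                records.append((defline, "".join(seq)))
            defline, seq = line[1:], []
        else:
            seq.append(line.strip())
    if defline is not None:
        records.append((defline, "".join(seq)))
    return records

def parse_ncbi_defline(defline):
    """Split an NCBI-style 'gi|<gi>|sp|<accession>|<name>' defline."""
    parts = defline.split("|")
    return {"gi": parts[1], "source": parts[2],
            "accession": parts[3], "name": parts[4].split()[0]}

# A fabricated defline in the NCBI style; the accession is not real.
fasta = ">gi|12345|sp|P99999|FAKE_YEAST example protein\nMKTAYIAK\nDQLG\n"
(defline, seq), = parse_fasta(fasta)
print(seq)                                        # → MKTAYIAKDQLG
print(parse_ncbi_defline(defline)["accession"])   # → P99999
```

Notice there is nothing in the format itself that says whether MKTAYIAKDQLG is protein or DNA; as the lecture says, you only know by looking at the letters.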
And while that's fine, it's not very useful on its own, because you lose track of what's in there. So NCBI formats its FASTA deflines with more structure: this one has the GI number (GI means "gen info"), then "sp" for the Swiss-Prot database, the Swiss-Prot accession number, and then the Swiss-Prot name. So this file is the yeast GCN4 protein.

Sequences live in primary, archival databases: GenBank; PDB for protein structures; IntAct for protein-protein interactions. Then there are what I call secondary databases, which take that data and add another level of curation. RefSeq is very commonly used and will be used by us this week: RefSeq provides the reference sequences for the genome, and RefSeq transcripts are often used as references in genome browsers. There's the Taxonomy database, which covers all of taxonomy and is actually very hard to maintain: you need an authority, experts in the field, to tell you that this is this organism and not that one, and what the lineage is. For human that's not a big problem, because we're pretty much agreed upon; but once you start looking at marsupials and weirder organisms, and bacteria up the wazoo, it gets very complicated. NCBI maintains the taxonomic database for all the nucleotide and protein sequence databases, and they are the final authority with respect to sequence databases; they use the sequence itself to resolve discrepancies. Beak size may put an organism in one group, but the sequence says it belongs in another.

SGD is one of the model organism databases, for Saccharomyces; there's one for mouse, one for rat, one
for zebrafish, and so forth. These are all very critical for human work, because a lot of our human knowledge actually comes from the model organism databases. And OMIM, Online Mendelian Inheritance in Man, is basically a database linking diseases to genes.

Every January, Nucleic Acids Research publishes the database issue, and that's the best place to get references for all the databases. There are actually so many databases now that they don't all publish there every year; they publish every second year, so to get the full picture of the available databases you have to look at the last two years' worth of database issues. This one is from this January, and this paper, which I added to your reading list, is the NCBI resources paper; this one is from last January, and there are more papers.

I mentioned archival databases, so: GenBank is described here as the genetic sequence database of all publicly available sequences. NCBI, which produces GenBank, is part of a three-way relationship between the Japanese, the Europeans, and the Americans, where you submit to only one of them and it ends up in GenBank anyway. You can submit through Japan, it goes into DDBJ, and then it gets picked up by NCBI and the EBI; you only need to submit to one, not all three, and the three databases are essentially equivalent.

About GenBank: there are many file types, and many tags and features. I'm just going to touch on the few I think are important for us this week, but keep in mind there's a lot more that I'm not covering. There are organismal divisions, for example bacterial, primate, rodent, and so forth. Those are actually historical: they were just ways of limiting file size back when the files were being distributed.
It actually doesn't make sense anymore; that's why some of these come in hundreds of files. There are a hundred bacterial files, for instance, because the one file got too big and they broke it up into multiple files.

More important are what I call the functional divisions: ways of partitioning things into separate piles that made sense. During the Human Genome Project, for example, there was a mandate to release data within 24 hours of it coming off the sequencers, so a lot of data wasn't finished. A first assembly had been done, but it was just a bunch of pieces, even for the groups working BAC by BAC across the genome; your first run gave you a bunch of pieces. That went into HTG, the high-throughput genomic division, which holds unfinished genomes. If you see a record in there, it's still being worked on, still not finished; once it got finished, if it was human, it moved into PR, the primate division.

Same with ESTs, expressed sequence tags: short single reads from technology that generated three, four, five hundred base pairs at a time. They used to contaminate the databases until people figured out: let's put them all in one pile and compute on them separately. Then they became useful as a separate pile; mixed in among everything else they didn't make sense, but in their own pile they were useful. So a guiding principle in GenBank, and in a lot of databases, is that things are grouped together for a reason, and understanding why things are grouped the way they are makes a lot of sense.

Now, identifiers. This is actually one of the most critical things in my lecture today: to understand how important identifiers are
and what they mean. Let me skip ahead; there are different parts here: the DNA itself, the genes, the transcripts, the proteins, and so forth.

In a GenBank record (I'll skip this one), the first line is the LOCUS line, which carries a locus ID. That is actually a very bad ID; don't use it. Why not? Between GenBank, DDBJ, and the EBI, most locus IDs are the same, but sometimes they're not, so the same ID can have a different value in different places around the world. That's not very useful. The accession number, on the other hand, is a very good ID, because it's a unique identifier for this record. And some years ago they started adding version numbers, which are even more important.

So what does it mean when a version number changes? No, that's not it; sorry, nothing to do with variants. If a record goes from U40282.1 to U40282.2, what has changed? The actual sequence. That's all the version number tells you: when it changes, the sequence has changed. It could have changed by one nucleotide or by five megabases; it doesn't tell you the size of the change, just that this is no longer the same sequence. For example, if this is the accession number of an mRNA and I change the annotated coding sequence but not the mRNA sequence itself, a new record comes out in the database but this version number will not change, because it's still the same nucleotide sequence. The protein sequence has a dot-version as well, so if the protein sequence changes, that will increment; and if I have a nucleotide change that
produces an amino acid change, then both will change. But a version-number change, in GenBank and elsewhere, means a sequence change. Yes? If you skip ahead a few pages, you'll see.

Historically, before the three nucleotide databases agreed on this accession.version structure, the other databases didn't want to do it, so NCBI hid the same information in the GI number (GI stands for "gen info"). All records still have GI numbers, but you don't really need them now: there's a one-to-one correspondence between an accession.version and a GI. The difference is that if the sequence changed, the accession would become .2, but the next GI could be any other string of digits, so you couldn't tell whether two GI numbers were related unless you looked inside the record, which lists the old GI and the new GI. With accession.version, if it's .2 you know there was a .1.

As I alluded to, this didn't exist at the beginning of the databases, so there's a bunch of records that went through many changes before versioning existed; when versioning was introduced, everything became .1, even records that historically had lots of changes beforehand. Fortunately, most of the data in GenBank came in recently, not in the first five years; in the first five years there was only a million records, and now we're in the hundreds of millions, so making that data-model change when they did made sense.

Proteins have GI numbers as well, and there's a protein_id field with the same accession.version structure. If you go to a GenBank record at NCBI and pull down the Display menu, you'll find the revision history.
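The accession.version semantics just described (the version increments exactly when the sequence changes, regardless of how big the change is) can be captured in a small sketch. The helper and the example sequences are invented; only the accession U40282 comes from the lecture:

```python
def bump_version(accession_version, old_seq, new_seq):
    """Return the accession.version a record should carry after an
    update: the version increments only when the sequence itself
    changed, whether by 1 nucleotide or by 5 megabases."""
    accession, version = accession_version.rsplit(".", 1)
    if old_seq != new_seq:
        return f"{accession}.{int(version) + 1}"
    return accession_version  # annotation-only edits keep the version

# Annotation changed, sequence identical: version stays the same.
print(bump_version("U40282.1", "ATGCATGC", "ATGCATGC"))  # → U40282.1
# A single-nucleotide change is enough to increment it.
print(bump_version("U40282.1", "ATGCATGC", "ATGCATGG"))  # → U40282.2
```

This is also why the accession.version pair, and not the GI number, is the identifier worth quoting in a paper: the .2 tells you there was a .1, while two GI numbers tell you nothing about each other.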
I'm looking now at accession number 005517.7, and I click on two of the versions to see the difference between the two records. The date is different, the taxonomy changed, the sequence changed, and with it the GI number changed. I can actually see that the newer record is 20 nucleotides shorter: it's not one nucleotide, it's not a million, it's 20. And that's why it's a different GI number and a different version number: it switched from .6 to .7, the GI number changed, and you can see the GI numbers bear no relationship to each other, but now we understand the differences. I did that comparing versions six and seven (the highlighting is actually wrong here); down here, before versioning, there were just different GI numbers each time the sequence changed, and then in 1999 they instituted the dot-version numbers, so the record became .1 and incremented from there.

That said, most records in GenBank never change; most people submit something and forget about it. But some people are very attached to their sequences and take very good care of them, lots of tender loving care, so this does happen. All records in GenBank belong to the submitter; NCBI is not even allowed to change them. It's not NCBI making these changes, it's the authors. If a change needs to be made, the database may contact the submitter: "We noticed some vector contamination in your file; can we remove it?" And they'll say, "Sure, yes please, this is very embarrassing," and so forth.

Yes? No, it's not random. There's no implicit relationship between two accession numbers that follow each other, except that if I submit two sequences I will get two successive accession numbers. That's useful: if I submit a hundred sequences, then in my paper I
would say, "from this one to that one, these are all mine," because I get a hundred consecutive accession numbers. The same thing shows up here with the GIs: they're also assigned sequentially, so the GI of the DNA and the GI of the protein are off by one.

In the old days, accession numbers were one letter plus five digits: L12345, or U00001. Then we ran out of space, so we got two letters plus six digits: AF, AC, and so forth. And then next-generation sequencing arrived, and there were so many sequences that a lot of them never made it into GenBank. There's actually more stuff not in GenBank than there is in GenBank: if you read the GenBank release notes, they'll tell you "we have 150 or 200 million records," but there are another couple hundred million records of whole-genome sequencing that they're not putting in GenBank right now; those are in process. And what does that mean? It means they're not shared: if data is in GenBank, it's shared with DDBJ and the Europeans, but if it's not shared, the only place to find it is at NCBI.

So how do you get to that data, and what's in it? That pile holds lots of projects from obscure and not-so-obscure organisms that are just in process: not assembled, not finished, but the submitters are making the data available, so NCBI is making it available, without sharing it with everybody until it's more advanced. How do you query it? There's a website I'll show you, or you can tell BLAST to search that pile of unknown stuff. Buyer beware: it's not curated, it's basically straight off the machine. But if you're looking for your favorite gene, you may want to do the extra work to go hunt it down.
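Those two historical accession shapes (one letter plus five digits, then two letters plus six digits) are easy to check with regular expressions. This is a sketch for the two formats mentioned here; real GenBank accession rules have more cases than these two:

```python
import re

OLD_STYLE = re.compile(r"^[A-Z]\d{5}$")     # e.g. U00001: 1 letter + 5 digits
NEW_STYLE = re.compile(r"^[A-Z]{2}\d{6}$")  # e.g. AF000001: 2 letters + 6 digits

def accession_style(acc):
    """Classify a nucleotide accession by its historical format,
    or 'other' for anything outside these two shapes."""
    if OLD_STYLE.match(acc):
        return "1+5"
    if NEW_STYLE.match(acc):
        return "2+6"
    return "other"

print(accession_style("U00001"))    # → 1+5
print(accession_style("AF000001"))  # → 2+6
```

Being able to recognize the shape of an identifier at a glance tells you something about its vintage, which in GenBank often tells you something about the record itself.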
So: there's the concept of secondary accession numbers. The ACCESSION line here has U00096 and then a range of accessions, covering some 400 records which are now secondary to this primary record. This is historical. GenBank's maximum record size used to be 250 kb, because that's all a bunch of software could handle, and NCBI was catering to that software. Anybody ever use GCG? Yes? GCG had a 350 kb file-size limit, so when the E. coli genome came out, which was about five megabases, it was broken down into a bunch of records. Then the file-size limit was removed, so they stuck them all back together, and E. coli is now happy as one piece of DNA; but historically it was published as separate pieces, so if you query for any one of those pieces, you still have to be able to find it. The secondary accessions take care of that. And this is how it all looks on the E. coli genome, which is now one 4.6-megabase record.

Expressed sequence tags, mentioned quickly: this one is a dog sequence. The important thing about an EST is that it comes from an mRNA, so it's a tag of gene expression. Where is it expressed, under what condition, in which cell line, at which developmental stage? Those are the key things, not necessarily what it encodes, because you're going to derive that computationally from the EST data. The other thing about EST data is that the ends of the reads are lower quality. It's still surprising how many ESTs are coming out, though of course RNA-seq is going to supersede all of this. And there's a separate division for that as well, actually: TSA, transcriptome shotgun assembly, which is basically a merger of EST and next-gen data.
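Jumping back for a second to those secondary accession ranges: expanding a range like the one on that ACCESSION line is purely mechanical. The accessions below are invented for illustration (the exact range on the slide is not reproduced here):

```python
def expand_range(first, last):
    """Expand an accession range like 'A00001'..'A00005' into the
    individual secondary accessions it covers (same letter prefix,
    zero-padded numbers of the same width)."""
    prefix = first[0]
    width = len(first) - 1
    start, stop = int(first[1:]), int(last[1:])
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]

secondary = expand_range("A00001", "A00005")
print(secondary)  # → ['A00001', 'A00002', 'A00003', 'A00004', 'A00005']
# Querying any one of these should lead you back to the primary record.
print("A00003" in secondary)  # → True
```

That membership check is exactly the service the secondary-accession mechanism provides: a query for any of the old pieces still finds the one stitched-together genome.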
It's a computationally assembled record. The big difference between transcriptome shotgun assemblies and ESTs in GenBank is that there is no physical counterpart to these assemblies: with an EST there's always a cDNA clone in a fridge somewhere, but with a TSA record you don't necessarily have that. As John mentioned earlier, there's no clone equivalent when you sequence this way. That's what those records look like, and they'll have protein IDs and so forth.

WGS records, the ongoing whole-genome shotgun sequencing projects I mentioned: you can find them with BLAST, there's more information here, and they have yet another accession format, because you can use up accession numbers quite fast. It's four letters plus two digits plus six digits. There's a page listing all the projects: the four-letter project code, the two-digit version, the organism, and the state of the project. So you can find them there or, as I mentioned, through BLAST. In GenBank itself the record is pretty cryptic and doesn't carry much information: there might be a paper (this one is from the Broad), basically just organism information, and then the sequence; these sequences are not in GenBank.

TPA is third-party annotation. Third-party annotation used to be frowned upon and was not allowed in GenBank, and it still isn't, but there's a place for it at NCBI: NCBI will take it if there's a publication attached. So if you're reanalyzing somebody else's data and your paper was published, you can put it in TPA. SNPs, the single-nucleotide polymorphisms we're going to talk about a lot this week, are not in GenBank either; there's a separate database for them, dbSNP. SAGE, another gene-expression tag method, lives in GEO, somewhere else again. And then there's RefSeq.
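Back to those WGS accessions for a moment: the four-plus-two-plus-six shape just described can be split apart with one regular expression. This is a sketch, and the accession below is an invented example in that style, not a real project:

```python
import re

# Four project letters, a two-digit assembly version, six record digits.
WGS = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")

def parse_wgs_accession(acc):
    """Split a WGS-style accession into its project code, assembly
    version, and per-record number, or None if it doesn't match
    the 4+2+6 pattern described in the lecture."""
    m = WGS.match(acc)
    if not m:
        return None
    project, version, number = m.groups()
    return {"project": project, "version": version, "number": number}

print(parse_wgs_accession("AAAA01000001"))
# → {'project': 'AAAA', 'version': '01', 'number': '000001'}
print(parse_wgs_accession("U00001"))  # → None
```

Embedding the project code and assembly version in the accession is what keeps these in-process piles from burning through the ordinary accession space.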
All the records in RefSeq come from GenBank; they were all sequenced by somebody. RefSeq was basically NCBI's trick for being able to edit records: if I want to take the best mRNA sequences from the human genome, say, and annotate them all the same way across the organism, how do I do that? I take the records from GenBank, which are open and available, I make them my own, I edit them, and I put them in a separate database, the RefSeq database. Everything there got re-annotated and re-curated in a standard way across the whole set. So you have RefSeq mRNAs, RefSeq proteins, RefSeq genomes, and all three of those spaces are uniform. Not all organisms have RefSeq sequences, but many do. UniProtKB... actually, let me stop here, and we'll finish this after lunch. Okay, any questions, comments, concerns? You all know about GI numbers now? Good. Thank you.