The Creative Commons slide is to declare that this lecture is under a Creative Commons Share-Alike license, which means you're allowed to use and remix these slides. Remix means you can take the one slide you want out of the whole deck and include it in your own presentation. All you have to do is two things: say where you got it from, and share your slides as well. So today I'm going to talk to you about databases and visualization tools. Databases are really one of the key ingredients of many, if not all, of the bioinformatics activities that we do. We always rely on databases, we often take them for granted, and we use them. So I'm just going to give you a bit of background about them. A disclaimer: I may mention some company names and products, and I will not make any profit from any mention of these companies, nor do I have any relationship to any of them. Here is my email address — feel free to contact me after class — my Twitter handle, and if you'd like to tweet, this is the hashtag for this workshop and for the workshop series as a whole. Some of the learning objectives today: a review of the databases that we use in bioinformatics and in cancer genomics. I'm just going to touch on a few of them; there are many, many more, and I'll give you some references where you can go find the rest. But what I give you today should give you a good sense of what kinds of things to go look for. Then we'll talk about visualization of cancer genomic data at icgc.org, and visualization of data using IGV, the Integrative Genomics Viewer.
We'll also do an introduction to cloud computing: why we're using clouds, and how to log into your account, because this is where you'll be doing most of your computing throughout the rest of the week. We are generously sponsored by Amazon Web Services; we write an education grant and they give us Amazon dollars — not real dollars — that we can share with you. That should be taken into account when you're doing work on Amazon: today and this week it will be free, but when you do it on your own, it's not free. The actual costs for doing genome assemblies are not minimal, but they're not extravagant either, and it's an interesting calculation to compare doing it on the cloud versus setting up your own HPC infrastructure; we can have that discussion a bit later. So this is going to be a beginner-level introduction to bioinformatics in the cancer genomics space. And of course there's a great quote, "Nothing in biology makes sense except in the light of evolution" — that's Dobzhansky — and we can say that nothing in bioinformatics makes sense except in the light of evolution; I sort of stole it from somebody. Really, all the inferences we make, all our understanding in computational biology — about the evolution of tumors, about the function of proteins, about what gene products do — comes from our understanding of the evolution of organisms, the evolution of tumors, and so forth. So, as a first thinking exercise: why do we have bioinformatics? We have bioinformatics because we have open data. The fact that we had something like GenBank allowed the development of tools like BLAST. How many of you have done a BLAST search before? How many of you have never done a BLAST search before?
BLAST is not something we're going to do in this class, but it's something you should go look at this week. It is basically a search tool to look for sequences similar to your sequence of interest. You have your query — your sequence of interest — and you quickly look across all of GenBank, across all of UniProt, across various databases, to see which sequences are the most similar. It takes shortcuts, because it's actually a very complicated computer science problem to compare your string of sequence against all the known strings. So it uses heuristics — shortcuts — but they're really smart shortcuts that allow the algorithm to be relatively fast and still find very similar and related sequences. Searching that space became a problem precisely because we had GenBank: GenBank is an open-access, open-data database resource, and we needed tools to look into it. If everybody kept their sequences to themselves, you could almost do the search by hand, by lining things up. But the fact that we have millions, if not hundreds of millions, of sequences makes it really challenging, hence the requirement to develop tools like BLAST. I'm sure you think about the definition of bioinformatics all the time; I'm going to challenge you with that right now. I'm going to ask you to pair up with the person sitting next to you and share what you think the definition is, and write it on one of your yellow stickies. I guess that's what they're for, right, Michelle? The green or yellow ones — not the red one. Keep your red ones for later; the red one is going to be very important when you're in need of help. So think, and write down a definition of what you think bioinformatics is.
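To make the shortcut idea concrete, here's a toy sketch in Python — not real BLAST, just the word-lookup (seed-finding) heuristic it starts from. The real algorithm adds scored neighbourhood words, seed extension, and statistics on top of this:

```python
def find_seeds(query, subject, w=4):
    """Toy seed-finding step in the spirit of BLAST's heuristic:
    index every word of length w in the subject once, then look up
    the query's words in that index, instead of comparing every
    position against every other position (illustrative only)."""
    index = {}
    for i in range(len(subject) - w + 1):
        index.setdefault(subject[i:i + w], []).append(i)
    hits = []
    for j in range(len(query) - w + 1):
        for i in index.get(query[j:j + w], []):
            hits.append((j, i))  # (query offset, subject offset)
    return hits

# Exact 4-letter word matches between a short query and subject:
print(find_seeds("GATTACA", "ACGATTACATT"))  # [(0, 2), (1, 3), (2, 4), (3, 5)]
```

The point of the sketch is that building the index is done once, so each query word costs a single dictionary lookup rather than a scan of the whole subject.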
What is your definition of bioinformatics? And don't look at my definition — it's probably there on the next slide or something. Yeah, it is; that's too easy. I'll have to shuffle my slides next time. You have to share with the person next to you; you have to communicate. This is actually the part of the class where you're supposed to talk, because you've paired up. What is bioinformatics? Okay, does anybody have an answer they want to volunteer? That's a very good one, I like that one. Anybody else? Yeah? The TAs have one over there? No? I thought you were raising your hand — sorry, I don't want to put you on the spot. Anybody else? No? Peter. Which Peter? This Peter. Okay: it's like reverse engineering biological systems. So here's my definition, and that answer really captures it. I think if you ask a room of 100 bioinformaticians, you'll get 100 different answers. But bioinformatics is about integrating biological data with the help of computational tools and biological databases to get new knowledge about the system under study. Bioinformatics is really a biologically driven field. We all have lots of software-engineer friends in bioinformatics, but the questions we answer in bioinformatics and in computational biology in general are biological questions. Of course, some of our friends are there to make BLAST go faster, and that's an algorithms and software-engineering problem. But to understand what you get out of BLAST, and out of the other tools you're using in bioinformatics, you have to understand the biological questions: you have to understand evolution, to understand how, for example, the function of one protein can be inferred to be present in another protein from the similarity of certain domains, and so forth.
So it's really all about reverse engineering the cell, but doing so with a biological mindset in the first place. Another very important thing about bioinformatics that I touched upon — and it's really core to the whole field — is that we are totally dependent on other people. Bioinformaticians in general do not generate data; they use other people's data. So we're really dependent on an open mentality and an open community, and the open-data concept is central to the way we do our work. Likewise, a lot of the software used in bioinformatics is open source. People say it's as good as what you pay for it, and I would argue that in bioinformatics that's not true: there are a number of great software packages which are community driven and free of charge, but which cost a lot, in the sense that a lot of people's grants and a lot of people's effort have gone into developing them. And the challenge for many of these is often not getting them started but getting them maintained, continued, and supported. Funding agencies — Genome Canada, CIHR, NIH, and so on — are really keen to fund new things; it's a little harder to get funding to maintain things. That said, community efforts have maintained many good packages over the years which are central to the work we all do. Related to all of that, I think the ultimate in openness is open-access publication: if you make scientific discoveries that are not in open-access publications, then people can't read them and don't know about them. If it's behind a paywall, only certain people can look at it.
Many of us in this room are probably privy to free versions of Nature, Science, Cell, and so forth. But there are a lot of people out there who do not have that, or have it only within the walls of their university and not when they're somewhere else. There are lots of stories of people away from big universities, stuck in hospitals trying to make medical decisions, paying $35 an article to be able to read the papers they need to make those decisions. So these three things together — open-source software, open data, and open-access publication — are really at the core, I think, of computational biology and bioinformatics. Not everybody agrees with me; maybe not even all the instructors in this course agree with me. But I think it's really what sets this field apart in a way that's quite unique. I'll get off my soapbox now. So, bioinformatics reagents: I really think of databases as one of the reagents that we use in doing bioinformatics work; it's a useful way of thinking about it. A database is an organized array of information. It's a place where you put things, and if all is good, you can get them back out. There are bad databases out there where you put something in and you can't find it anymore; those are not good things, and when you build a database, one of the things you test is the ability to get things back out. Ideally it's also a resource that other databases can build upon. A model organism database, for example, is a great resource for, let's say, a human reference gene database, which can infer function from similarity to work that's been done in the mouse, or in worms, or some other model organism. A database simplifies the information you're working with, so it allows you to look at it and make sense of it all.
And ideally it allows you to make discoveries. There are many scientists for whom that's all they do: they make discoveries out of other people's data — things that were missed or not looked at carefully — because the data is there and they're using the right tools. An important thing when you're looking at a database is: what's the data model? How is the data modeled? How is it input? What is the guiding structure for how the information is organized in a given database? Why does a version number change? If I have two entities with the same name, is one called A and the other called B, or are they both called the same thing so you can't tell them apart? If I update something, what happens to the accession numbers, what happens to the version number? If I modify a record, if I delete a nucleotide, what happens? All of these are important things to understand. So, a classic bioinformatics experiment — and we do experiments in bioinformatics; people don't think about it that way, but that's definitely what it is. You have reagents that you put together: you have sequences and databases. You do a search, and you decide which method you're going to use: protein against nucleotide, nucleotide against protein, or a translation of your nucleotide against translations of the nucleotide database, and so forth — those are the various flavors of BLAST: BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX. Then you get an alignment, and that's an interpretation: you have similarity scores, you're testing a hypothesis. These are really all the hallmarks of doing an experiment. You have to know your reagents, you have to know the methods, and you have to do your controls. What kind of control can you do in a BLAST experiment?
A classic control could be a sequence that you know is in the database: do you find it if you search with a similar sequence? If you can't find something you know is there, your parameters for the BLAST search are probably wrong — that's an example of a control. As I mentioned, we're all bioinformatics citizens here, and what does it mean when we find things? Several years ago I wrote a little letter in Nature to implore the community: when you find a mistake in a database, report that mistake to the database owners or curators. If you don't, then somebody else is going to find the mistake and curse the database, and then somebody else again, and again. Many databases have curators who work for them, so if you report that error it'll get fixed and it won't happen again; the problem will be solved for the next user. These are public resources that we should take advantage of and share alike. You can think of databases as having different layers of complexity. First, the data: for example a GenBank flat file, a COSMIC record (COSMIC we'll talk about a bit later), a protein-protein interaction record, the title of a book, or a book itself. Then the storage system: it could be boxes — a very simple system — Oracle, which is a commercial relational database system; MySQL, an open-source relational database system; a bunch of files; a Unix text file; or a bookshelf. All of these are examples of storage systems. The query system could be a list you look at, a catalog, index files, SQL — the structured query language — or the grep command in Unix.
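To illustrate the "bunch of files plus grep" end of that spectrum, here's a minimal sketch: a couple of hypothetical GenBank-style records in one flat file, where each record ends with a `//` line, queried with nothing more than string operations — the Python equivalent of grepping for `LOCUS` lines:

```python
# Two made-up minimal records (not full GenBank syntax), "//"-terminated.
flatfile = """\
LOCUS       AB000001     120 bp    DNA
DEFINITION  example record one.
//
LOCUS       AB000002     300 bp    DNA
DEFINITION  example record two.
//
"""

# "Query system": split on the record terminator, then scan for LOCUS lines.
records = [r for r in flatfile.split("//\n") if r.strip()]
loci = [line for r in records for line in r.splitlines()
        if line.startswith("LOCUS")]

print(len(records))            # number of records in the "database"
print(loci[1].split()[1])      # name on the second LOCUS line -> AB000002
```

That's the whole point of the layering: the same two records could instead sit in MySQL behind SQL, and the data wouldn't change — only the storage and query systems would.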
Those are the various ways of querying the system you're looking at. And then the information system — the big thing we're looking at — is a relatively complicated, organized collection of databases. Examples would be the Library of Congress; Google; NCBI's Entrez information system; Ensembl, which is EBI's information system; the UCSC Genome Browser; and ICGC. So you can think of ICGC, the International Cancer Genome Consortium, which we're going to talk about today, as a relatively complicated information system with all sorts of parts that you need to understand and follow. NCBI tracks a lot of databases, and part of their strength is that they not only track many various types of databases, but they interconnect them quite readily. One of the central pieces of the NCBI data model is that DNA makes RNA makes protein, and then you write a paper about it. They very tightly link sequences to each other: protein sequences to DNA sequences and RNA sequences. They also tightly link protein sequences to similar sequences, so anything that has a similarity score becomes what NCBI calls a related sequence. If you look at any sequence at NCBI, you can see related sequences — those are basically pre-BLASTed sequences for which you already know the similarity. And all of these are mentioned in publications, so NCBI does a very big job of keeping track of publications and ensuring the connectivity between sequences and publications. As we all know, a lot of the information about phenotypes, about metadata, about the experiment, about all sorts of other things related to what we compute on and work with, is embedded in publications, so that link to the publication is really important.
These are numbers from 2011, and I also have some numbers from 1999, the first year we gave this course — the first bioinformatics workshop was in 1999, so that's 15 years ago now, which ages you a little bit. If you look at the next slide, in this roughly 12-year period there's about a 32-fold increase in the number of nucleotide records in GenBank, 63-fold in the number of protein sequences, 23,000-fold in the number of dbSNP records, and so forth. Some of these databases you may not know, and I invite you to go look at them and see, for example, the differences between PubMed and PubMed Central, or what the OMIM database is. Does everybody know what OMIM is? Anybody not know? It's Online Mendelian Inheritance in Man: a Johns Hopkins database, now hosted at NCBI, which tries to keep track of all the disease genes in the human genome. It's curated by clinicians, so it's a really good resource — many of the cancer genes we're interested in will be curated as such in OMIM. It's a good standard resource for human disease genes. That's just one example, and I invite you to have a look. To get hold of these numbers, you can go to the Entrez homepage where you can query all these databases, and instead of typing your favorite gene, you type all[filter], with "filter" in square brackets, and what it will give you is the number of records in each of these databases. And you can go look at each of these databases and see what kind of records they have.
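The same all[filter] trick works programmatically through NCBI's E-utilities. As a sketch (we only build the esearch URL here; actually fetching it needs network access, and NCBI asks real clients to identify themselves with tool and email parameters, omitted in this toy):

```python
from urllib.parse import urlencode

def esearch_count_url(db):
    """Build an NCBI E-utilities esearch URL that asks for the total
    record count of a database, using the same all[filter] query as
    the Entrez web page. Sketch only: nothing is fetched here."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {"db": db, "term": "all[filter]", "rettype": "count"}
    return base + "?" + urlencode(params)

url = esearch_count_url("snp")
print(url)  # note the URL-encoded brackets: all%5Bfilter%5D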
It gives you a landscape of all the things NCBI has and how many records of each type are in these databases, and they're all retrievable and extractable. A lot of bioinformaticians have made their careers reformatting things, taking data from one format into another. That is not the best way to spend your time, but it turns out to be something a lot of people do anyway. I'm not going to encourage you to do it, and it's not that much fun, but a lot of bioinformaticians have done it in their careers: for example, converting GenBank flat files into FASTA files, and trying to do the much more complicated thing of going the other way around, or embedding enough information in a FASTA header to carry all the various idiosyncrasies of the GenBank flat file. This week, though, we're going to talk about FASTQ files, which are basically DNA sequences from sequencing machines; they are derived from FASTA files, so understanding the history of the file format helps a little bit. We're also going to talk about SAM and BAM files, which are alignment files, also from sequencing experiments. And then we're going to talk about variation files: ICGC has its own format, and VCF is the more standard format. And there are many, many more. The URL at the bottom here is from the UCSC genome group; it has a list and definitions of all the various file formats, and it's a very useful reference. I'm going to spend a little bit of time on the GenBank flat file, because I was actually in charge of GenBank for five years at NCBI, so it's a file format I know and love. It's not used as much now, but it does carry a lot of our knowledge.
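Since FASTQ is the format you'll see most this week, here's a minimal sketch of what it looks like and how simply it can be read: each record is four lines — an `@` header, the sequence, a `+` separator, and a quality string the same length as the sequence. (A sketch only: real readers need gzip handling, validation, multi-line edge cases, and so on.)

```python
def parse_fastq(lines):
    """Minimal FASTQ reader: yields (name, sequence, quality)
    tuples from an iterable of lines, assuming strict four-line
    records -- '@' header, sequence, '+' separator, qualities."""
    it = iter(lines)
    for header in it:
        seq, plus, qual = next(it), next(it), next(it)
        assert header.startswith("@") and plus.startswith("+")
        yield header[1:].strip(), seq.strip(), qual.strip()

# A single made-up read with dummy qualities:
example = ["@read1", "GATTACA", "+", "IIIIIII"]
for name, seq, qual in parse_fastq(example):
    print(name, seq, len(qual))
```

Notice how the first and second lines alone are just a FASTA record with `@` instead of `>`: the quality lines are the part FASTQ bolted on for sequencing machines.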
A lot of our knowledge about sequences has been captured in GenBank because people have sequenced their favorite gene, submitted it to GenBank, written a paper about it, and put in an accession number that references the cDNA or genomic sequence, which carries the understanding of the function of the protein encoded by that cDNA, for example. A lot of that kind of information in GenBank is now used by RefSeq, which is a derivative of GenBank used in all the genome browsers. All of this comes from basically 20 or 30 years of molecular biology and bench experiments that ended up in GenBank flat files, which end up in protein sequence databases, in browsers, and so forth. A GenBank flat file has a header with the title, taxonomy, and citation. It has features, which are the annotations on the sequence — often where a gene is encoded in that segment of DNA — and then the DNA sequence itself. A FASTA file can be either nucleotide or protein, and it's basically a greater-than sign, a string of anything, and then a sequence; the next FASTA record is another greater-than sign and another string. So the requirements for FASTA are quite minimal. NCBI and EBI have put a lot more structure into the headers of their FASTA files, after the greater-than sign: their files always have a database name, a pipe sign, an ID, a pipe sign, another database, another ID, and then a short description of what that sequence is. This one is a FASTA file of a protein. I remember when we were at NCBI, we had a lot of people from computer science and physics, and we explained to them that the difference between RNA or DNA and protein is: well, if it's less than 85% ACGT, it's probably a protein.
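That half-joking rule of thumb can be written down directly. A minimal sketch (the 85% threshold is the one from the anecdote, not a rigorous classifier):

```python
def guess_seq_type(seq, threshold=0.85):
    """Rule-of-thumb sequence classifier: if fewer than ~85% of the
    characters are A/C/G/T, the sequence is probably a protein.
    Right most of the time, but obviously not a rigorous test."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return "nucleotide" if acgt / len(seq) >= threshold else "protein"

print(guess_seq_type("GATTACAGATTACA"))        # nucleotide
print(guess_seq_type("MSEYQPSLFALNPMGFSPLD"))  # protein
```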
So you give them an algorithm, they figure it out, and they're right most of the time. Obviously there's not that much information in the header, but a biologist will read it and see: ah, it's from yeast, it's a protein, it's GCN4, so it's a regulatory protein I know — the general control of amino acid biosynthesis. But a FASTA file could be as simple as just this. There are two major categories of databases: primary, archival-type databases, and secondary or curated databases — and some of them fall on both sides, because they're archival in nature but also curated. GenBank is considered archival; and when I say GenBank, that's actually GenBank, EMBL, and DDBJ, the three international nucleotide databases. UniProt, the protein database, is both a primary sequence database, because it's derived directly from primary data, and a highly curated one. MEDLINE/PubMed is also considered a primary database. IntAct is a protein interaction database, and EGA and ICGC archive primary data as well. Then there's, as I mentioned, OMIM, a database where curation takes place, and MGD — anybody know MGD?
It's a very important model organism database for all of you: the Mouse Genome Database. And Taxon is the NCBI Taxonomy database of all known organisms for which we have a sequence, so each one has an ID, and the best way to identify an organism is with its taxon ID. [Question] Yes — "curated" there usually means that there are curators working at the database who make sure it's in better shape than what was submitted to them. We call them biocurators now; it's a new term in the last few years. Biocurators are scientists — it's a different career path for scientists, so you have PhDs doing biocuration who usually work on a database and read the literature. I mentioned earlier that things in the literature often don't show up in the record; biocurators make sure that everything in the literature does end up in the record, organized in a way that makes it retrievable. It's often metadata about an entity of some kind — for example, the experiment that was done to derive the knowledge — added by people reading papers. A lot of the model organisms have their own databases, and a lot of this happens in RefSeq. RefSeq is NCBI's curated version of GenBank, basically: there might be, say, 10 records in GenBank for beta-globin; they'll take what they think is the best one, make it a RefSeq record, call it their own, and curate it in a standard way — the same way they curate all the other 20,000 human genes. It's a uniformly standardized way of curating, so that people who compute on it know how the information was derived. So, NAR publishes every year — sorry, another question.
[Question about evidence] Yes — there are actually evidence codes, and you can look for the ones that say "inferred from experiment" if that's your bar for something publishable; often a record won't have that. Take, for example, GO annotation — the Gene Ontology. RefSeq records will have GO terms attached to them, and those GO terms carry evidence codes. [Question] No, there's a distinction. With RefSeq, if there's a transcript — a messenger RNA — that means somebody did an experiment and sequenced it: there are no synthetic sequences in RefSeq. Every sequence is derived from some organism, and every transcript is derived from a tissue from some organism. So a RefSeq transcript is known to come from an organism, but if it's annotated as coming from a certain tissue or a certain developmental stage, that information probably comes from a paper and will have been curated by somebody. A messenger RNA by itself is an experimental result telling you that a transcript that looks like this was found — that's an experiment in itself — but the fact that it's from that tissue or that developmental stage usually comes from the paper, or from the submitter. Okay, so NAR — Nucleic Acids Research — has a database issue every January, which so far covers something on the order of 1,500 databases. Every year some of the top-tier databases — things from NCBI, from EBI, or from the JGI on the west coast, the DOE-funded labs — are allowed to write an update every year, but most databases are only allowed an update every two years, and they have to justify being included in the NAR database issue. So it's become the standard for a top-notch database to be included in the NAR database issue. Here I point you to two years' worth of database issues; if you want to have a look at all the databases that are
still alive and well, you basically have to look at a couple of years' worth. And this is an example from NCBI: they'll have a paper on GenBank, but they also have one paper covering all their other databases, and that's an example of such a paper, in Nucleic Acids Research, which is an open-access journal, free to download — that goes without saying. So what is GenBank? GenBank is the NIH genetic sequence database of all publicly available DNA sequences. When the first draft of the human genome was done, it was put into GenBank; Jim Watson's DNA is in GenBank, Craig Venter's DNA is in GenBank. It is the public repository of open-access sequences. I mentioned DDBJ, ENA — the European Nucleotide Archive, which used to be called EMBL — and GenBank: those three are actually the same database. Everything in GenBank is in ENA, and everything in ENA is also in DDBJ, because these databases exchange data every day; whatever is submitted to one ends up in all three databases daily. If you submit to EBI, it ends up at EBI and it ends up in GenBank; but if you want to do an update, then EBI has to do the update. Wherever you submitted in the first place, that's where updates and so forth are done. Each database, though, has its own way of querying, looking into things, and doing BLAST and FASTA searches. This is a figure from the textbook. [Question] Correct — the advantage is probably to submit to the one closest to you. From a submission point of view, you probably want to submit in your time zone, because you'll need to communicate with them, and you don't want to be doing that at 3 o'clock in the morning; if you're in North America, submitting to NCBI makes more sense. From a user's perspective, I think it's whatever services you like to use and whatever you think is connected: if you like Ensembl a lot,
then you go with EBI; if you like all the other NCBI products, then you go with NCBI. But from a data point of view — downloading the database — they are the same in all three, except for the format. There are a lot of people out there converting Ensembl format into GenBank format; it's a waste of good people's time. [Question] HAVANA — that's the difference: that's another level of curation, another level of added value, which is unique to one provider. The HAVANA gene models don't usually make it into the UCSC browser or into NCBI's products; they're specific to EBI. One thing I won't talk about much, but will quickly mention: one of the differences between the UCSC browser, EBI's Ensembl, and NCBI's browser is that the framework — the sequence coordinate system — is the same; they all agree on one coordinate system. It's how each one decorates, or annotates, the genome where the differences are. Some are more liberal and some are more conservative. NCBI is more conservative: they are less likely to put in a gene model unless there's good evidence for it. EBI will have more potential gene models that may have been derived only from gene-prediction software; they'll have special codes for those, so you know it's a gene model without as much evidence behind it, but it will be included. So they may lead you down a false track, or they may give you some insight you wouldn't have had otherwise — it can be taken both ways. There are lots of different kinds of files in GenBank. There are the records I mentioned earlier from the one-gene, one-investigator era: their favorite gene, they've spent their whole career working on it, they've submitted the sequence to GenBank and it's well annotated — lots of knowledge, lots of validation, biochemistry, active sites of the enzymes, all that kind of stuff. Those are the gems in GenBank. And then, especially today, there are records that
came from large genome centers that just sequenced a whole genome, did an assembly, predicted where the genes were, did some modeling and so forth, and put it all together. The modeling came from ESTs, expressed sequence tags, which are short, not full-length, transcript reads that give you models of where genes could be; these have to be taken with a grain of salt. There are also a lot of things which are unfinished: a group will do 2x or 3x coverage of a genome, try to assemble some of the pieces, submit that to GenBank, and that will be all you'll ever see from that organism. You won't see that for humans. We're lucky to work with the human genome in cancer genomics, because it's a much better understood genome thanks to all the effort that has been applied to it. But keep in mind there's a whole range of things in GenBank. Its files are put into divisions, and that's really historical: it was a file-size limitation problem, because they couldn't deal with large files on VAXes. A file-size limit doesn't make sense now, but they've kept those divisions anyway. Really, I think of GenBank divisions in two categories. Some are functional, and those are actually very useful to understand and take advantage of. The organismal ones don't make much sense, because there are a lot of differences between what ENA, DDBJ and GenBank put in those divisions, so it's not done consistently; if you want to separate organisms, you should use taxon IDs, not GenBank divisions. But understanding what, say, an EST is, a short read from a cDNA, and using tools which are specific for ESTs, that makes sense. So to extract
the human ESTs, assemble them into gene models and so forth is a useful way of dealing with that division. So in GenBank, things are put together for various reasons, and understanding why is really useful for taking advantage of this database. Another big topic is identifiers. If I look at this list here: a GenBank record starts with a locus name. Historically, the rule is that a locus name has to be unique within the database; the rule is not that the locus name has to be the same between databases. The locus name used to be something that people memorized: the first time people sequenced SV40, the locus name for it was SV40, and your favorite globin gene would get a name derived from the word globin, so you could memorize it. But now we have hundreds of millions of records, so nobody memorizes them. That was in the old days, when there were only around 100,000 records in GenBank and I could memorize a few; with hundreds of millions it doesn't make sense to memorize anything, and the locus name is awful anyway, because it's not maintained between the databases, so it's not very useful. The accession number, though: that is a useful one. It's the one you put in a publication: I sequence a gene, and I reference the accession number. And what has happened since is that accession numbers now carry a version. For GenBank, a version going from U12345.1 to U12345.2 means the sequence changed. It could have changed by one nucleotide; it's still the same record, but the sequence changed, and so the version number is incremented. If you annotate or change the features or anything like that on a GenBank record, that does not change the version number; the only time a version number changes is when the sequence changes. So that's a good
understanding, an important thing to understand. Why would a sequence change? Maybe they re-sequenced; that's the most common reason. But say they were doing bacterial sequencing and sequenced two genes in a chunk of genomic DNA, and later on it's discovered there's a third gene in the middle that they hadn't annotated. They'll go back and add annotations to show there's a third CDS, a coding sequence, between the other two. They didn't change the DNA sequence, so the DNA version number doesn't change. But the amino acid sequences have version numbers of their own, and so the first and second proteins stay the same, while a new one that didn't exist before is now associated with this DNA sequence. As I mentioned, NCBI keeps track of the relationship between the proteins and the DNA; that's an example of it keeping track. Or say there appeared to be two transcripts, but it's actually one open reading frame interrupted, for whatever reason, by a sequencing error. You had two accession numbers, but now you only have one: it became .2 and the other one got removed. There are all these kinds of corrections. People find similarities with other organisms and say: in this organism that's one open reading frame, why is it two in that one? Is it a sequencing error? They go back and re-sequence or check it, and indeed it was a sequencing error. Historically, before GenBank had accession.version, it had GI numbers, which stands for GenInfo. The GI number follows the same logic as accession.version, except that it's just an integer: if the integer changes, that means the sequence changed. Yes? No, that's a good question: lots of people have submitted sequences which were never published, and you can tell from the reference block.
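The versioning rule described above can be sketched in a few lines (a generic illustration, not any official tool: the accession stays fixed, and the version suffix increments only when the sequence itself changes, never for annotation-only edits):

```python
def sequence_changed(acc_a, acc_b):
    """For the same accession, a different version suffix means the
    sequence itself changed; annotation-only edits keep the version."""
    base_a, _, ver_a = acc_a.partition(".")
    base_b, _, ver_b = acc_b.partition(".")
    if base_a != base_b:
        raise ValueError("not the same record")
    return ver_a != ver_b
```

So `sequence_changed("U12345.1", "U12345.2")` is true, while a feature-table update leaves the version, and therefore the answer, unchanged.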
Every GenBank record has a reference block; the default one says the sequence was submitted by such-and-such, so it tells you who it came from, but not necessarily a publication. If there is a publication, there will be a second reference block with a full citation as part of the record. Proteins have GIs as well, and protein IDs also have version numbers; it all works the same way. If you go to NCBI, you can actually look at the revision history. Take the top two records there: they have two different GIs, which means the sequence did change between them, and if you look at the diff between those two files, it shows you the date of the change. The number of nucleotides changed too: it used to be 106,330 and it became 106,210, so it lost 120 nucleotides, and because of that the GI number changed. Now, the accession number space. Accession numbers were originally one letter plus five digits; then we ran out of those, so we started using two letters plus six digits; then we were running out of those as well. The whole-genome shotgun records, which are not distributed with GenBank proper, are four letters plus two digits plus six digits: a four-letter root, then the numbers. Proteins are three letters plus five digits. So if I look at an accession number, I can actually tell whether it belongs to a protein; the structures are different. Historically not every record had one, but now every nucleotide record and every protein in GenBank has an accession.version number. So, what about the human genome? There's the Genome Reference Consortium: historically it grew out of build tracking, and now it's a collaborative multi-institute group that agrees on the one coordinate system everybody uses for hanging their annotations on.
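The accession shapes described a moment ago can be told apart with a few regular expressions. This is a rough sketch only; the real INSDC rules include more prefixes and lengths than the four mentioned in the talk:

```python
import re

# Simplified patterns for the formats mentioned above (order matters:
# the more specific shapes are tried first).
ACCESSION_SHAPES = [
    ("protein (3 letters + 5 digits)", re.compile(r"^[A-Z]{3}\d{5}(\.\d+)?$")),
    ("WGS (4 letters + 2 digits + 6+ digits)", re.compile(r"^[A-Z]{4}\d{2}\d{6,}(\.\d+)?$")),
    ("nucleotide (1 letter + 5 digits)", re.compile(r"^[A-Z]\d{5}(\.\d+)?$")),
    ("nucleotide (2 letters + 6 digits)", re.compile(r"^[A-Z]{2}\d{6}(\.\d+)?$")),
]

def classify(accession):
    """Guess which kind of record an accession (with optional .version)
    belongs to, per the simplified rules above."""
    for label, pattern in ACCESSION_SHAPES:
        if pattern.match(accession):
            return label
    return "unknown"
```

For example, `classify("U12345.2")` reports a nucleotide record, while `classify("AAB12345")` reports a protein, which is the point the speaker makes: the shape alone tells you what kind of record it is.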
Which annotations each group adds, at EBI versus NCBI versus UCSC or anybody else, will differ, but they all agree on the human genome reference coordinates. For humans we're now at GRCh38, version 38, and what you'll find is that many browsers and many groups won't be at 38; they'll still be on 37, because going from 37 to 38 can take years. You have to re-compute everything, re-align everything and so forth; it's quite complicated. On the next slide: GRCh38 came into force in December 2013, just last December. It had been almost five years since the previous update, between 37 and 38, and only three years for the one before that, so the updates are coming more and more slowly, first of all because we're getting closer and closer to being finished. We have actually not finished the human genome. I know that's a shock to all of you, but we're not done: there are still gaps, there are still things which are unclear, and it's 3 billion bases that we have to organize, so it will be a few more years. For example, on the ICGC, people were asking me recently: are you using GRCh38 yet? No, because we have to impose it across the board on all the participating countries, and it will take them a long time to re-jig their pipelines to work with GRCh38. One interesting thing about this last release is that UCSC, which had been incrementing through hg1, hg2, hg3, up to hg19, jumped from hg19 straight to hg38, so that hg38 would match GRCh38. The GRC name itself has only been used for the last couple of releases; before that it was NCBI Build 36, Build 35 and so on, and what would have been NCBI's next build became GRCh37. So it isn't NCBI alone anymore; it's a group of people that agree on this coordinate system. It actually goes back to 2001 or 2002, around when UCSC switched from hg8 to hg10, that everybody agreed to take NCBI's build. You can imagine that before 2001, everybody was doing their own build; everybody had their own coordinate system.
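The naming correspondences just mentioned fit in a small lookup table. This is a hypothetical helper, just to keep the synonyms straight when mixing files labelled the UCSC way and the GRC way:

```python
# UCSC assembly name -> the reference build it corresponds to.
UCSC_TO_BUILD = {
    "hg18": "NCBI Build 36",
    "hg19": "GRCh37",
    "hg38": "GRCh38",
}

def same_coordinates(a, b):
    """True if two assembly labels refer to the same coordinate system."""
    norm = {build: build for build in UCSC_TO_BUILD.values()}
    norm.update(UCSC_TO_BUILD)  # map UCSC names onto the build names
    return a in norm and b in norm and norm[a] == norm[b]
```

So `same_coordinates("hg19", "GRCh37")` is true, while hg19 and GRCh38 coordinates must never be mixed without lifting over; this is exactly the bookkeeping the speaker warns will matter during the labs.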
Everybody was trying to compare their gene with other people's genes, and that was basically impossible. A common build wasn't necessarily the perfect solution, but it was the same solution for everybody, which allowed models to be compared from one browser to another, from one system to another; that was the point, really. Right now most of what we work with is still in hg19, some of what we may do this week may be in hg18, and keeping track of that will be very important. Another concept is BioProjects, which is basically an initiative, with EBI on board as well, for keeping track of larger projects. [Question from the audience] Well, yes, except the last update took almost five years, and the one before that came after three years. The thing is, it's not finished yet, and even the parts that are finished we keep correcting. And there are many problems here; one problem is that this is considered the reference, and there is no single reference, right? Especially if you start thinking about a Japanese reference versus an African reference versus a Russian reference: those are different references. In cancer biology we call mutations in the cancer genome compared to the normal genome, so in principle we use the latter; but the assembly of the normal genome is actually made against a reference, and the mutations in the cancer genome are also called against a reference. So we're comparing against this reference, which is a poor reference to start with, and then we're making inferences from that. We use all of this right now because we don't have a really good de novo assembly program: we have no good way of taking raw sequence reads and assembling them into a finished genome without using a reference. We need a reference right now because it's the best available solution, but it's not the ideal solution; the ideal would be de novo assembly, and once we have that worked out,
once France and Jared's group and others figure that out, this won't be a problem anymore. We'll forget references, we'll just do de novo assembly, and the right, true reference for a cancer genome will be that patient's normal genome; that will be the clean solution. Thank you, Michelle; Michelle is telling me to hurry up, so I'll go through BioSamples quickly. All of this is in service of making discoveries in cancer biology across all the data types we have access to: simple somatic mutations, methylation marks, gene expression alterations, structural variations, small insertions and deletions; a whole slew of things we're trying to work out so that we can understand what is happening in the cancer genome. Things in the cancer genome space have changed a lot in the last few years. Initially we got our understanding from looking at short sequence tags, which gave us an idea of where the genes were and what they were, and we used those to map and also to assemble the genome before the human genome was sequenced. We got lots of insight into polymorphisms and how worldwide variation relates to population migrations, and got involved in genome-wide association studies, which in many cases were linked to diseases and allowed us to discover disease-associated genes. Then there's the infamous Homer paper, which demonstrated that in a case-control study, say comparing a normal group versus a schizophrenic group and looking for variants in each group, if I had only a few variants from you and knew you were part of that study, I could, with good probability, tell which of the two groups you belonged to. All that data had been open; after the Homer paper, NIH shut that association-study data down and made it controlled-access data.
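The gist of that re-identification result can be sketched with a toy likelihood comparison (illustrative numbers only, not the actual statistic from the paper): a handful of rare-variant observations is enough to tell which group's allele frequencies fit a person better.

```python
import math

def log_likelihood(carries, freqs):
    """Log-likelihood of a person's variant profile under a group's
    carrier frequencies (1 = carries the variant, 0 = does not)."""
    total = 0.0
    for c, f in zip(carries, freqs):
        p = f if c else 1.0 - f
        total += math.log(max(p, 1e-12))  # guard against log(0)
    return total

person = [1, 1, 0, 1]  # which of four rare variants this person carries
study_freqs = [0.40, 0.35, 0.10, 0.50]    # toy frequencies in the study group
general_freqs = [0.05, 0.04, 0.60, 0.06]  # toy frequencies in the population

in_study = log_likelihood(person, study_freqs) > log_likelihood(person, general_freqs)
```

With only four variants, the toy profile already fits the study group far better than the general population, which is why aggregate variant data turned out to be identifying.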
And basically, the whole concept of identifiability of individuals as belonging to a certain group became a critical ethical issue that has encumbered the openness of data, with the caveat that it is important for this data to be encumbered and controlled-access, so that only people who sign off, saying "I'm not going to do anything bad with this data, I'm not going to try to re-identify this individual, I only want a better understanding of this disease," get to use it. That kind of safeguard is really important. All of that happened with the Homer paper, around 2007-2008. After that there was The Cancer Genome Atlas pilot project, the 1000 Genomes Project (which is actually an open project, with some issues of its own), The Cancer Genome Atlas full project, and the ICGC. The Cancer Genome Atlas is the American part of this international large-scale effort, and the ICGC, together with work done at Hopkins and elsewhere, is really what led to this ICGC database. Basically, the analysis of exomes from many tumors, looking at the variation that exists and the similarities across samples, is what led us to understand that we need to do a lot of these, and at large scale, if we're going to get any insight into what's hidden in them. Both the Hopkins initiative and the Wellcome Trust work, along with the TCGA pilot project, were the foundation of the ICGC, and the people who worked on those were also at the founding meetings that led to the ICGC. The lessons learned from that work were that there's a lot of heterogeneity across tumor types, there's a high rate of abnormalities, there are driver versus passenger mutations, and sample quality matters, i.e.
how you treat the tumor, how you harvest it, how you store it, how you process it and so forth is really key. Those are things we took real care over when we started the ICGC, using that information. The ICGC had basically one goal: sequence tumor-normal pairs from 500 individuals for each of 50 different tumor types, and for each of these look at the genome, the transcriptome, the methylome and the clinical data, and actually follow the clinical data over time, so that we could use the genomic information to predict outcomes of the disease, and make this data available to the research community and the public. The ICGC is a humongous project, so it could not be done by any one country; it took the establishment of standardization, of quality measures, and the merging of data sets from all over the world to increase the power and make it possible. What we wanted was to keep track of information about the projects, about the patients, about the tumors, about the samples and how they were collected, about how material was extracted and manipulated, about the sequences, about the analysis of those sequences, and about the interpretation: all of that is what we were trying to capture in the ICGC initiative. This is an update of the figure Andrew showed you earlier. As of May 2014 we have 71 projects in about 18 jurisdictions; we call them jurisdictions because some of them are entities like the EU, or France and the UK working together, so 18 teams, mostly countries. We have received sequences so far for 42 of the 50 cancer types I mentioned, and we have over 10,000 cancer genomes for which we have some information, with something on the order of 28,000 promised. So basically all the samples are
queued up for the completion of the project. We're going to have something like 30,000 genomes at the end, which means more than 60,000 genomes, because they're all tumor-normal pairs, right? So it's a lot of genomes to handle. OICR is actually the headquarters, the data coordination centre for the ICGC. This is our growth curve of tumors for which we've been receiving data over the last few years; it's a steady growth curve. ICGC.org is the portal for all this data and all the information about it. If you click on the projects, you can see some details and drill further into each project to see what information it holds. On the top bar you have links for the data portal, for DACO (the Data Access Compliance Office) information, and for login. One of the big things is that we now have two data sets. There is the open data, which we can make freely available because it is not identifiable: it cannot be used to identify an individual. And there is the controlled-access data, which is in itself identifiable. Some surgical details and some clinical details are considered controlled access, but the biggest piece of the controlled-access data is raw-level DNA sequence, at the transcriptome or genomic level: a genome file, a BAM file, an alignment file. All of these are considered controlled access and therefore cannot be released freely to the public. On the DCC portal, which you can visit now, you get access to all the open data; open means you don't have to log in. At the top bar we have three categories: the cancer projects (I mentioned there are 71), the advanced search, where you can do what we call faceted searches, which I'll say a little about, and the data repository. If you go to the cancer
project page, each project has its own page; you can click around on your computer now, though not in as much detail as you might want. For example, if you go to dcc.icgc.org, there's a pie chart that, as you mouse over it, shows you the various parts of the data: within a pie slice are all the tumors of one tumor type, and where the slice splits up on the outer circle, those are the different projects on a similar body part coming from different groups, with the numbers for each. That's the same pie chart as on the projects page, because it's the overall view. If you go to the advanced search, on the left-hand side (in this image I have it on both the left and the right) you have what we call faceted searches: the ability to select which tumor types you want, which stage of the disease, which gender and so on, and as you click, it applies the selection and updates the pie chart and everything else. It's all very dynamic, and that searching is made possible by a very well-engineered back end to this database. All the advanced searches have a download option, so you can select which data sets you want for a given donor or group of donors, download them to your desktop, and do further work there. In the repository you can also go to any specific project and download all the files from that project. The example I have here is all the somatic mutations from the pancreatic cancer project done in Canada, which is actually done at OICR; OICR is running a pancreatic cancer project and has generated somatic mutation data that's available in the repository. All this data is open access, because somatic mutations are not identifiable; they're not germline variants. [Question from the audience] Yes, it's
very similar, yes, except for one thing, which I may have on a slide later (maybe I won't have time before lunch): they don't trust their somatic mutation callers. TCGA doesn't; we do, but they don't. The reason is this: why should I not trust a somatic variant caller? What happens if a mutation falls on a variant that is identifiable? Say position 2,333,000 carries a very rare SNP, so very few people have it; if you have ten of those rare SNPs, I know who you are. Now suppose you happen to have a mutation at that position and declare a somatic mutation. To know it's somatic, you have to know what's on the normal allele, so if you declare the normal allele, the SNP and the somatic mutation, then I learn that a variant exists at that position, and if I know that coordinate system, I may be able to identify the individual. When we realized that a few months ago, we decided to filter out all the mutations that land on known SNPs. At ICGC we make a separate file, available only through controlled access, containing the somatic mutations that land at positions where a known variant exists. It's rare: if you have a million mutations, there might be a hundred that fall where there is a known SNP and that become identifiable if you declare the normal variant at that position. So what we do now is, we actually lie: we don't tell you what the normal variant is, we tell you what the reference variant is. You may have a different variant at that position, so it's a sort of half-lie, but we document it and let you know, and if you get controlled-access permission, you can go see the real data. It's a known germline variant; that's why we're hiding it. The same goes for a lot of clinical data, like gender or age of diagnosis.
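The masking just described can be sketched as a small filter (hypothetical field names; the real release files are structured differently): a mutation overlapping a known germline SNP gets the reference allele reported in place of the donor's actual normal allele, and is flagged so users know to request controlled access for the real value.

```python
# Toy set of (chromosome, position) pairs where a germline SNP is known.
KNOWN_SNP_POSITIONS = {("7", 2_333_000), ("12", 25_398_284)}

def open_access_view(mutation, reference_allele):
    """Return the version of a somatic-mutation record that is safe
    to release openly, per the masking policy described above."""
    released = dict(mutation)
    if (mutation["chrom"], mutation["pos"]) in KNOWN_SNP_POSITIONS:
        # Overlaps a known germline variant: report the reference
        # allele instead of the donor's actual normal allele.
        released["normal_allele"] = reference_allele
        released["masked"] = True
    else:
        released["masked"] = False
    return released
```

The unmasked record would go only into the controlled-access file; the open file documents that masking occurred rather than silently altering the data.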
For older people, the age of diagnosis is actually given as a range, not an actual age: between 85 and 95, say, or maybe a smaller bracket. Those fields are not identifiable, but only because we've modified them to be a bit loose. [Question about getting access] Yes, you have to give away your firstborn. I'll go over it a bit later, after lunch, but basically you have to fill out a form: you have to describe your ethical review board approval, or explain why you don't have one, you have to declare your computer infrastructure to show it's secure, and you have to state why you want to look at this data. That gets reviewed by a group of bioethicists, who grant you access or not. The document is also signed by somebody who can fire you, so that should you stray from any of the rules you signed up to (that you will never try to re-identify anyone, that you will never share these files with anybody not authorized to see them, and so on), we can go to that person and say: this person violated the rules, you should fire them. That hasn't happened yet, but I think it puts the appropriate fear into people; whether people actually abide by these rules is a whole debate among bioethicists. So there's a form to fill out, it has to be signed by whoever signs off on your grants, or signs off at that level for your institution, and then it gets sent to the DACO office. They review it and grant you access for a year, and you have to renew every year; the renewal process is a lot easier, but it is a one-year permission. It's usually handled at the lab level. Historically, NIH does it per PI: the PI has to apply, the PI gets the login and password, and then the PI hands that login and password to everybody in his or her lab. We thought that wasn't a very good
idea, so what we did is make it very easy for a PI to add and remove people from his or her lab on the list of those who have access to the data, so that each of them has their own login and password. If you have a summer student, you can add somebody for three months, and when they're gone you take them off the list and they no longer have access. [Question about pipelines] Yes, right now every project has its own pipeline, and it's a big problem. This is why Andrew referred to the TCGA pan-cancer analysis: in this case we're going to take 3,000 genomes, apply the same pipeline to all 3,000, and publish that pipeline, making it publicly available. OICR is in the process of making all of our pipelines available as well. Everybody writes papers and publishes their pipelines as descriptions, but we're putting everything into a VM, a virtual machine, that you can download and run: put your data into it and run it, and it has all the tools with all the parameters, everything that we use. We're going to do that for the ICGC pipeline we'll use for the pan-cancer project. We're benchmarking it all; I'm not going to report on the benchmarking results yet, but we will publish them, and it's not pretty, I can say that much. If you get me a beer, maybe I'll tell you a bit more. No, I won't tell you more; but it's not pretty. Now, let's see what we have here. Oh yes: I have this file, which I downloaded from the ICGC repository; it's on my computer now and I can look at it, and I'm asking you to do the same and see what you can glean from it. I'm showing which folder it's in on my computer; on your computer it will be a different folder. Have a look at that file and see what you find. You can do that later, during the lab time.
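As a starting point for that exercise, a downloaded tab-separated file can be peeked at with a few lines of standard-library Python. The column names here are made up for illustration; use whatever headers your downloaded file actually has:

```python
import csv
import io

# Stand-in for open("your_download.tsv"): made-up headers and rows.
sample = (
    "mutation_id\tchromosome\tposition\tchange\n"
    "MU1\t7\t140453136\tA>T\n"
    "MU2\t12\t25398284\tC>A\n"
    "MU3\t7\t140453137\tG>C\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# A first look: group the mutation IDs by chromosome.
per_chromosome = {}
for row in rows:
    per_chromosome.setdefault(row["chromosome"], []).append(row["mutation_id"])
# per_chromosome -> {"7": ["MU1", "MU3"], "12": ["MU2"]}
```

Swapping `io.StringIO(sample)` for an `open(...)` call on the real download is all it takes to start exploring the actual file.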
So, I mentioned data nomenclature and metadata: it's critical for everybody working with these resources, and mistakes do get found. I talked about that with respect to GenBank, but it's true of any database, and every database has an email address you can write to if you have problems or can't find something. This is probably a good place for me to stop. It's 25 after the hour and we have lunch until 1:30, so an hour and five minutes, and then we'll come back and finish the second half of my lecture. Let's see whether I'm actually halfway through: slide 37 out of 70-something, so it's almost exactly halfway. Okay, we're going to break for lunch until 1:30, and as a reminder, this building is connected to two other buildings on the ground