So, I'm going to be talking to you about databases today. Trevor, in the first lecture, gave a great overview of all sorts of things having to do with cancer genomics; I'm going to take a deep dive into one aspect of it. I'm going to focus on really one big database that we use, but I'll give an overview of databases as well, complementing Trevor's initial lecture in getting you to think about cancer genomics.

As I mentioned in my introduction, I'm at Génome Québec now, which means I'm no longer at a cancer institute, but for the last 10 years I was Associate Director of Informatics and Biocomputing here at OICR. So I've been working on and thinking about cancer genomics for the last 10 years, and even though I've been gone for a few months now, I haven't forgotten it, I think. Just as a disclaimer: I'm going to mention products (we will mention Amazon this week, for example), and I don't have any shares in Amazon, so I don't profit from mentioning or supporting Amazon, except that they do support us, and I really want to acknowledge that. The catch with Amazon, as you'll soon discover, is that this week you'll be using Amazon Web Services and it's going to be free; then you're going to go home, and you'll have to pay, you'll have to use your own credit card to use Amazon. So just keep that in mind before you get addicted to Amazon. It is a good product though, it tastes real good. I'm also a former NCBI employee, I worked at the NCBI in the 90s, so I'm a big advocate of what they do, and that comes through in the way I present things. And like I said, I was also at OICR, so I'm definitely going to talk about some of the OICR products in my lecture. So that's my new email address and my Twitter handle; bioinformatics.ca is the workshop Twitter handle.

Our objectives: we're going to review databases used in bioinformatics, and used specifically in cancer genomics. We're not going to do all the databases; Trevor actually did a really good job of touching on many of them. I'm going to focus on a few. You need to be aware of many of the resources, but there are a few resources that link out to many other resources, and those, I think, are really the key ones; one of those is the one I'm going to talk about. And we're going to talk a bit about visualization. After lunch, Ann is going to do a lab on IGV. Trevor talked about IGV a little bit and showed you some screenshots from it; you're going to use it this afternoon, and it's going to be a key tool for looking at genomic data.

At the beginning of all this: as I mentioned, I've been doing these workshops since 1999. They've lived at different homes, and whenever I've moved in my career, I've taken these workshops with me. At the core of this workshop series is bioinformatics, and that's what you're interested in here, I'm assuming, obviously bioinformatics with respect to cancer. So as a first question, why do we have bioinformatics at all? One answer is that it's to deal with open data from genomics and proteomics technologies.
Sequencing has generated tons and tons of data, and we aren't able to do it by hand anymore. We aren't able to look at sequences, memorize them, and work from that. Not too long ago, 15, 20 years ago, I knew a guy who knew the first and last ten amino acids of every protein in the yeast genome. He could look at a slide of DNA sequence and, in a few minutes, identify which proteins were encoded by that DNA. It was Asperger-ish. But that said, you can't do that anymore with the amount of data we have. 15, 20 years ago, you could do it with a small genome like the yeast genome. But with the human genome, and with the thousands, if not hundreds of thousands, of genomes that are available now, you require computers, you require databases and so forth.

In the old days of CBW, we used to teach BLAST. How many of you have never heard of BLAST? A few. You've never heard of BLAST? OK. Anyway, BLAST is a classic bioinformatics tool, which is still heavily used today by various websites across the world. It was invented because you needed a tool to search GenBank. GenBank was, and still is, a DNA sequence database, and the resource itself required a tool to go look for things in it. So the advent of an open database like GenBank allowed BLAST to come to be; it had to be invented to allow people to search that database. It's a chicken-and-egg thing, but basically the open data required open tools to make discoveries possible.

And you have to think of bioinformatics as doing an experiment. A quick show of hands: how many of you have life sciences as your primary starting point, you started as a life scientist? And how many of you started with more of a computer science or stats profile? When we do experiments in the lab, we mix things up, we pipette things, we run reactions and so forth. We do experiments. In bioinformatics, you're also doing an experiment. Biologists don't usually think of bioinformatics as doing experiments, but you are. You take a sequence, which is basically your reagent. You have a method you apply to that sequence: for BLAST, there are different types you can run, protein-protein comparisons, nucleotide-nucleotide comparisons and so forth. Then you get results, and you have to interpret those results. So you have to know your reagents, you have to know your methods, and you have to be able to do the interpretation. And you have to think about controls. You know what controls are, so I'm not going to test you on that. But what's an example of a control you would need for a bioinformatics experiment using BLAST? A reference sequence? Good. Let's say you have a sequence that you want to identify, and you know that sequence exists in the database. You'd want a positive control: make sure you can find it. If you know the reference is in the database and you can't find it using BLAST, that means you're using the wrong parameters or you're doing something wrong.
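To make that concrete, here is a minimal sketch of the positive control in Python, assuming Biopython is installed and that positive_control.fa is a hypothetical file holding a sequence you already know is in GenBank:

```python
# Positive-control sketch for a BLAST "experiment" (assumes Biopython;
# "positive_control.fa" is a placeholder for a sequence known to be in nt).
from Bio.Blast import NCBIWWW, NCBIXML

query = open("positive_control.fa").read()
handle = NCBIWWW.qblast("blastn", "nt", query)   # nucleotide vs nucleotide
record = NCBIXML.read(handle)

if not record.alignments:
    # The control failed: suspect your parameters or database choice
    # before trusting any "not found" result on your real sequences.
    print("Positive control FAILED - check parameters/database")
else:
    best = record.alignments[0]
    print("Top hit:", best.title, "| e-value:", best.hsps[0].expect)
```

If the positive control fails, fix your method before you believe any negative result from the real experiment.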
Likewise, when you know something is not there, you shouldn't find anything: you want the negative control of not being able to find something. So you have to think about bioinformatics experiments the same way you would a biological experiment. Think about your interpretation, think about your controls, what should be there, what shouldn't be there. And if you can't find something, is it because it's not there, or is it because your method is wrong? You have to think about those things. This is very basic stuff, and you've probably thought about it a gazillion times.

But now I'm going to ask you to define bioinformatics, and I'm going to do a think-pair-share exercise. Talk to the person next to you and write, in a few words, 140 characters or less, a definition of bioinformatics. So pair up with the person next to you; two or three of you can work together. I'm just going to take a minute or so for you to write something down, and then I'm going to ask around for your answers.

OK, does anybody want to share their answer? Yes? "Using computers and algorithms to parse and analyze biological data and answer pertinent questions." That's very good, that's a good one. Somebody else? You got blown away by that answer, huh? "Applying computational tools to answer biological questions." That's another good one; actually, it's very close to mine. Anybody else? Yes? "A system or framework applied to biological data to make sense out of it." OK, very good. There are 40-plus bioinformaticians in this room, so there are going to be 40-plus answers; that's a given. My answer, which I've thought about over the years: bioinformatics is about integrating biological data together with the help of computer tools and biological databases, and gaining new knowledge about the system under study, which is basically what all of you said as well.

It's important to keep in mind that this is not as big a deal now as it was, say, 10 years ago, when bioinformatics was second tier to genomics and second tier to cell biology and so forth. I think it's a field of its own; it stands on its own quite well. But it's also required, I think, for all biology students to do bioinformatics, computational biology. And I won't get into the is-it-computational-biology-or-bioinformatics debate; I actually disagree with the premise of that debate. I think it's all the same.

The other soapbox item I want to mention is that I'm really a big advocate of openness: open data, open source, open courseware. The training material that we offer at bioinformatics.ca has been open for the last 10-plus years. It wasn't always open, but we finally saw the light: there are not enough of us training people, so if we make our material open, it will help others use our material and then train other people. We're part of GOBLET, a global organization that shares training material all over the world. And then open data: if the data is not open, people can't look at it, and if people don't look at the data, they won't make discoveries. I'm really a big advocate of the more eyeballs on the data, the better; that's how you get discoveries made. With human data, as we'll discuss later in this lecture, it's a bit more complicated because of the ethics and the privacy concerns and so forth.
So there's an extra layer of data access there, but we're trying to make it as painless and as easy as possible to get people to look at the data, so they can make discoveries and cure cancer and so forth. It's really an important part of the ecosystem we work in. Open source software is a given. The biggest one, actually the most controversial one, is probably open access publication. It's controversial in the sense that some people don't want to pay for publication; they'd rather just submit papers and have them published. But when you publish in an open access journal, not only do you keep copyright with most open access journals, you also make the material available and discoverable by all. Many journals have a publication fee for open access publishing, but that makes the article openly available, which allows people to make discoveries and to gain from the work that you've done. And most funding agencies now require it; they didn't right away, but most of them now make it a requirement of funding.

So this lecture is about databases, about that reagent I talked about in my BLAST example. Understanding the reagent is really what allows you to choose the best tools to work with that reagent, and also to make the best interpretations of the results of your bioinformatics experiments. So a really key part of the beginning of this lecture series this week is to understand the databases: what they have, what they don't have. When you don't find something in a database, is it because it's not there? Or is it because you used the wrong tool, or you misinterpreted what you thought was there, and so forth? It's really an important thing to think about.

Databases can be thought about in different layers, and there are examples of all the things that live at each layer. The metadata is the first part: the data about the data; people know that terminology well. A create date is an example: when was this record created in this database? Submission date, submitter, submitter's ID; an ORCID iD is an identifier that uniquely identifies all the people who write papers, and so forth. That's an example of metadata that could be published with a dataset. Then the data itself: in the case of GenBank, it could be a GenBank flatfile record. It could be a COSMIC record (COSMIC is a database of somatic mutations in cancer), or an interaction record, protein A binds to protein B, in an interaction database, or the title of a book, or the book itself. Then there are the storage systems: a box on your shelf is a storage system; Oracle is a commercial database system; MySQL and PostgreSQL are open-source relational database management systems; binary files, Unix text files, or a bookshelf are all examples of storage systems. And then the query system, how you interrogate your database, is also a key piece. If it's a small database, if you have a list of all the things on your bookshelf, you just look things up in the list. A catalogue, index files, SQL (Structured Query Language), Elasticsearch, and grep, a Unix tool which I think we'll get to use this week: with grep you can quickly look through a file for a string, so it's a good Unix tool for querying things.
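Just to make that grep level of querying concrete, here is a tiny flat-file query in Python (the file name is hypothetical; the shell equivalent would be something like grep -n "TP53" records.txt):

```python
# A "query system" at its simplest: a linear scan over a flat file.
# Fine for small databases; indexes (SQL, Elasticsearch) exist precisely
# because this doesn't scale to GenBank-sized data.
def flat_file_query(path, term):
    """Yield (line_number, line) for every line containing term."""
    with open(path) as fh:
        for num, line in enumerate(fh, start=1):
            if term in line:
                yield num, line.rstrip()

for num, line in flat_file_query("records.txt", "TP53"):
    print(num, line)
```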
And finally, the information system. Entrez, for example, at NCBI is an information system. Google is an information system: when you do a query in Google, you're actually querying the whole Google space. And there are lots of these: Entrez, as I mentioned, Ensembl, the UCSC Genome Browser, and ICGC. ICGC, which we'll spend some time on, is the International Cancer Genome Consortium, and its database is itself a whole system. If you go to NCBI, it's basically a repository of multiple databases, multiple query engines, and multiple information systems. They have a 3D-structure information system; they have ClinVar, which Trevor talked about earlier today; and they have things to deal with all the sequences and so forth. And, as with Entrez and ICGC, there are submission systems which, if you're going to submit to those resources, you have to understand and use. There's training material, there are things for developers, things for analysis and research, and so forth.

The DNA sequence databases are actually a slightly complicated system: it's three different databases which are all the same. GenBank has the same sequences as EMBL, which has the same sequences as DDBJ. They each have their own submission system, but you don't have to submit a sequence to all three, because they do the syncing of the three databases themselves. This slide, from Rolf Apweiler's chapter in this book here, shows that a sequence submitted to GenBank will end up in EMBL and will end up in DDBJ. They are the same sequences; each database just has its own way of representing that data. If you look at an EMBL record, or an ENA (European Nucleotide Archive) record, it looks different from a GenBank record. Both are human-readable formats, but they have the same data: the same sequences, the same annotations, the same create date, the same metadata, the same authors, the same papers and so forth. They look different from a human perspective, but from a machine point of view it's the same data.

At NCBI, you can go see all the things they have, and you can get an up-to-date summary of how many records there are in each of their databases by using the "all" filter on their website. NCBI offers literature through PubMed; health records through things like dbGaP and ClinVar (we'll come back to dbGaP later). RefSeq is the reference sequences of all the genomes, and not just human: they cover all organisms, with RefSeqs for RNA, protein, and genomes. They have proteins, they have 3D structures, they have protein sequences, they have chemicals in PubChem, and so on. The whole computational space is served by NCBI and EBI and many other players. So there's the NCBI homepage.

We're going to be dealing with DNA sequences this week, and RNA sequences, but mostly DNA. They exist in multiple formats, and when you look at a file, knowing which format it is tells you which tools work with that format. If you have different formats, then you have to deal with different tools.
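As a taste of why format awareness matters, here is a rough format sniffer: a heuristic sketch only, peeking at the first non-blank line, with a hypothetical file name; the real specifications live on the UCSC page I mention next:

```python
# Guess a sequence-file format from its first non-blank line (heuristic!).
def sniff_format(path):
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                return "FASTA"
            if line.startswith("##fileformat=VCF"):
                return "VCF"
            if line.startswith("@"):
                return "FASTQ (or a SAM header)"
            if line.startswith("LOCUS"):
                return "GenBank flatfile"
            return "unknown - check the format documentation"
    return "empty file"

print(sniff_format("mystery_download.txt"))   # hypothetical file
```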
A description of all the formats, their evolution and so forth is on this page at the UCSC (UC Santa Cruz) website, and I invite you to go look at that if you want specific details about particular formats. For the purposes of this week, it's good to think about two main categories: what we refer to as archival or primary databases, and then secondary and/or curated databases. This is slightly misleading, because actually even the primary ones are also curated a little bit. There's curation that goes on in a GenBank submission, but there's even more curation that goes on in a RefSeq record.

So, RefSeq: how many people have used RefSeq before? And where does RefSeq come from? Not the name, the records. For all of you who use RefSeq, you should know this, because you're using it; you should know where it came from. Yes, RefSeq is an NCBI product; with me lecturing, that's an easy guess. But where does a RefSeq record come from? How is a RefSeq record built? Basically, RefSeq comes from GenBank. The GenBank records belong to the authors: they're maintained and managed by NCBI, but they belong to the submitters. RefSeq belongs to NCBI. So what they do is pick what they think is the best record out of the whole pile, make a copy of it, and put it in RefSeq, and because now it's their record, they start editing it. They'll record the original accession number, where it came from, but they'll give it a new RefSeq accession number, and then they'll annotate it.

What they do in RefSeq, of course, is try to have one of everything. There'll be one reference human genome; there'll be one beta-gal from E. coli; there'll be one of each mRNA they know exists and for which there is a record in GenBank. If you look up beta-gal in GenBank, there'll be thousands of beta-gals. Everybody's sequenced that gene, because it's on pBR322 and it's on vectors and so on. Anybody can submit to GenBank if they've sequenced something, and even if it's in GenBank already, you can still submit it, because it's linked to your paper, to the work you did; that's the version of beta-gal you sequenced, and that's the one you want to submit. So there are thousands, if not tens of thousands, of beta-gals in GenBank. But in RefSeq, for E. coli, there'll be one, and there'll be one of each of the proteins; and for the human genome, there'll be one of each of the distinct mRNAs. If there are multiple transcripts for a given gene, all of them will be in RefSeq.

So RefSeq comes from the original data, so it's real sequence, annotated by NCBI folks in a standard way. Within a given organism, they'll always annotate the genes the same way: the name of the protein and the nomenclature will be standardized for that organism, the same across all the proteins. So, as useful as GenBank is, RefSeq becomes the most useful thing, especially if you're working on a genome that has been studied extensively, like the human genome.
In our case, the human genome, RefSeq becomes a key resource, because we have all the references, all the genes, annotated and interpreted the same way across the whole human genome. So that's a really key reference, and RefSeq is one example. Gene is another database from NCBI, which has the full story on any given gene. If you take TP53 and look it up in Gene, say TP53 in humans, you'll have the transcripts, the genomic context, all the details, the interactions, the publications, the summary and so forth about that gene in one place. So Gene becomes a very useful place to get information about any given gene, especially human genes; NCBI does all organisms, but it does the human stuff quite well.

One important thing to know, though, is that the interpretation of what a given gene does often comes from what it does in a model organism. That's why the MODs, the model organism databases, are key to our understanding of the human genome: there are lots of experiments figuring out function and mechanism in the model organisms, and that's how we get to understand how things work in humans. That's what this slide is about. Each of the model organisms, Drosophila, mouse, rat, yeast, C. elegans, and zebrafish, has its own website, with a lot of function and annotation attached to genes; you have gene orthologs between human and these MODs, and from that you deduce the function in humans as well. That's really a central point of how many of these things are done.

There are lots of known databases. In Nucleic Acids Research, every January, there's a database issue, a special issue that comes out, and there are so many databases that they actually can't publish them all. Every database wants to be in the database issue every year, but there are too many, so many databases only show up every second year. The top-tier ones, from NCBI, EBI and so forth, show up every year, but all the other ones show up every other year, so you always have to look at two issues of the NAR database issue to get a complete picture. Between the 2016 and 2017 editions of that journal, you'll have most of the databases known in the community. Interestingly, the one database I'm going to talk about today, ICGC: I've been bugging the group to write a paper for the database issue. They haven't done it yet, but maybe eventually they will.

GenBank flatfiles are your standard DNA sequence record. There's what I call the header, which has the title, taxonomy, and citation that apply to the whole record; the features (for example, an mRNA record will have the amino acid sequence, and a genomic sequence will have a gene and may have regulatory sequences and so forth); and the actual DNA sequence. You have to keep in mind this is what we call a human-readable format. It's meant for humans to read, and for human chromosomes it becomes not very useful, because you end up with too much information.
So you have to start using other databases to interpret that, RefSeq being one of them, where you can look at things more at the transcript level and gene level, and it becomes much more useful. Basically, GenBank is where a lot of the data came in; then it went into RefSeq, and out of that came Gene and the RefSeq resources.

This next bit is a little historical, but it helps you understand what you're looking at. When you look at the accession number of a record, there's actually metadata in the accession number itself. A GenBank accession will be one letter and five digits, or two letters and six digits; these are the GenBank/EMBL/DDBJ records. Proteins have different letters: three letters plus five digits is usually a GenPept protein accession. And RefSeq accessions are two letters, an underscore, and a number; whenever you see that format, you know you're dealing with a RefSeq record.

Then the other very important thing is the dot version. So what is the .1? Say I have L12345.1 and L12345.2: what's the relationship between those two versions of that single accession number? Different versions, okay, but versions of what? Annotation? Sequence? Both? How many think it's annotation, how many think it's sequence, and how many think it's both? Raise your hands; higher, I can't see. Okay: a version change refers only to the sequence, not the annotation. You can have the annotation change, and it will not change the version number. And how many nucleotides need to change to change the version number? One, a hundred, a thousand, a million? One. Change the record by one nucleotide, delete a nucleotide, change a nucleotide, and the version changes. So if you see a sequence at .32, that's the 32nd version of that sequence. It's still the same gene, the same region of chromosome or whatever, but it's version 32: there have been rounds of DNA sequence changes, and the coordinates have probably changed.

When Trevor was referring to build 38 and so forth (I'll come back to that), it really matters because a build is a whole coordinate system, and that's true at the GenBank record level as well. This is why it's critical, when you report a sequence that you used in a tool or in a publication, that you include the version number. If you don't include the version number, you don't know what the coordinate system is. If I'm using the .1 of that sequence, then my statements will always be true in that coordinate system. But when it goes to .2, if I've shifted everything by 100 nucleotides because I deleted 100 nucleotides, it's no longer true: that promoter I identified at minus 20 in front of my gene, or minus 100, or minus 2000, is no longer at that position if the sequence has shifted by one, or a hundred, or a thousand, or a million. It will always be true if you're referring to a version number. So the version number is critical whenever you report a sequence, in a paper, in a journal club, whatever, so that you know exactly which sequence you're talking about, okay?
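Here is that accession-number metadata as a small sketch. The patterns encode the letter-and-digit shapes I just described, and the example accessions are just for illustration:

```python
# Classify an accession by its shape; the ".N" suffix pins the exact
# sequence version (and therefore the coordinate system you're citing).
import re

PATTERNS = [
    (re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$"), "RefSeq"),
    (re.compile(r"^[A-Z]\d{5}\.\d+$"),      "GenBank/EMBL/DDBJ nucleotide"),
    (re.compile(r"^[A-Z]{2}\d{6}\.\d+$"),   "GenBank/EMBL/DDBJ nucleotide"),
    (re.compile(r"^[A-Z]{3}\d{5}\.\d+$"),   "GenPept protein"),
]

def classify(accession):
    for pattern, label in PATTERNS:
        if pattern.match(accession):
            if "." not in accession:
                return label + " - but cite the .version!"
            return label
    return "unrecognized accession shape"

print(classify("L12345.1"))     # GenBank/EMBL/DDBJ nucleotide
print(classify("NM_000546"))    # RefSeq - but cite the .version!
```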
So RefSeq accessions, as I mentioned, are two letters, an underscore, and a number. This is your cheat sheet of what the prefixes mean: genomic, mRNA (NM_), protein (NP_), and so forth. The N, of course, stands for NCBI, so NM_ is an NCBI mRNA RefSeq, and so on. And there are many more prefixes that came along over the years: different types of genomic sequences, different types of assemblies. Some RefSeqs don't necessarily correspond to an existing sequenced molecule; they're computed gene models. Those are a bit more rare now, but there were quite a lot of them at the beginning. And there are further collections of other types of proteins and so forth.

So, FASTA files: how many of you know what FASTA files are? I think you do. Okay, that's good. FASTA is the simplest way of representing a sequence, and it can be a DNA sequence or a protein sequence in FASTA format. Basically you get a greater-than sign, a line of description of what that sequence is, and then the sequence after that. Then the next record after that starts with another greater-than sign. The thing about FASTA is that it has to have that header line, but a greater-than sign with nothing after it is also legit FASTA format. It's not very useful, because you have a sequence that doesn't have any description of what it is. Ideally you have a unique identifier, and, as in this case here, a sequence ID with a dot version number in it as well, so that you know which version of that sequence it is. (I'll show a tiny reader sketch in a moment.)

The NAR database issue I mentioned has all the top databases that are there every year, not every second year, and there's one paper about the database resources at the NCBI, which describes GenBank and many of their other resources.

Another thing about visualization: genome browsers. The standard, most commonly used one is probably the UCSC (Santa Cruz) Genome Browser. Ensembl is heavily used, and the least used is the NCBI one, for some reason; it never made it into the highly-used browser category. Ewan Birney, who's at the head of the EBI right now with Rolf Apweiler, did a survey one year, a very informal one, at a genome conference: he went around all the posters and counted how many used UCSC to represent their genome, how many used EBI's Ensembl, and how many used NCBI. He got something like 10 to 6 to 1 across those three categories. So a lot more UCSC, about half as much Ensembl, and a really small proportion NCBI. Those were probably all NCBI people.

And because ICGC does things their own way and so forth, they have a browser as well; yes, yet another browser. The importance of this one, though, as we'll see later, is that it's integrated with the rest of the ICGC system. So within the ICGC space you have a browser, and it helps you understand where you are in the map of all the things that are there. And then I mentioned the Gene database, really the best way to look at information about genes. This is an example for TP53, and it has all the synonyms and transcripts and so forth. It's a really useful way of looking at things, and I invite you to have a look at it; there are links out to Gene from, for example, the ICGC. We'll come back to that later.
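Here is that minimal FASTA reader sketch (file name hypothetical), which also flags the legal-but-useless bare ">" header:

```python
# Minimal FASTA reader: yields (header, sequence) for each record.
def read_fasta(path):
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

for header, seq in read_fasta("example.fa"):
    if not header:
        print("legal FASTA, but no identifier/description!")
    else:
        print(header.split()[0], len(seq))   # ideally the ID has a .version
```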
This is the assembly reference page. Trevor talked about builds, and it's important to know which build you're working from, so that you know the reference points. On a given build, gene X will be at the exact same position all the time, because that build fixes that gene at that exact coordinate: nucleotide 2,272 is there, and everybody on that build has the same nucleotide there. So it's important to know what those are.

Builds get updates: patch levels. There's patch level one, two, three, four, five for a given build, and what a patch-level change does is change the coordinates within a specific region. A specific gene will have been updated, so they'll update that gene in a patch level, but they will not change the boundaries of that gene. Within the boundaries of that gene, things may get reorganized and updated. A lot of the complicated regions, with lots of repeats, or the immune-system genes on chromosome six, for example, get reorganized and updated this way. There are still some gaps; things are not fully, fully finished in the "finished" human genome sequence, and there are still updates via patch levels until they're done.

Going from build 37 to build 38, for example: the time between builds is getting longer and longer. We're talking about years now. Rebuilding the coordinate system takes more and more time and requires more work, so we're becoming more and more conservative. Whenever we want to do a new build, there's so much other work that needs to be done besides the assembly of the genome, a whole curation and remapping effort, that there's really a strong incentive not to do a new build. That said, the technologies are changing in ways that make a reference genome build almost antiquated, in the sense that we will soon, and I'm talking about a few years, less than five, I think, have de novo assemblies of human genomes. We won't map to a reference genome; we'll have a human genome assembled from scratch with different technologies. There are long reads, there's 10x, there's PacBio, Oxford Nanopore, and other companies I've heard about recently that are going to make de novo assembly of human genomes, not reference-based genomes, a common thing. So it won't be referenced against build 38 or whatever the next one is going to be. It's going to be a new assembly for every genome, basically.

And that's going to be very important for cancer. Cancer being a disease of the genome, it's going to be very important to have a reference genome that is basically the normal sequence: the normal sequence of that individual will be the reference genome, against which you'll compare the cancer genome. The tumour DNA versus the normal DNA of that individual, that's the comparison you're going to be doing. Right now, both of these are mapped against the reference, and then you compare the two mappings together.
That's sort of a recipe for disaster, for getting it wrong, and I think we're still missing a lot of things. If a sequence isn't in the reference genome, you can't map it; if you have sequences in the individual that don't exist in the reference, then you're lost, and you can't map them. So ideally you'll need a reference genome for that individual, the normal DNA, and then you'll compare the cancer DNA against it.

This is a quick one-page summary of the history of human genome data. How many of you remember expressed sequence tags, dbEST? Not too many; I guess I'm too old. dbEST was the expressed sequence tag database: short reads off RNA, which was a way of getting a quick snapshot of which genes were being expressed, and that was the standard way of looking at gene expression at the beginning. There was some mapping, of course, population work like the HapMap project, and many GWAS studies.

Then there's the infamous Homer paper. How many of you know of the Homer paper? GWAS studies were basically genotypes across large cohorts, thousands of people, and they used to be totally open and available to the public. In the Homer paper, they showed that, given an individual's genotype at enough SNPs, they could say whether that individual was part of the control group or the disease group. And the disease group could be schizophrenia, could be bipolar disorder, could be susceptibility to alcohol abuse, and so forth. So they were able to identify whether an individual belonged to the schizophrenia group or the control group. Because of the Homer paper, all the GWAS studies became controlled access, so you had to get special permission to access that data. In a way it was a negative thing on the openness front, but it was an important thing to protect people's identities, so that they don't become identifiable as, say, having susceptibility to schizophrenia. So that was a landmark paper at that time.

Then there's The Cancer Genome Atlas, which Trevor referred to: the NIH-led, large-scale sequencing of cohorts of a given tumour type. And, at the end there, ICGC, the International Cancer Genome Consortium, which actually encompasses and includes TCGA, The Cancer Genome Atlas, as well. In between there was a pilot, the 1000 Genomes Project, which was basically genotyping large cohorts of individuals across the planet; then the actual TCGA project, and then ICGC.

So around 2006 to 2008 was the beginning of the large-scale analysis of multiple cancers. Some of it was targeted genes, some of it was exomes, and there were no whole genomes done at that time, but it was really large-scale analysis of many genes across multiple tumour types. And out of that came some important lessons, things that we take for granted today; this is when these discoveries were made, less than 10 years ago. There's a lot of heterogeneity, within and across tumour types. There's a high rate of abnormalities.
And out of that came the concept of drivers and passengers; we'll come back to that. And sample quality: how you handle the tumour block or tumour resection you're working with, how you manipulate it, how you extract your DNA and so forth, is also key.

So then the idea of the ICGC came out: a consortium including many groups across the world. The idea was to collect 500 tumours per tumour type, across 50 different tumour types. So it's a 25,000-tumour genome project, but you're doing tumour-normal pairs, so it's actually a 50,000-genome project: 50,000 genomes, half of which are normal and half tumour. Out of these genomes, you're going to extract genomic sequence, transcriptome, methylome, and clinical data. The idea is that from that one snapshot taken before treatment, so treatment-naive tumours, times 500 per tumour type across those tumour types, you're going to get some insight into the hallmarks of the different tumour types.

So why do you think we picked 500 per tumour type? Power; that's a good answer, very good. If you do 500, without going into the calculation, what sort of range of events can you find: one in a thousand, one in a hundred, one in ten? One in a hundred is about right, yeah. If you sample something 500 times, an event that happens one time in a hundred you'll catch essentially for sure. So it's a 1%-type event. But what's the big assumption you're making when you make that statement? That the event you're trying to find will be captured in those 500 samples; that's one thing, but what's the other, bigger assumption? The big assumption we're making is that the 500, let's say breast tumours, are all the same subtype, right? And we actually don't know that until we do the experiment. It turns out that breast cancer has about 10 different subtypes, so breast is probably the worst example. But take one we initially thought simpler, say prostate cancer: maybe two or three subtypes there. If you only do 500 prostate tumours and you've got, let's say, three subtypes, you don't have enough power to pick up things that are at 1% within a subtype; you're going to pick up things that are more like 10%.

So the ICGC is almost 10 years old now, and I can tell you this, 10 years later, after we decided to start the project: the power calculation, the choice of 500, was based on the assumption that if we got 500 of a given tumour type, they would all be the same tumour type or subtype. Turns out that wasn't quite right. That's one of the challenges of ICGC. Anyway, that said, the ICGC's scope is humongous: 25,000 genomes plus their matched normals. No one country can do that, although one did almost try, so it's an international effort, and that's why we got the ICGC together to do it that way. I have half an hour left and lots of slides, so I'm going to try to speed up a little bit. But working together made it possible to standardize, to put quality measures in place, and to try to do the same thing across all the various tumour types.
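Here is that back-of-the-envelope power argument as arithmetic; these are my illustrative numbers, not the formal calculation:

```python
# Chance of seeing at least one example of an event in n sampled tumours.
n, p = 500, 0.01                 # 500 tumours, a "1 in 100" event
p_detect = 1 - (1 - p) ** n
print(f"{p_detect:.3f}")         # ~0.993: caught essentially for sure

# The hidden assumption: one homogeneous tumour type. If the 500 split
# into, say, 3 subtypes, a 1% event *within one subtype* is only sampled
# ~167 times:
print(f"{1 - (1 - p) ** (500 // 3):.3f}")   # ~0.81: noticeably less certain
```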
So in the cancer database, the ICGC cancer database, you want to have information and metadata about the project, about the patients, about the tumour, about how the sample was prepared and handled, where the DNA is available, the analyses, the versions of the analysis tools, the version of the reference genome against which everything was called, and all that kind of stuff.

ICGC as of this year: spring of 2018 will be the 10-year anniversary of ICGC; it's a 10-year project. So the goal is to finish the 25,000 genomes. I'm not sure we're going to quite make the deadline, but we have more than 50 projects: 89 projects now, although not all of them are 500 tumours. Some of them are 100 or 200. Why would we accept a project with only 100 tumours, as opposed to 500? Rare types of tumours, exactly. There are a number of tumour types for which it's really hard to get even 100, and you want to include those in the project as well. Every continent is included in this project except Africa, and that's an unfortunate thing: Africa, of course, has the most genomic variation; there's more variation within that one continent than within the rest of the world. So we actually missed a great opportunity by not including projects from that part of the world.

The databases are growing. I remember, I think the second or third year we taught this course, one comment in the survey was: I don't ever want to see another growth curve about how the databases are growing; I'm tired of it. I keep showing it anyway. They're growing, we have to deal with that, and I think it makes the picture clear.

So ICGC has a website. It basically lists all the projects and has links out to the projects themselves. The big item on the homepage of the ICGC website is the Data Portal link at the top; that's probably the most important link. DACO is the Data Access Compliance Office: this is where you go if you want access to human data, the controlled-access part of it. The mutations are not controlled access; all the somatic mutation data is open access. But the germline variants are controlled access, because germline variants can identify the individual. Somatic mutations cannot identify the individual, except if you already have the DNA of that individual and his or her tumour, in which case you can say: yes, this is him or her, because I see the same tumour, the same mutations. But you cannot look at a somatic mutation profile and know who the person is. There's documentation, and there's a login as well.

So this is the DCC, the Data Coordination Centre. This is the hub, the data hub of all the ICGC data, and it has lots of quick links to pages within the site. The gene pages and the advanced searches are the two big important ones on this page. The cancer projects: all the pages have what we call facets in the left panel, so you can select which of the tumour types you want to look at, which gender, which gene, and so forth. When you do that, it automatically updates the right panel, which has a summary of what you've selected. Initially it has everything; it loads with everything, and then you start clicking, in the left panel, the things you want or don't want, and it automatically updates.
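Those facets sit on top of a web API, so you can script the same queries. A hedged sketch: I believe the DCC exposed REST endpoints under dcc.icgc.org/api, but treat the exact path and response field names here as assumptions and check the portal documentation before relying on them:

```python
# ASSUMED endpoint and field names - verify against docs.icgc.org.
import json
import urllib.request

url = "https://dcc.icgc.org/api/v1/projects?size=100"   # assumed path
with urllib.request.urlopen(url) as resp:
    payload = json.load(resp)

for project in payload.get("hits", []):                 # assumed field
    print(project.get("id"), project.get("name"))
```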
So it has that at the cancer-project level: there's the project list, and if you drill down into a project itself, it will tell you how many of the various data types there are in that project. There's a gene page: if you drill down to a gene, you can look at that gene across all tumour types, in a specific tumour type, or in a group of tumour types. If you're interested in, say, pancreatic and stomach cancer, you can look at those two cancers together and see how that gene is mutated in them. And since last year it has Reactome pathways; you'll have half a day on pathway analysis in Reactome on the last day of this workshop. Each mutation, and there are millions of them, has a single identifier, and they're all labelled and queryable, so you can look for the same mutation in multiple tumour types, for instance when you find what you think is a driver mutation.

Does everybody know the difference between a driver and a passenger mutation? Anybody? "A driver mutation is driving the disease: it's leading to tumour growth, and if you target it, potentially you could have a regression of the tumour, whereas background mutations are there but not necessarily contributing to disease." Yeah. In any given tumour, you will have many mutations that are inconsequential, or that don't appear to have a consequence: either they're in a part of the genome that does nothing when modified, or they're in an important part of the genome but not in a gene involved in driving that process. Many of the tools we use try to identify which mutations are the drivers; that's an important sub-discipline of cancer genomics and bioinformatics.

Like I mentioned, there's a browser as well. And you can do advanced queries. The interesting thing about advanced queries is that you can also save them and come back to them later, once there's a new release of the database. Right now we're at release, I forget the release number, but there are two or three more releases left till the end of the project. After that, the release cycle will probably wind down, because ICGC will end and we're going to enter a new project called ICGCmed, which I won't have time to talk about during the lecture, but I'm happy to talk about later. It's basically ICGC 2: a more longitudinal study, before, during, and after treatment, tied to clinical trials and so forth, at a much larger scale. ICGC was 25,000; ICGCmed is going to be 200,000. A much larger scale, at about 5,000 per tumour type, not 500, ten times more. So that will deal with the power-calculation challenges.

There are some analysis tools on the portal as well, and the repository: all the release versions, including the previous ones, are present on the data portal. For example, if you go to the release page, it will show you all the versions, all the files, all the tumour types, and you can download them if you're so inclined. This is an example where you download a file onto your laptop; in my case here, I made a directory I called 2017 bioinformatics cancer genome data.
And then I can manipulate that with some UNIX tools and things like that. (You're starting to give me the signal; okay, you're just standing up. Okay.)

So, I mentioned ICGC and TCGA. There are some little gotchas that are really important to understand about these databases. OICR is actually the host, the home, of ICGC; the ICGC software developers are a floor up, down the hall here. ICGC is actually part TCGA and part the rest of the world: TCGA is the US, and the rest of the world is the rest of the world. So whenever we talk about ICGC and TCGA in the same sentence, what we actually mean is TCGA and the non-TCGA part of ICGC, because ICGC as a whole includes TCGA, right? The full sentence should be "TCGA plus the non-TCGA part of ICGC."

The big challenge is that there are some subtle differences in how those two groups are handled. One is data access. For TCGA, data access goes through dbGaP: dbGaP is a website at the NCBI where you ask for permission, filling out the forms and so forth, to access TCGA. For the rest of the world, there are fortunately not 17 other places you have to get permission from; there's one other place, and it's DACO, on the ICGC data portal side. From there you can get access; now, here I'm talking about controlled-access data, not somatic mutations. So, as I mentioned, TCGA is part of ICGC, but there are lots of differences: they cover different tumour types; there are different definitions of what is controlled access, and I'll come back to that; different data access rules; and different geographic rules, many countries versus one jurisdiction.

For ICGC, what's open access versus controlled access? Think of controlled access quite simply as whatever data could be used to identify an individual. The best example of that is germline variants. Some clinical data too: the clinical fields are not heavily populated, but if something geographically identified a very small population, it would probably be considered controlled access. The most common controlled-access human data in ICGC, though, is germline variants, which uniquely identify an individual. Open data is everything else: summaries of gene expression data, somatic mutations, and so forth; somatic mutations whether or not they come from exomes or whole genomes. In ICGC, about 10% of the genomic sequences are from whole genomes and 90% are from exomes, okay? About a 10-to-1 ratio; keep that in mind.

TCGA is basically the same idea, except that TCGA does not consider somatic mutations that were extracted from whole genomes to be open: those are deemed controlled access. The reason is that they think the software tools developed to extract somatic mutations from whole genomes are not 100% perfect, and that there's leakage of germline variants into that dataset; enough leakage, they think, that they make the so-called somatic mutation dataset from whole genomes controlled access only. (Question from the audience.) Yes, that's a very good question. Yes, absolutely, and that's being addressed; I'll come back to data quality later. It's a very big concern, a big issue, because we talked about pipelines with Trevor earlier today.
The problem right now is that all these countries, all these projects, use their own pipelines. There are no standard pipelines being used across ICGC, and if you run even the same pipeline in two different labs, you'll probably get different results.

Yes? (Question: do TCGA samples have matched normals?) Yes, they all do. TCGA and ICGC, and by that I mean the non-TCGA part of ICGC, all have tumour-normal pairs. The normal is usually blood for most tumour types, except in leukemias. The hard part with the liquid tumours is to find a body part that doesn't have the liquid tumour in it, from which to extract normal DNA. What they usually do is a skin biopsy, and they flush the blood out of it to minimize the amount of blood in the tissue from which they extract the normal DNA. For all the other tumour sites, say pancreas or whatever, you use blood as your normal. With the caveat, of course, of circulating tumour DNA, which is everywhere; but that's at a much lower level, so it's not as big a concern. It is very important, though, in your experimental design and in your description of how you're going to do things, to describe those things well.

Okay, yes? (Question: from TCGA, if we're looking for a dataset of a particular cancer tumour versus the paired normal, would the paired normal be considered controlled access?) Well, for the normal DNA, what are you looking for: the germline variants, or the reference sequence? If you're looking at the reads from a tumour or a normal, the reads themselves contain the germline variants. Basically, if you want to download the raw data, all the raw data is controlled access, because you can deduce the germline variants from it. All of that is controlled access, whether it's TCGA or ICGC, okay? So again: definitions of what is considered open access versus controlled access.

There's a file, for example, that ICGC (I'm not there anymore, so "they") make available to the community. It's a file that has all the somatic mutation variants of all the tumours, in one file. The file is about a gig compressed. I wrote 30 gigs here for the uncompressed size, and actually I guesstimated that number: I was travelling last week when I was updating this slide, and I started downloading this one-gig file in my hotel room and it took about three hours. Hotel Wi-Fi is not very good. So I guesstimated about 30 gigs uncompressed; it's actually about 20-some gigs. Still, it's a big file, and it has all the somatic mutations from TCGA and ICGC, the whole shebang, in one open-access file.

So, after what I told you earlier: what's not included in this file? No, there are no germline variants, that's true, but that's not what I'm referring to. This is all the somatic mutations from all the tumour types in all of ICGC, in one file. It's not going to include the somatic mutations from whole genomes from TCGA, right? It will have somatic mutations from whole genomes from ICGC, but not the ones from TCGA, because TCGA considers those controlled access, and this is an open file.
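Because it's open, you can script against that file directly. Here is a sketch of streaming it without ever uncompressing 20-plus gigs to disk; the file name and column name are placeholders, so check the release directory on the portal for the real ones:

```python
# Stream the aggregated somatic-mutation dump and tally mutations per
# project. The filename and "project_code" column are ASSUMED - check
# the actual release files on the portal.
import gzip
from collections import Counter

counts = Counter()
with gzip.open("simple_somatic_mutation.aggregated.tsv.gz", "rt") as fh:
    header = fh.readline().rstrip("\n").split("\t")
    col = header.index("project_code")
    for line in fh:
        counts[line.rstrip("\n").split("\t")[col]] += 1

for project, n in counts.most_common(10):
    print(project, n)
```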
And you don't even need to log in; you can just go to this website and download this file. It has all the somatic mutations, so it's good. Now, remember I told you whole genomes are about 10%; that's roughly true for both TCGA and the non-TCGA part. So the somatic mutations from the TCGA whole genomes are going to be missing from this file. If you're doing an analysis of how frequent a mutation is across all tumour types, you're going to be missing that slice for whatever tumour types were done by TCGA. If it's a tumour type done only by others, you'll have everything; but if it's a tumour type done by TCGA, you're going to be missing those.

Yes? Yes, in the file there is the tumour type, and there are lists of all the tumour types and where they're from, which countries and so forth. If it's from the US, then it's NIH rules, and the method from which the mutations were extracted, whether it's whole genome, is in there as well, so you'll be able to deduce that. All that metadata is present in the files. Sorry, one more question. (Did you say this applies only to whole genomes from TCGA?) If you have exomes, it's still okay: mutations from exomes are open; mutations from whole genomes are not open if they're from TCGA, otherwise they are open. So different organizations have different definitions of what is controlled versus open access.

There are a bunch of things that are similar between TCGA and ICGC. That file I mentioned, with all the somatic mutations minus the whole genomes from TCGA, ends up in COSMIC, which is a cancer database at the Sanger Institute (not the EBI, sorry): the Catalogue Of Somatic Mutations In Cancer. COSMIC is not part of ICGC; it's a separate project. They collect all somatic mutations in cancer, wherever they come from. Obviously ICGC and TCGA have been a good source for them, but they'll also take somatic mutations from a paper that sequenced one gene in 5,000 individuals and only scored the mutations in that one gene. So COSMIC has mutations from studies that are very targeted, not necessarily whole-genome analyses. You have to be careful when you look at the COSMIC data, because the raw numbers don't tell the whole truth: some genes, because they're purported or postulated to be involved in cancer, will be oversampled in COSMIC, right? That may be a good thing, because you'll have saturation of every position in TP53 across thousands and thousands of individuals and, let's say, 150-plus cancer types, so you'll have good saturation of that one gene. But there will be some genes that are only represented in whole-genome analyses and so forth. On the COSMIC pages, you can get a file with the same sort of content: where each mutation came from, how many times it was seen, and so forth.

So this is just to quickly introduce you to the Data Access Compliance Office, the DACO page. This is the third page I mentioned of the ICGC website, and this is what gives you access to controlled-access data. First you register on the website; once you're registered, it asks you if you want to fill out a controlled-access application. You fill out the form, press validate, it tells you which fields you forgot to fill out, you fill in those fields, it turns green, you're happy, and everything is hunky-dory.
At the end of this process you end up with a PDF file that you print, and it requires signatures from your institution. At OICR, for example, that would be the CEO or the deputy director; at a university it might be the VP Research, or whoever signs off on grants at your institution. What you're saying in this form is that you're not going to abuse the data, you're not going to try to re-identify the individuals, you're going to destroy the data after you finish your work, and so forth; and you're getting it signed by somebody who can fire you should you break any of the rules you just agreed to. So it's a legal document, and it gets submitted to the DACO office, which is actually a group of lawyers slash bioethicists slash genomics-legalese people working for ICGC, people who understand genomics and understand law. You're entering into a contract: they're giving you access to the controlled-access data because you're committing to do good things with it and not bad things.

So it's a very serious document, and it's a bit complicated. I dislike it because it reduces the number of eyeballs on the data and reduces the opportunity for people to make discoveries, but it's a necessary evil, because the patients who donated their tumors want it this way. They don't want to limit the eyeballs on the data, but they do want us to make sure the scientists are going to do good things with it, and only those things, and that nobody is going to try to do bad things with this data, okay? So it's important.

One of the IDs we use in DACO is an OpenID. If you have a Gmail address, that can be your OpenID; it uniquely identifies you, and you should have your own unique OpenID, not one shared with everybody in your lab, for example. Once you log in, your name shows up as logged in on the website, and you then have access to the data.

All the data submitted here is under a moratorium. There's a whole long legal document on this page which says, basically, that for the first year, if there are fewer than 100 tumors, you can't download the whole dataset and write a paper about it unless one has already been written; if there's already a paper about the dataset, it's available. I've summarized that legal document in this one figure: if a single genome is submitted, it sits there for two years before you can write a paper about it. If it's 100 genomes, you can write a paper about it one year later. If it's a mixture of these things, then once any batch has been 100 genomes for one year, you can write about it. And if it's published in less than a year, then once it's published you can write about it. All of this is to protect the authors who generated the data: it gives them a year or two, depending on how many genomes there are, to write their own paper.

So how do you know what's what? There are actually two pages, one for TCGA and one for ICGC (five minutes? thank you), that tell you whether or not there's a moratorium on a given dataset, so that's a good place to check.
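Purely as a mnemonic for the figure I just described, here is a toy encoding of that embargo logic. The real terms live in the legal document on the DACO page, so treat this as a cartoon of the rule of thumb, not as policy; all names and thresholds below are just my restatement of the summary above.

```python
# Toy mnemonic for the publication moratorium as summarized above:
# ~100+ genomes  => free to publish one year after submission,
# a single genome (or small set) => two years,
# and anything already published is immediately fair game.
# The legal document on the DACO page is the real authority.
from datetime import date, timedelta

def embargo_lifted(submitted: date, n_genomes: int,
                   already_published: bool, today: date) -> bool:
    if already_published:
        return True
    years = 1 if n_genomes >= 100 else 2
    return today >= submitted + timedelta(days=365 * years)

# 250 genomes submitted June 2015: publishable from roughly June 2016.
print(embargo_lifted(date(2015, 6, 1), n_genomes=250,
                     already_published=False, today=date(2016, 7, 1)))  # True
```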
Right now I would say there are very few datasets that still have moratoriums on them, because they've been there a long time and/or papers have been published about them, so you can download them, do your analysis, and write papers.

So the DACO, as I mentioned, is for ICGC; eRA Commons slash dbGaP is for TCGA. Then you have to go to the respective repositories. ICGC puts all its raw data at the EGA right now; that will probably change in the future, but right now that's what it is. Most of the raw data for TCGA used to be at CGHub and is now in Chicago, at the new TCGA host. The main things you get from those are the raw data and the germline variants themselves.

One last note: there's lots of documentation on the ICGC website, at docs.icgc.org, including a list of all the various places where the data is available. And there's a new TCGA portal that just opened recently at gdc.cancer.gov. If you use that portal, you'll see its user interface is very similar to the ICGC data portal, because the same software developers built both. That makes it more convenient for users who have used one and then go to the new one: they see a similar kind of interface and features in both.

As I end my talk right on time, let me point you to two things that I didn't talk about, which are really important; it turns out these two things are part of another workshop we give, two days on big data in genomics. One is PCAWG and the other is cloud computing for ICGC. If you go to the data portal at dcc.icgc.org, you'll see links to PCAWG and to the cloud computing resources, and we have whole separate workshops on how to start VMs, do computes, get data access, use Docker, and so on and so forth.

PCAWG is the Pan-Cancer Analysis of Whole Genomes. We took all the whole genomes from ICGC and did exactly the thing I told you we hadn't done: we put everything through the same pipeline. We took all the raw reads and remapped them all the same way: 2,800 whole genomes, across about seven or eight cloud infrastructures around the world, and it took almost a year and a half just to redo the realignment. We're talking about a petabyte of files. Then we pushed them through three separate variant-calling pipelines, one from the Broad, one from the Sanger, and one from DKFZ, and we integrated the results from all three. We did some validation, where we re-sequenced the calls to see which algorithm works better under which conditions, and then we ran everything through the same structural variant pipelines, copy number variation pipelines, transcriptomic pipelines and so forth, the same way for the whole dataset. And we discovered that this is the way it should have been done in the first place. So we did it, not quite as a pilot but as a separate project, PCAWG, and the PCAWG papers are being written right now; they'll be out probably this time next year, and that will probably become another part of the workshop, or maybe another workshop entirely, just to deal with that special dataset.
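To give a feel for that integration step, here is a minimal sketch of a majority-vote consensus across three callers' outputs. It assumes each pipeline's calls have already been reduced to (chrom, pos, ref, alt) tuples; the actual PCAWG merge was informed by the validation re-sequencing and is more sophisticated than a simple vote, so this is only a cartoon of the idea.

```python
# Minimal sketch of consensus calling across three pipelines:
# keep a variant if at least two of the three callers report it.
# Assumes each call set is a set of (chrom, pos, ref, alt) tuples;
# the real PCAWG merge weighted callers using validation data, so
# this simple majority vote only illustrates the principle.
from collections import Counter

def consensus(broad, sanger, dkfz, min_callers=2):
    votes = Counter()
    for callset in (broad, sanger, dkfz):
        votes.update(set(callset))  # each caller votes once per variant
    return {variant for variant, n in votes.items() if n >= min_callers}

# Hypothetical example coordinates, for illustration only:
broad  = {("17", 7578406, "C", "T"), ("12", 25398284, "C", "A")}
sanger = {("17", 7578406, "C", "T")}
dkfz   = {("17", 7578406, "C", "T"), ("12", 25398284, "C", "A")}
print(consensus(broad, sanger, dkfz))  # both variants pass the 2-of-3 vote
```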
And the cloud computing part is that a lot of the data is available on Amazon, on Azure from Microsoft, and in other private clouds like the Cancer Genome Collaboratory at OICR, which is an academic cloud we run here. And again, you have to be aware of the geopolitics. Who would think you'd see cancer and geopolitics in the same sentence? But, for example: TCGA data can only live in the US, it can't leave the country, so it can only be hosted at about two or three sites in the US, and you have to get special status to do that. Non-TCGA data can be hosted in several places: Ontario, the EBI, Barcelona, and a few others. So Amazon, whose data centres could be in Virginia or in Ireland or what have you, hosts the non-TCGA part of ICGC and PCAWG; that is, PCAWG minus TCGA, and also minus Germany, because the Germans don't want their data hosted by any American-owned company. They're angry at the Americans for having spied on them, listening in on Merkel's phone conversations several years ago, and they're still angry, so the Germans basically don't allow their data outside their country either; but then, neither does the US, right?

That said, the ICGC data on Amazon is different from the same dataset at the Collaboratory in Toronto: Toronto can have everything. If you go look at the Collaboratory website, you'll see the numbers are slightly larger than the numbers on Amazon, because we have the German data and they don't.

Anyway, all of that to say I'm not going to talk about that. And my final slide (oh, three minutes over, sorry, it's not so bad) is just a summary reminder of all the URLs. Thank you very much.