So, as I mentioned, my name is Francis Ouellette. I'm based in Montreal and do bioinformatics consulting, but for ten years I was here as associate director of bioinformatics and biocomputing, with a team of people involved in various aspects of pipelines and databases, working on the ICGC, which we'll talk about today. Now, all the workshops, as we mentioned at the beginning, have a Creative Commons license, CC BY-SA. "BY" means you have to say who you got it from, who the author is, so you have to cite them. "SA", ShareAlike, means that if you use material from these slides, you have to share your own work the same way, so it's a bit viral: you have to pick which Creative Commons license you want, and we chose ShareAlike to encourage people to share, which we think is really important. The license also allows you to edit things, which means you can take a slide, adapt it, modify it, take one slide out of the slide deck; you don't have to take the whole deck, and if you just need one slide, you can just borrow it. They're available for most lectures, not all, but mostly in PowerPoint and PDF, and in video as well. So I encourage people to record, tweet, film, blog, live-blog, whatever you want to do, but you should be paying attention to me during the lecture too. A disclaimer: I may mention some products and companies; some years I don't, some years I do, and I don't profit from any of these companies. I worked five years at NCBI, so I tend to promote NCBI resources over, let's say, EBI equivalents, and that's just because of my career path; I'm also a former OICR employee, so I tend to talk fondly of OICR projects. So this is my email, my Twitter handle, and the Twitter handle for the bioinformatics workshops, bioinfo.ca.
I sent a tweet out already this morning showing the first class and a picture of the group; please retweet and advertise. So today we're going to review databases and some visualization, but mostly we're going to get you out there and get you started, and hopefully round out the foundation that Michelle was just talking about. This workshop is Bioinformatics for Cancer Genomics; if you're in the wrong workshop, now's the time to leave. We looked at cancer genomics this morning, and we're going to look a bit more at bioinformatics this afternoon, starting with the basics of why we have bioinformatics at all. We have bioinformatics because of the open data that we have to deal with: if we didn't have open data, nobody would have had to build tools to address all this data. If everything had stayed in private data sets, we would not have seen the upsurge of tools developed over time that allowed all this to happen. So how do we define bioinformatics or computational biology? People define the two differently; I tend not to, I use the terms interchangeably. What we're going to do right now is think, pair and share: talk to the person sitting next to you (if there are three of you, the three of you can share together) and write down a definition of bioinformatics. Start talking, start sharing, start writing what you think bioinformatics is. No googling allowed; laptops down, don't look it up. You're all registered for a bioinformatics workshop, you should know what it is. I hope so. Okay, write it down now, because I'm going to end this exercise very soon. Okay, who's got an answer to share with the class? Yes, in the back. Okay, that's good. Anybody else?
Anyone with a different, slightly modified tweak on what you've heard so far? Okay, that's good. Anybody else? Okay, yeah: using technology of any kind, as long as the underlying question is biological. So would bioinformatics include monitoring bird migration through video? I think one could make the case. Somebody else? I'm sure there's somebody over here in the back row. No, just kidding. Okay, and you'd include metabolomics? Yeah, I guess, the omics in general. So here's my definition, and I wrote a textbook on it, so I've thought about it: bioinformatics is about integrating biological data with the help of computer tools and biological databases, and gaining new knowledge about the system under study. It's a scientific discipline, not just a technology, I think. And this is where computational biology and bioinformatics get tossed about: some say bioinformatics is more the technology and computational biology is more the science. I don't fully agree, but I can see the point. The important part about bioinformatics, which I'd like to carry forward, is that it's actually biology that's driving the exercise: we're asking biological questions and using computational tools to address those questions. We use computational tools to look at the data, and we're talking about big data now. So it's really important to keep that in mind. But all the definitions are good. And in computational biology and bioinformatics, there are three big areas.
There's the data, the actual data we work with; the tools we apply to this data; and the knowledge: how we capture it, how we transfer it to other people, how we share it and make progress, so that when we understand, say, the development of a new drug affecting a specific gene, we can tell the world and share that knowledge. That's a really important part, and open science is critical for all of it. We saw some of these logos this morning. The open source logo; open access, which deals with access to publications: even though many of us are at very rich universities with very good libraries that carry all sorts of journals, there are a lot of places in the world with equally brilliant scientists who don't have access to some of the journals we have access to at our universities. And we're not always within the context of having those library passwords: sometimes you graduate from a university and find yourself working in a small biotech that doesn't have access to any library, and you've lost access to a lot of journals. So open access publication is really important. Open data, which we talked about this morning, is also a really important aspect, with the restriction of controlled access and so forth; I'll come back to that a bit later. And bioinformatics.ca is actually part of GOBLET, the Global Organisation for Bioinformatics Learning, Education and Training, which basically encourages all the education material that we build and distribute to be made available. And it's not just bioinformatics.ca: if you go to the GOBLET website, you'll see lots of organizations worldwide that are members and make their material available. So, BLAST: how many of you have done a BLAST search before? Raise your hand high if you have. Don't be shy.
So BLAST is the Basic Local Alignment Search Tool. It allows you to search DNA or protein databases with a nucleotide or protein sequence. And it would not have been implemented if GenBank, and the Atlas, which was the protein database of the early days, didn't exist; there would have been no need to develop something like BLAST, or FASTA and FASTP, which were the predecessors of those tools. With GenBank being the nucleotide and protein sequence database of all publicly available sequences from all organisms, and that database being open, you need a way to search it. Sure, you can search with keywords and look for your favorite gene, but the way you really want to search GenBank is by sequence similarity, and BLAST allowed you to do that. And BLAST is actually more than open source: it's owned by the U.S. government, since NCBI, a U.S. government agency, owns the BLAST code, and they simply made it public. It's free for anybody to download. It's so free that you can download it, repackage it and sell it if you want; that's a totally legal thing to do. Because of that openness, BLAST has been remastered, redone and reused; the code base has been modified and used in a number of different ways. So when we're doing bioinformatics and running a BLAST search, we're actually doing an experiment. People don't think of it as a biological experiment, but it's really the same thing: we have a sequence, we do a BLAST search with it, we get a result, and we interpret the result. So it's important, when you do an experiment, to know what your reagents are.
Your query and the database you're searching against, those are your reagents, right? And your method is which kind of BLAST you're going to do: protein against protein, nucleotide against protein, by translation, and so forth. There are various types of BLAST searches, and knowing the method, which you'll report when you do your analysis and interpretation, is part of that. The interpretation of your alignment is your hypothesis test: is this gene that I sequenced similar to an existing gene for which information is known? You're testing that hypothesis. So you have to know your reagents, you have to know your methods, and you have to do your controls. What's an example of a control you would do in a BLAST search? Let me help you a little bit. One control could be a gene you know is in the database: whatever parameters you choose (and there are lots of parameters you can pick), you expect to find that gene, because you know it's there. If you do the BLAST search and don't find anything, that's a good control telling you that maybe you haven't configured your BLAST search well. You're expecting to see something, and if you don't see it, something is wrong; if you do see it, you're fine. And vice versa: if you expect a gene not to be present, and you do a search and find a bunch of things, then maybe your parameters are too loose and you're finding similarities against things that are not real. Those are bogus hits, a "hit" being what we call a match in a BLAST search.
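The positive/negative control logic just described can be sketched in a few lines. This is an illustrative sketch only: the accession IDs and the hit sets stand in for parsed BLAST output, they are not real results.

```python
# Hypothetical sketch of BLAST controls: a positive control (a gene we know
# is in the database) must be found, a negative control must not be.

def check_controls(hits, expected_present, expected_absent):
    """Return a list of warnings if the BLAST controls fail.

    hits: set of accession IDs reported by the search
    expected_present: positive controls (known to be in the database)
    expected_absent: negative controls (known NOT to be in the database)
    """
    warnings = []
    for acc in expected_present:
        if acc not in hits:
            warnings.append(f"positive control {acc} missing: "
                            "parameters may be too strict or misconfigured")
    for acc in expected_absent:
        if acc in hits:
            warnings.append(f"negative control {acc} found: "
                            "parameters may be too loose (bogus hits)")
    return warnings

# A run where both controls behave as expected produces no warnings:
ok = check_controls({"U12345.1", "X99999.1"}, ["U12345.1"], ["Z00000.1"])

# A run that misses the positive control produces one warning:
bad = check_controls({"X99999.1"}, ["U12345.1"], [])
```

In practice you would fill `hits` from the parsed output of your actual BLAST run, whatever tool and output format you use.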
So you have to think about these controls when you do experiments and when you submit data, and I'll come back to some of this later. So, databases are a way for us to organize information. A database is a place where you can put things in and, if all is well, you should be able to get them back out. That sounds a little silly, but I've made submissions to large databases and told people, oh yes, it's in the database, and they'd go make queries at that database and couldn't find it. It got lost, and it got lost for simple reasons: the tags used to classify the sequence belong in a specific slot, and when you search that database for that tag, if it's not in that slot, it simply doesn't get indexed. The tag may exist somewhere else in the file, in the XML or whatever, but if it's not in the right place, the record gets lost. We used to submit, help control and monitor all the data submitted for the ICGC to the EGA database. It's much better now, but there was a time when 30 or 40% of the sequences disappeared because they were submitted improperly, with the wrong tags in the wrong places. The way we found the data was to dump everything and then start grepping, looking for file names and so forth. It was all there, but in the wrong place. So it's a really important thing to get right, and it's a good check to do: if you submit something and the repository says, okay, we got it, thank you, then go back and see if you can retrieve it. If you can't, you know something is wrong. Ideally, a good database becomes a resource for other databases, right? So the RefSeq database at NCBI is the standard reference set for all organisms, including human, and it's used worldwide: even the EBI uses RefSeq, UCSC uses RefSeq, and so forth.
All the big databases point to each other, and so they have an API, an application programming interface, which allows you to write queries that produce answers from the data in your database. Being able to do that is really key, and it makes your database that much better. A database also simplifies the information space by specialization: you'll have cancer-specific databases, or databases that are yeast-specific. The Saccharomyces Genome Database: if you're a yeast person, that's the database you're going to use. I heard there's a yeast person somewhere over here, yes. SGD is used by all the yeast people because it's a repository of all the yeast gene names and a great many experimental results. And ideally, you can make discoveries in a database. What does that mean, making discoveries in a database? It means you find associations between records that the people who put the data in the database in the first place didn't think about: you do queries and pull things together that nobody had thought of putting together. And that's the really important thing. For that, it's important to understand the data model. What do we mean by data model? For example: what's the key identifier for a given record in that database? Is it a gene-focused database, a nucleotide-focused database, chromosome-focused, cancer-focused? Whatever the focus, the related entities and the relationships between the entities within a database are really important to understand. How do you find that information out? That's a very good question. Good databases will have published a paper explaining how the data is organized. My friends at the ICGC: the ICGC database has been out for about 10 years, and the paper just came out last year.
It finally came out; it took a while. Many lives ago I worked on a database of protein-protein interactions, and the first thing we did was publish the data model. We explained what the entities were and so forth. You publish it, it gets peer reviewed, and that's really important. Other databases like GenBank, which have yearly or bi-yearly publications, usually have a quite detailed data model on their website explaining how things are organized. But that's a very good question. So, there's metadata about data; metadata is data about data, basically: create date, update date, the submitter's ORCID. Does everybody know what an ORCID ID is? ORCID? Any ideas? It's a database of authors: a unique identifier for authors, for scientists, in the ORCID database. You should all have your ORCID ID memorized. No, tattooed. No, but you should all have one anyway, and if you don't, you can go to the ORCID website and they'll be happy to supply you one. Journals use them: I'm an editor at PLOS Computational Biology, and there all the editors have their ORCID IDs on the website, so you can go see what we publish, which is associated with our names. A publisher would be another example of metadata, a book title and so forth. The data itself could be a DNA sequence file, a COSMIC record (COSMIC we'll talk a bit more about later), a protein-protein interaction record, the title of a book, or the book itself. A storage system could be a box where I put everything: Oracle is a very expensive commercial relational database system; MySQL is also commercial now, but much less expensive; a binary file, a text file or a bookshelf would all be storage systems.
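The "discovery" idea from a moment ago, pulling together records that the submitters never put together, comes down to the data model: what the key identifier is, and how entities relate. Here is a toy sketch; every gene, interaction, and disease annotation in it is invented for illustration.

```python
# Toy data model: gene-focused entities keyed by gene symbol, plus a second
# entity type (interactions) keyed by gene pairs. A query joining the two
# surfaces an association that neither table states directly.

genes = {
    "TP53":  {"chromosome": "17", "disease": "Li-Fraumeni syndrome"},
    "BRCA1": {"chromosome": "17", "disease": "breast cancer"},
}

interactions = [
    ("TP53", "MDM2"),
    ("BRCA1", "BARD1"),
]

def diseases_linked_to(partner):
    """Diseases annotated on any gene that interacts with `partner`."""
    linked = [a if b == partner else b
              for a, b in interactions if partner in (a, b)]
    return [genes[g]["disease"] for g in linked if g in genes]

# MDM2 carries no disease annotation itself, but the join links it to the
# TP53 record through the interaction entity:
result = diseases_linked_to("MDM2")
```

A real database does the same thing with indexed tables and a query language instead of Python lists, but the principle, discovery through relationships between entities, is the same.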
A query system could be a list you look at to say, do I have this book in my bookshelf: a catalog, an index file. SQL, the Structured Query Language, is a special language for asking queries of relational databases. Elasticsearch is now used at larger scale, in things like Amazon and Hadoop environments. And grep is a Unix tool that lets you look for a string match in a file; I'm sure you all used grep last week when you were doing your pre-workshop assignments, and you'll use it again this week. An information system is the overarching system: the Library of Congress could be viewed as an information system; Google could be viewed as an information system. In our world, Entrez and Ensembl are both information systems: complex entities of information that work together. The UCSC Genome Browser and the ICGC are each an ecosphere of information whose parts work and track together and obey the rules of that complex. A place like NCBI will offer you a way to submit your data: if you sequence a gene of interest that's never been sequenced, or a new variant, you can submit it to a variant database, or a new gene would go to GenBank and the like. You can download, you can learn (they have lots of workshops), and they do lots of software development, analysis tools and ways to research. And if you go to the Entrez query page and put in the query all[filter], you will get the number of records in each of the major NCBI databases. This one is from last year, but you could do it this year and see how much each has grown if you run the search yourself right now.
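That all[filter] query can also be issued programmatically, through NCBI's Entrez E-utilities, which is the API side of the information system just described. The sketch below only builds the request URL (actually fetching it needs network access); the endpoint and parameter names are the documented esearch ones.

```python
# Build an Entrez esearch URL that counts every record in a database.
# all[filter] matches all records; rettype=count asks for only the count.
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_count_url(db, term="all[filter]"):
    """Return the esearch URL that would report the record count for `db`."""
    return EUTILS + "?" + urlencode({"db": db,
                                     "term": term,
                                     "rettype": "count"})

url = esearch_count_url("nucleotide")
# Fetching this URL (e.g. with urllib.request) returns XML containing
# the current number of nucleotide records.
```

Running the same query against each major database, and again a year later, gives you exactly the growth comparison mentioned above.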
So, things offered at NCBI: from the literature point of view, PubMed and PubMed Central, where PubMed Central has the full-text articles of open access journals. From a health point of view, things like dbGaP and ClinVar; on the genomic side, whole genome sequences and RefSeq; on the protein side, protein sequences and 3D structures; and for chemicals they have PubChem and BioSystems and so forth. So they offer quite a bit of the kind of material we use. Now, formats. We talked a little bit about FASTQ this morning, and tomorrow we'll get into even more detail on the specific tags used in a FASTQ file, but FASTQ files are what comes out of a DNA sequencer. There's the actual base call, and how good we feel about the quality of that base call: for each base there's a value for the quality of that call. In a GenBank file, or a FASTA file, which is just the nucleotide or protein sequence from a GenBank record, you do not have that. In a GenBank record you assume that every nucleotide is correct, 100%, that it's the truth, basically. And it often is, because a GenBank record is actually the result of multiple sequencing and control experiments; it's the consensus from a large-scale alignment. And if there's uncertainty, it's marked in the sequence itself, as an N or one of the other ambiguity codes. Then, and we'll deal with this as well this week, there are SAM and BAM. Those are alignment files: when Trevor was talking about 40X coverage, the file we were looking at was a BAM file, which started life as a SAM file. BAM is just a binary version of a SAM file. SAM is a text file, a human-readable file.
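The per-base quality value in a FASTQ record is encoded right in the text: each character on the fourth line is a Phred score offset by 33 (the common Sanger/Illumina 1.8+ convention). The record below is invented to show the decoding.

```python
# Decode FASTQ per-base qualities. A FASTQ record is four lines:
# identifier, base calls, separator, and one quality character per base.

def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into per-base Phred scores."""
    return [ord(ch) - offset for ch in quality_line]

fastq_record = [
    "@read1",   # line 1: read identifier
    "ATGC",     # line 2: the base calls
    "+",        # line 3: separator
    "IIA#",     # line 4: quality characters, one per base
]

scores = phred_scores(fastq_record[3])
# 'I' decodes to 40 (high confidence), '#' to 2 (very low confidence)
```

This is exactly the information a GenBank or FASTA record lacks: there, every base is presented as certain, with ambiguity expressed only through codes like N in the sequence itself.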
BAM files, because they're binary, can only be read by software tools, which is some of what we'll be doing this week as well. And then VCF files show the variants in the data you're looking at. So now you're starting to get into the weeds of the other file formats. As Trevor talked about, in the pipelines we'll start with a FASTQ file, it gets transformed into a SAM file, then into a BAM file and so forth, and then VCF, and there are many, many more. If you have any questions about all the various file formats: UCSC, the University of California, Santa Cruz, is the home of the very famous UCSC Genome Browser, probably one of the best generic genome browsers available out there. They've also been around for a long time and have accumulated a lot of very useful knowledge, one piece of which is a page listing all the various file formats: you click on each one and get a description of that format. Now, I separate databases into two categories. One is primary, which tends to become archival; the other I call secondary, or curated. The secondary, curated databases are often the result of human curation, where there's value added above the archival primary data. And the line between the two is quite fuzzy. GenBank, for example: people submit sequences to GenBank and they show up in GenBank, but they actually do get curated. I used to be in charge of GenBank when I was at NCBI; I had 25 people working for me full time, curating all of GenBank. And GenBank is part of a collaboration with the ENA, the European Nucleotide Archive, and DDBJ, the DNA Data Bank of Japan; the three work together.
So any record in GenBank could have come from Europe or Japan. I'd say the ratio is roughly 50-some percent from GenBank, 30-some from ENA, and the remainder, 15 or so, from DDBJ, and it varies from category to category. The Short Read Archive is another example of a primary database. UniProt is the protein database at the EBI. PubMed is publications, also at NCBI, along with PMC, PubMed Central. IntAct is a protein-protein interaction database; there are dozens, maybe not hundreds, but between 10 and 100 interaction databases out there, some of them specialized. IntAct covers all interactions, mostly protein-protein, but some protein-RNA as well. ICGC is hosted in part here and in part at the EBI; we'll come back to that later. And the EGA is the European Genome-phenome Archive, the controlled-access archive at the EBI; it has a funny title because the acronym is missing a letter. Then RefSeq, which I mentioned: a RefSeq entry is like the best record from GenBank, given the RefSeq label. Everything in GenBank is owned by the submitter, right? If I submit something to GenBank, it will belong to Francis for the rest of my life. RefSeq, on the other hand, the whole database, is owned by NCBI. What they did is go into GenBank, pick their favorite record for each gene, and make it their own: they made a copy, they referenced where they got it from, but then said, this is our copy. And they make it the reference: the best reference chromosome, the best reference mRNA, the best reference protein, the best annotation and so forth for that gene. Some of it is taken, borrowed, shared from what's in GenBank, but the rest comes from the curators at NCBI.
The Gene database at NCBI is probably the best repository out there for all the information about genes. Taxonomy, as you can imagine, is about taxonomic classification; there are about half a million different taxa represented among nucleotide and protein sequences, and the NCBI taxonomy is the repository that all other databases use: they all use the same taxon IDs, the same taxonomy. We only care about one taxon in this course, Homo sapiens, but there are many other taxa out there. OMIM: has anybody ever used OMIM? Does anyone know what it stands for? Online Mendelian Inheritance in Man. It's basically a database that goes across the chromosomes and tries to assign a disease, a function, to every gene in the human genome. As you can imagine, it's evolved quite a bit with the sequencing of the human genome, but there are still lots of stretches of DNA for which we don't have any associated disease, cancer or otherwise. It's still maintained and still used quite extensively. The MODs, the model organism databases, are also very important. Why are the MODs important? Do I have a slide? Yes. You have FlyBase for Drosophila, MGI for mouse, RGD for rat, SGD for Saccharomyces (yeast), WormBase for C. elegans, ZFIN for zebrafish. And on top of that you have the Gene Ontology, which basically tries to describe every gene, function and product in an organized, harmonized way, through an ontology, so that if you assign a specific activity, an enzymatic activity for example, in one organism, the same enzymatic activity in a different organism will get the same GO term. And so the Gene Ontology, or GO, has been used to ascribe gene product descriptions.
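The idea behind shared GO terms can be sketched as a tiny annotation-transfer step: a function measured experimentally in a model organism is projected onto its human ortholog via the shared term. The gene names, the ortholog pairing, and the GO assignment below are illustrative stand-ins, not curated data.

```python
# Hedged sketch of GO annotation transfer across organisms. All mappings
# here are invented for illustration.

go_annotations = {
    # (organism, gene) -> GO terms assigned from experiments in that organism
    ("yeast", "CDC28"): {"GO:0004693"},   # a kinase-activity term, say
}

orthologs = {
    # model-organism gene -> its human ortholog
    ("yeast", "CDC28"): ("human", "CDK1"),
}

def project_annotations(go_annotations, orthologs):
    """Copy experimentally derived GO terms from a model-organism gene
    onto its human ortholog."""
    projected = {}
    for source, target in orthologs.items():
        if source in go_annotations:
            projected[target] = set(go_annotations[source])
    return projected

human_annotations = project_annotations(go_annotations, orthologs)
```

Real pipelines track evidence codes so that an inferred annotation is never confused with a directly measured one, but the core move, same GO term, different organism, is what this shows.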
All of these MODs are critical to our understanding of the human genome, because through similarity searches, interaction databases, pathway sharing and so forth, we can take a gene from a model organism, where it's a lot easier to do experiments: deletion experiments, fusion experiments, modifying the active sites and things like that, do the assays and tests in the model organism, and then project that information onto the human genome. That's how most human genes have been interpreted: through experiments done in model organisms. Sometimes you have to do it in mice or another mammal, but a lot of yeast genes actually have orthologous functions in humans, and some really classic experiments have been done in the yeast genome; you can tell I'm a former yeast person. A lot of the function of human genes has been worked out through work in yeast. Now, once a year the journal Nucleic Acids Research publishes a database issue, and in this issue you'll find the crème de la crème of all the databases in the world. Many of the top-tier databases are invited every year: NCBI, EBI, the protein structure databases, the RNA databases, Pfam, Rfam, all those are invited to submit an article. They're usually short, about three or four pages long. But many other databases want to be included in this issue, and the issue has become quite big; it used to be quite big in print, but now it's electronic only. It used to be that once you were in one year, you were automatically re-invited the following year if you were still active, but that grew too big. So what they've done now is that only a small handful of databases are invited every year.
All the others are invited every second year, so you have to look at two years' worth of the journal to see the full database set. But if you look at the last two years together, you have the current crème de la crème: not every database, but the top ones being used in the community. And the editors are pretty strict about inclusion and about whether enough people actually use the resource. It's one thing to publish a paper to get a database described; it's another thing to get people to use it, and the ones in this journal are heavily used. I actually work on another journal from the same publisher, Oxford University Press (OUP), called Database. To the NAR database issue you can submit once a year; to our journal you can submit any time of the year, FYI. Now, let's quickly switch and look at a file format. It's not that crucial for the work you're going to do this week to understand the GenBank flat file format in detail, but it's important to understand in general how things are organized in a record, and I'll take the GenBank flat file as the example. In the GenBank flat file you have a header, with the title and the citation, which are things that affect the whole record; the citation is not attached to just part of the nucleotide sequence, it applies to the whole thing. Then you have specific things within the record, and these are the features, where I say: this part codes for the RNA, this part is a promoter, this part codes for this and that. So there you'll have different features that affect different parts, although the first feature, the source, is actually attributed to the whole record. Then you have the actual DNA sequence, and as I mentioned before, this DNA sequence has no quality scores or anything like that.
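That header / features / sequence layout can be made concrete with a few lines of parsing. The record below is a heavily trimmed, invented example written in the GenBank flat-file style (keyword at the start of the line, FEATURES section, sequence after ORIGIN, record terminated by //).

```python
# Split a GenBank-style flat file into its three parts: header lines that
# apply to the whole record, feature lines that apply to parts of the
# sequence, and the sequence itself.

record = """\
LOCUS       U12345    1200 bp    DNA
DEFINITION  Homo sapiens example gene, complete cds.
FEATURES             Location/Qualifiers
     source          1..1200
     CDS             100..1100
ORIGIN
        1 atgcgtacgt
//"""

def split_genbank(text):
    """Return (header, features, sequence) line lists from flat-file text."""
    header, features, sequence = [], [], []
    section = header
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            section = features          # feature table starts here
        elif line.startswith("ORIGIN"):
            section = sequence          # sequence lines follow ORIGIN
            continue
        elif line.startswith("//"):
            break                       # end-of-record marker
        section.append(line)
    return header, features, sequence

header, features, sequence = split_genbank(record)
```

A real parser handles continuation lines, qualifiers, and many more keywords (libraries like Biopython do this for you), but the three-part structure is the point here.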
In GenBank it's all assumed: the sequence is assumed to be of one uniform quality. And the GenBank flat file, you may not think so, but it's actually meant to be a human-readable format. There are many other formats for looking at sequences, and looking at specific genes and specific records one at a time is almost old-fashioned; I'm aging myself talking about it. Nowadays the way to look at a genome is through a browser, a genome browser, where you see little arrows where the genes are and so forth: a much more macroscopic view than the individual nucleotide or gene level. But GenBank is still receiving lots of submissions from many organisms, and there are many things you can submit: genomic sequences, transcriptomic sequences. You cannot submit protein sequences; protein sequences are always deduced from a nucleotide sequence. There are very, very few protein-sequence-only submission databases; there you're entering more into the mass-spec type of resources, and NCBI does not do that right now. Now, quickly, the accession number space. This is somewhat historical, but there's one important thing to realize: it's expanding. They basically ran out of space with the way they first thought about it; the original model said an accession number would be one letter and five digits, like U12345. (I'm running out of time, 15 minutes; I'm always running out of time; I removed lots of slides.) But now the important part is the dot version. You'll have U12345.1, and what that means is that's version one of that record. Dot two means, and this is part of the model, that the nucleotide sequence was changed, not the annotations. The record may gain 20 other annotations, but if the sequence hasn't changed, it's still dot one. If you change the sequence by one nucleotide, it becomes dot two.
If you change the DNA sequence by 10,000 nucleotides in one shot, it becomes dot two as well. The increment of the version doesn't tell you how big the change was; it just tells you there was a change. And why is that important? When we look at the whole genome, we know exactly where we are: we know the assembly, we know the exact position of a gene, and that gene is there in that version of that assembly. If the assembly changes, it gets a new version as well. For assemblies we talk about the version, GRCh38 for example, and the patch level; patch 22 would mean there have been 22 patches. A patch changes the sequence within a region but does not change the coordinates outside that region. So a patch might change all the letters within one gene, but it will not change where the gene starts and where it ends; that stays fixed, and the changes are contained within the patch. RefSeq, I mentioned, also has accession numbers, but they look different from all the others. It's not a letter and then numbers: it's two letters, N for NCBI, then a letter such as C, G, M, R, or P, then an underscore and a number. So if you see an N, a letter, an underscore, and a number, you know you're dealing with a RefSeq accession number, and they have dot versions as well. All of that is described in the NCBI paper. These resources all have genome browsers: UCSC is probably the most widely used one, Ensembl is heavily used, and NCBI's Map Viewer, for some reason nobody likes it, but that's that. And of course the ICGC, which I'll talk about now, also has its genome browser. For information about a gene, the Gene database at Entrez is the best source.
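The RefSeq naming convention just mentioned (N, a second letter, an underscore, digits, optional dot version) is easy to recognize mechanically. This sketch covers only the prefixes named in the talk; RefSeq defines others too (NT_, NW_, XM_, and so on):

```python
import re

# Sketch of spotting RefSeq-style accessions among classic GenBank-style
# ones. Only the prefixes named in the talk (NC_, NG_, NM_, NR_, NP_)
# are matched here; the real RefSeq prefix set is larger.

REFSEQ_RE = re.compile(r"^N[CGMRP]_\d+(\.\d+)?$")

def is_refseq(acc):
    """True if the accession follows the RefSeq two-letter_number pattern."""
    return bool(REFSEQ_RE.match(acc))

print(is_refseq("NM_000546.6"))  # True  (RefSeq mRNA-style accession)
print(is_refseq("U12345.1"))     # False (classic GenBank style)
```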
And a lot of resources, like the ICGC, will point to that resource. So I mentioned assemblies for the human genome, and actually for many of the model organisms. Right now we're on hg38, which is GRCh38, the most recent, from December 2013. So that's important to know if you're doing any query on the human genome: which reference genome is this database using? That way you know where you are in the world; it's like knowing which planet you're on. A historical perspective: we had expressed sequence tags in the 90s; genome mapping and sequencing, again late 90s; population analysis and polymorphism, all the SNP databases; genome-wide association (GWAS) studies; and then the infamous Homer paper that Mark referred to, which basically closed everything down and made everything private and under controlled access, minimizing usage, in my mind. The example he talked about is that from very few SNPs you can know whether someone took part in a study as a control or a case. The study the Homer paper referred to was a study on schizophrenia, so if you had someone's SNPs, you knew not only that they were part of that study, but whether they were in the control group or the schizophrenia group. And people didn't want that information out there. I don't know why, but anyway. Then we had a cancer genome analysis pilot, and the 1000 Genomes Project; those are all open genomes, actually, and they were all consented that way. The Cancer Genome Project and the ICGC came along in the late 2000s. There were a number of large-scale cancer genome analyses sequencing several genes, not all genes, but many genes, some of them quite deeply, in many patients.
And they were seeing that some things were happening: many mutations, not always in the same gene, but in similar pathways for certain types of cancer. So there were hints of something quite interesting, and they decided to do a much larger project and start the ICGC. The ICGC project was to collect 500 tumor-normal pairs from each of 50 different cancer types. That's 25,000 tumors and 25,000 controls, so 50,000 genomes sequenced for this project, and the idea was to do it in 10 years. Initially they were hoping to get genome, transcriptome, methylome, and clinical data for every one of those, but it ended up being only a subset. They got clinical data on all of them, though some of it not very deep. Methylome was maybe 20-some percent; transcriptome about 30 to 40 percent; and they all had genome data, but only about 10 to 15 percent were whole genomes, and the rest were exomes. So coverage was partial. The idea is that if you look at 500 of a certain tumor type, you'll see the common mutations in that tumor type. And here is where the number 500 comes from: the goal was to see events that happen in at least 1% of cases. If an event occurs in 1% of cases and you sample 500, you should see it at least once, at least 19 times out of 20. The big thing they didn't think about when planning this experiment is this: if I collect 500 pancreatic cancers, do I have one type of pancreatic cancer in my population, or do I have subtypes? We now know this for breast cancer, for instance: breast cancer is 10 subtypes or more. So you would need 500 of each subtype to see events that are at 1% or less.
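The "why 500?" arithmetic can be checked directly: the chance of seeing, at least once, an event present in a fraction p of cases when you sample n of them is 1 - (1 - p)^n.

```python
# Back-of-envelope check of the sampling argument: probability of
# observing at least once an event present in a fraction p of cases,
# when sampling n cases, is 1 - (1 - p)^n.

def p_at_least_once(p, n):
    return 1 - (1 - p) ** n

# A 1% event in 500 donors is seen at least once ~99% of the time;
# with only 50 donors the odds drop below 40%.
print(p_at_least_once(0.01, 500))
print(p_at_least_once(0.01, 50))
```

So 500 donors comfortably clears the speaker's "19 times out of 20" bar for 1% events, which is exactly why splitting a cohort across 10 subtypes breaks the calculation.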
If you now have 10 subtypes and you've collected only 500, you're only going to see events that happen 10, 15, 20% of the time, so you're going to be missing a lot of things at that rate. Some cancers did get there; breast cancer is a good example, where there were many breast cancer projects, so they have 4,000 or 5,000 breast cancers, and there you can see the subtypes, and some mutations are actually subtype-specific. Yes, a question? The question, roughly: you're looking for persistent patterns of mutation in a given cancer, so do we conclude, for example, that a mutation causes that cancer? Well, it's involved; the causal part is another step you have to do in your analysis. What you can do in cancer genome analysis is look at signatures. You can find patterns of mutations, which can give you insights. For example, in skin cancer you can see a UV-light signature: UV light hits the genome a certain way and you see certain transitions that are specific to cancers caused by UV light. So you can see those things, and there you can do cause and effect. In lung cancer you can see a cigarette signature as well. But then you have lung cancer in non-smokers, which is another type of cancer, and of course it has a different signature, without the smoking-associated mutations. Does the cancer consist of active mutations? Yes. And so, again, we can talk a little bit more about it, but yes, that's the idea. This project is humongous; no one country could do it, although one country did try to do it all and didn't quite manage. And so it's international.
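The signature idea from that exchange can be sketched by tallying substitutions into the six pyrimidine-centred classes. Real signature analysis (the COSMIC signatures, for example) also uses the flanking bases for 96 trinucleotide contexts; this toy, with a made-up call set, skips that.

```python
# Sketch of the mutation-signature idea: tally substitutions into the six
# pyrimidine-centred classes (C>A, C>G, C>T, T>A, T>C, T>G). A UV-heavy
# tumour shows an excess of C>T. Full signature analysis also uses the
# flanking bases (96 trinucleotide contexts), omitted here.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def substitution_class(ref, alt):
    """Fold a substitution onto the pyrimidine strand, e.g. G>A -> C>T."""
    if ref in "GA":  # purine ref: report the complementary-strand change
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"

def spectrum(mutations):
    """Count each substitution class in a list of (ref, alt) pairs."""
    counts = {}
    for ref, alt in mutations:
        cls = substitution_class(ref, alt)
        counts[cls] = counts.get(cls, 0) + 1
    return counts

# Toy call set, mostly C>T (and its strand-equivalent G>A), as UV might leave.
toy = [("C", "T"), ("G", "A"), ("C", "T"), ("T", "G"), ("G", "A")]
print(spectrum(toy))  # {'C>T': 4, 'T>G': 1}
```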
It was led from here, out of OICR, by Tom Hudson, who was the director at the time, and it involved a large number of people from all over the world. It allowed us to set standards, to do controlled-access data in a centralized way, and to standardize the production of data and so forth: tumor information, patient information, samples and how to collect them, sequencing, standard formats, analysis, and so on. This slide is from May 2018; the project is basically finished. It involved a large number of countries, but you can see on the left side that the NIH, through NCI and NHGRI, funded almost half the projects in ICGC, and most of those are TCGA projects. So TCGA is part of ICGC; out of the 50 projects, I would say it's a bit less than half, but not much less. Canada had people in BC and Toronto involved; there's Brazil; every continent is represented except Africa, unless you count Saudi Arabia as, sort of, Northern Africa. Most countries in Europe were involved, plus Asia, with China, Japan, and Korea, and Australia, and so forth. We did a meeting every nine months in one of these countries, so it was a lot of traveling. And this is the growth curve; it's a classic, it could be the growth of GenBank or the growth of ICGC. (About five minutes left.) All this data is hosted at ICGC, in a portal, with controlled access managed through DACO, the Data Access Compliance Office; that's how you get access to the data. Then there's the data portal, which is cancer-project focused, so you can look at what a cancer project has found. There's a project entity page, so all the breast cancer projects can be found together; a gene entity page, with all the mutations for a given gene; and a Reactome pathway entity page, which we'll talk about later this week.
Robin will tell us more about that. Mutations: every mutation in ICGC is tracked, and you can do complicated queries, this tumor type, in RNA and genomic data, this tissue and not that tissue, this sex and not that sex, and so forth. There are ways to do data analysis, and there are data repositories where you can go get the actual data; there are many places where the data is kept, and it's all maintained there. The main ones we talked about are TCGA and ICGC, and they come together for all the open data. The mutations are all open data, right? You don't need controlled access to get access to mutations. The BAM files, which carry all the genetic information, the controlled-access, germline information, are kept at the EGA in Europe, and for TCGA are now kept in Chicago in the US. TCGA, I mentioned, is part of ICGC. They differ in tumor types and so forth, and, as Mark mentioned earlier, what's open and what's closed is slightly different between them. The big difference is that under NIH open access, a mutation that comes from a whole genome is not considered open data, while a mutation that comes from a whole exome is considered open data. The reasoning is that whole genomes leak out too many germline variants by accident, and therefore that data is controlled access. So if you're looking at data sets and thinking, there's lots of stuff from Europe and Canada but nothing from the US, it may be because you're looking at an open data set of whole-genome data, which would not have any US data in it. You have to keep those things in mind; you have to understand the data model.
And all the mutations end up in the COSMIC database, maintained by the Sanger Institute in the UK, which is different from the EBI next door. Mark had these slides: basically, for controlled access, you fill out the form with who you are; if you try to submit the form and you haven't filled it out properly, it all shows up in red, oops, I got it wrong, and when it's filled in properly I get a little happy face. You submit it and sign it, and then you get it signed by somebody who can fire you. The idea is that you're agreeing to all the things Mark talked about, and should you digress and do something you're not supposed to, then your VP Research at your university, or your big boss, will get a phone call from somebody at ICGC, from McGill probably, saying that you haven't done things properly. There was a publication embargo, but that's not relevant anymore; everything has been online for so many years that there's no more embargo on it now. All the ICGC data is free for you to publish on, and it's available from two different archives. So we're dealing with big data, and big data evolves over time. At the top of the slide there is a 5-megabyte hard drive being loaded onto an airplane, and we have 5-megabyte mail attachments now, right? Below is a 10-terabyte hard drive which fits on my desk. To age myself: 10 years ago I got a CFI grant for a one-terabyte hard drive, and it cost me a quarter of a million dollars and was the size of a fridge.
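The "take the compute to the data" point that follows can be made with simple arithmetic; the 1 Gbit/s link speed below is just an assumed example.

```python
# Rough arithmetic behind "take the compute to the data": at an assumed
# 1 Gbit/s link, moving a petabyte takes about three months, while a
# few-gigabyte software image moves in under a minute.

def transfer_seconds(size_bytes, link_bits_per_sec):
    """Time to move size_bytes over a link of the given speed."""
    return size_bytes * 8 / link_bits_per_sec

PETABYTE = 10 ** 15
GBIT = 10 ** 9

print(transfer_seconds(PETABYTE, GBIT) / 86400)  # ~92.6 days for the data
print(transfer_seconds(5 * 10 ** 9, GBIT))       # 40.0 s for a 5 GB image
```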
So the big thing we talked about today is that the data sets are getting so big, petabyte scale, that we have to go to cloud computing: you have to take your compute to where your data is. You can't download it; it's no longer easy to download to your laptop, it's just too big. You have to do the work in the cloud, because it's a lot easier to transfer the software than it is to transfer the whole data set. There are lots of documents available online. PCAWG, which I won't have time to talk about in detail: quickly, the PCAWG analysis was an analysis of 2,800 tumor-normal whole-genome pairs. So it's a whole-genome analysis, a separate data set that exists in a number of places as PCAWG; if you see PCAWG, you know what that is. I invite you to go look at the many PCAWG papers; there are 81 or so right now under that search term. Basically all the PCAWG projects have been published around bioRxiv and are available to you, and they're all coming out in Nature and the Nature Publishing Group journals and so forth. There's a description of PCAWG on the ICGC website. There's a great portal; the portal for TCGA was actually built by the same team that built the portal for ICGC, so if you can maneuver the ICGC portal, you'll be very good at maneuvering the TCGA portal and the cancer data portal at NIH, the GDC, which is in Chicago but is NIH-funded cancer data. The URLs are all here. The Cancer Genome Collaboratory is another project out of OICR, which will probably be in the workshop next year; just to tell you, it's coming. And so the challenge: open data is good, but controlled-access data means not enough eyeballs on the data, and eyeballs on the data are needed to make discoveries. So we need to increase the culture of sharing openly, across public funding agencies, consortia, mentors, peers, the new generation versus my generation, which is not very open, I would say,
except for me, and it has to become the norm. Final thoughts: access to data is essential for science. Getting data that is FAIR (findable, accessible, interoperable, reusable) is hard work; I didn't talk about FAIR data, but it's important. It's essential to share the work you do if you want to be recognized, get tenure, get a job, get promoted, and so forth. Human data is more complicated, granted, but we have to find ways to make it more open and more usable, and there are ways to think about that and discuss it. There's a lot of material out there; I learned from it, and cite your sources. My last message to students, young PDFs (postdoctoral fellows), and new investigators: be open, so people can see how great you are. Thank you.