So, today, as with all lectures at the CBW, we're under the Creative Commons license that Anne talked about this morning. It's CC BY-SA, Attribution ShareAlike, and a good way of explaining it is that it's sort of viral: if you take a slide from this deck, your slide deck is infected and you have to share your stuff too. That's the way to think about it. And you have to acknowledge where you got it from. In addition, I encourage tweeting, cameras, pictures; I'll take selfies with anyone. I'm very Canadian that way. So this module is about databases and how central they are to the work we do in cancer genomics and cancer bioinformatics. To be clear, a lot of what I'm going to talk about today may be different next week, and may well be different next year; things evolve. Some of it will overlap with what speakers said this morning, and with what speakers will say during the rest of the week. Repetition is a very useful educational tool, so we encourage it amongst our faculty, and we'll be doing a lot of it. Of course, this is only a one-week workshop, so we can't cover everything, and some topics you may find we don't cover deeply enough or in enough detail; I encourage you to contact us or use some of the resources Anne talked about this morning, Biostars and so forth. As a disclaimer: I don't profit from anything I'm going to mention. For example, I don't have any stock in Amazon, though maybe I should have. So for any brands or products I may mention, I don't benefit from them. I used to work at NCBI many years ago; it's an institution I liked very much, and I very much encourage and support their products, and have over the years, and I'm not sorry to do that. Likewise, I was an OICR employee in Toronto for 10 years.
I was associate director of informatics and bio-computing there, working with Lincoln Stein and a bunch of other PIs, including Jared and Yuri, whom you're going to see this week. So I'm going to talk about some of the things people at OICR have done, and again I'm very happy to do that. As I mentioned this morning, I've now moved to Genome Quebec. For those in the U.S. who don't know the Genome Canada enterprise, it's sort of like going to the other side: like going to work for an NIH-style funding program when you've been a scientist your whole life. So now I'm VP of scientific affairs, which means I'm a scientist, still very much involved with the science that's going on, but our activities are mostly in supporting scientists and helping them get money to do genomics and bioinformatics work. That's what I do at Genome Quebec now. I'm on Twitter and I encourage you to follow me there; that's also where you'll find the Twitter handle for the bioinformatics workshops. So the objective is to review databases, some of the main ones, not all of course, that are used in cancer genomics and cancer bioinformatics. After my lecture we're going to talk about R, and we'll cover some visualization tools, so I'll talk a little bit about that as well, and about how to get access to the data you need in the computational biology of cancer genomics. So why do we have bioinformatics? The main reason is that we have open data and genomics and proteomics technologies. If we didn't have data, we wouldn't need tools. The data is driving the tools, and the tools and the data have been driven by the technology.
Illumina sequencing technology, microscopy, all of these technologies generate data that have led to the development of the tools we have today, and therefore the software we need to access those data. BLAST is a good example. How many of you use BLAST? Hopefully most of you. And how many of you have heard of BLAST? So BLAST is a sequence searching tool: you query with a sequence and you find similar sequences in databases. But BLAST itself was invented because there was a need to search sequences against sequence databases. The protein and GenBank databases, which are open databases, led to the development of tools like BLAST. If it wasn't for GenBank, BLAST would not have been invented. And actually, BLAST was invented at the same place where GenBank used to live, at the NCBI. So these things came along together. Many people in bioinformatics don't think of it this way, but when we use software and databases in bioinformatics, we're doing an experiment: we're testing a hypothesis. For example, you have a sequence and a database, which are like your reagents, the things you query with and against. You have your method, which in this case is BLAST. You do a BLAST search, and there are various types of BLAST: protein against protein, nucleotide against protein, or protein against nucleotide; those are various flavors of the same basic BLAST algorithm. Then you get your results, and you have to interpret them; this is where you do your hypothesis testing. So you have to know your reagents, you have to know the tools and the methods, and you have to think about doing your controls. So, what's an example? Or rather, I'm going to ask you a question now. Be ready.
What's an example of a control you can do in a BLAST search? You know what controls are in a lab. Right: you could take a random nucleotide or random protein sequence and search with it, and there you don't expect to find anything. Sometimes you do find something, and that's a little misleading, but that's a good control. Another control would be something you know is in the database: you search for that specific protein and you expect to find it, and if you don't, maybe your parameters are wrong. So you should do controls for things you expect to find, and for things you don't expect to find and should not find, and occasionally you find them when you're not supposed to. Anyway, the point is that when you do a computational experiment, those of you from a wet lab can think about the same types of controls you'd use there. And this goes not just for BLAST, but for any of the many tools we're going to talk about this week. Now for the big picture, and I know I'm starting you at square one here, but I'm happy to do that: how do you define bioinformatics or computational biology? This is going to be a think-pair-share. That means you're going to talk to the person next to you, chat for one minute, and write together one definition of what you think bioinformatics and/or computational biology is. We could get into debates about bioinformatics and computational biology being different things or the same thing; in my books, they're the same thing, so I don't want to go into that kind of distinction. But how would you define computational biology? Talk to the person next to you. Pair up, discuss. And yes, you have to actually talk. And write it down.
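To make the control idea concrete, here is a minimal sketch, with a naive substring search standing in for BLAST and a made-up two-record "database"; the sequences and names are invented for illustration only.

```python
import random

# Toy sequence "database": these records are the reagents of our experiment.
database = {
    "recA_fragment": "ATGGCTATCGACGAAAACAAACAGAAAGCGTTGGCG",
    "lacZ_fragment": "ATGACCATGATTACGGATTCACTGGCCGTCGTTTTA",
}

def search(query, db):
    """Naive stand-in for BLAST: report the records containing the query."""
    return [name for name, seq in db.items() if query in seq]

# Positive control: a subsequence we know is in the database must be found;
# if it isn't, our parameters (here, the matching logic) are wrong.
assert search("GATTCACTGG", database) == ["lacZ_fragment"]

# Negative control: a random query should (usually) find nothing;
# if it hits, the search is too permissive.
random.seed(42)
random_query = "".join(random.choice("ACGT") for _ in range(25))
print(search(random_query, database))  # expect an empty list
```

The same pattern, one query that must hit and one that must not, carries over directly to a real BLAST run, just with scoring thresholds instead of exact matching.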
It doesn't have to be long; I'm not asking you to write a book here, just a short definition, a few words. And yes, I will be asking for your answers. Ready? OK, so, does anybody have an answer? How many people do we have in the room today? 40? Ish? 26? 26 plus five, so 30-some people. So we could have 30 answers, and that's fine: you can have 30 right answers. Maybe I'll ask you here: what's your answer? "We said bioinformatics is using computational and algorithmic methods to run a bunch of data systems." Very good, we'll talk about that later. Anybody else? "We started out with data that has been generated, and then going into modeling." "The use of computational methods to collect, analyse, and interpret biological data." Similar to what they said. "Analysis of biological data on the computer." And a different one over there: "It's about integrating biological themes together with the help of computer tools and biological databases." Not all of you have databases in your definitions; databases are very important. "And getting new knowledge about the system under study." That's a pretty broad definition; it involves a lot of things. Similar to yours, right? But that's it: it's the software, it's the databases, it's the thinking, and getting new insights into biology. The other thing in the back of this bioinformatics definition is that even though we do a lot of IT, computer science, math, and so on, what's driving the science here is biology. It's a biologically driven activity.
We're using computer tools to do biology, the same way chemists or biochemists use chemistry to understand biology. And there's the concept of big data, which is also evolving; what I'm talking about today will be different next year. At the top there is a forklift lifting a 5-megabyte hard drive into a plane; that's from 1956. And here is a 5-terabyte hard drive, not much bigger than my laptop; that's from last year, and you can get even bigger ones now, or smaller ones holding more data. About 15 years ago I purchased one terabyte of storage for a data center at a cost of a quarter of a million dollars. So things have evolved, and they will continue to evolve. This relates to a slide Mark showed earlier today: the whole idea of cloud computing, the size of data sets, and what it means to move data from one data center to another. One of the reasons it's becoming impractical to do so is that it now takes years to move large data sets. Five petabytes is about the scale; it's a bit smaller than that, but that's the order of magnitude the ICGC data set is at right now, somewhere between 1 and 5 petabytes. A data set of that size is not going to be movable into another data center anymore. And if you want access to the whole data set, if you want to be comprehensive, if you need the whole data set for the kinds of questions you want to ask, then it becomes very hard, one, to host it, and two, to get it. The folks at Amazon, if you actually talk to them about moving large data sets like that, will tell you it's much faster just to ship the hard drives; as they say, "we're really good at shipping stuff."
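The "years to move" claim is easy to check with back-of-envelope arithmetic. The sustained 1 Gb/s link speed below is my own illustrative assumption, not a number from the slides:

```python
# How long would it take to move a ~5 PB data set over the network?
petabytes = 5
link_gbps = 1  # assumed sustained throughput of a dedicated 1 Gb/s link

total_bits = petabytes * 1e15 * 8          # 5 PB expressed in bits
seconds = total_bits / (link_gbps * 1e9)   # transfer time at 1 Gb/s
days = seconds / 86400

print(f"{days:.0f} days")  # on the order of 460+ days; with real-world
                           # overhead and retries, easily years
```

Even a 10x faster link only brings this down to a month and a half, which is why shipping drives, or moving the compute to the data, wins at this scale.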
So if you have the data in one place and you want it in another place, one option is to physically move the data on an airplane, which is faster than using the internet. The other option, which has become the solution now, is to move your software to where the data is and do the computes there. And that's what we're going to do this week, in a couple of environments. In the first part of the week, we're going to do that on Amazon Web Services, a public commercial cloud provider. In the second half of the week, we're going to do it at the Cancer Genome Collaboratory, a resource at the University of Toronto that's been set up by folks at OICR. So you'll see somewhat different access; what's really going to differ, though, is which data set is where, which data set is present on Amazon versus on the Collaboratory, and which tools and pipelines are easily integrated in each. You'll be playing with both this week and you'll have a chance to see the differences. So, as I just said, data sets are at the petabyte scale, if not exabyte pretty soon. And the data, and the security rules Mark talked about, will live somewhere that's not in your own hands. Basically, it's a lot easier to apply standard access rules around where the data is than to enforce those rules around your laptop. And you should not have any of this data on your laptop anyway, because a laptop is stealable, not physically secure, and so forth.
Now, the data we're going to study this week: we're going to talk about controlled-access data, but the actual data you'll be using is open data, because we didn't get you permissions for controlled-access data; you'll see how to apply for that in the lecture today. When we talk about openness, there are several different things. There's open access, which usually refers to publications; open source, which refers to source code; open data, which refers to data; and open courseware, which refers to the material we teach with in this workshop. So, at the core of bioinformatics are databases, and the way the information is organized is key to understanding what a given database is usable for. It sounds a little obvious, but if you put something in a database, you should be able to get it back out. I've had experiences in large-scale projects where we put things into databases and then couldn't find them anymore. It all comes down to identifiers, fields, and metadata, and where things are. If that's not tracked properly, and the wrong identifier is put in the wrong field, then your query engine, which looks for that metadata in a certain field and doesn't look at any other field, doesn't find it. This has happened in large organizations which will remain nameless because it's too embarrassing, but they're some of the big players on this planet in the bioinformatics world. Buy me a beer and maybe I'll tell you. Obviously, once a database is well organized and well structured, with standardized ways to talk to it, you can write tools that will do that.
So it's not just you doing hand queries through a web interface; you can write to an API. An API, an application programming interface, allows software to talk to your database and generate your own interface, your own website, your own data sets, and so forth. Having that available is a key part of a good database. Another key thing in organizing a database is your data model. Here's an example of a data model rule: whenever I change the DNA sequence in my DNA sequence database, I will modify the unique identifier for that sequence. Say the unique identifier for a sequence is XYZ, and I change one nucleotide; to signal that the sequence has changed, my data model says I change the identifier, because it's not the same sequence anymore, and therefore it gets a new identifier. Databases like EMBL and GenBank do that now; it wasn't like that at the beginning, but now they have identifiers that change whenever the sequence changes. That's part of the data model. Whenever you change a sequence, are you going to change the date stamp on it? Yes. If you change a feature but don't change the sequence, do you change the date stamp, or the unique identifier for the sequence? All these rules you put around a database are important to define up front, and ideally you publish them, or at least put them on your website, to explain what the various fields mean, what the purpose of the database is, what's being maintained, how frequently it gets updated, and so on. Those are key things to think about for a database. So, just a few terms to explain some of the things we think about. Metadata, which is basically data about data: things like create dates, update dates, the submitter's ORCID.
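The versioning rule just described, sequence changes bump the identifier and the date stamp, annotation-only changes bump just the date stamp, can be sketched in a few lines. The class, the method names, and the accession string are all illustrative, not any real database's schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class SequenceRecord:
    """Minimal sketch of a GenBank-style data model (names are made up)."""
    accession: str
    sequence: str
    features: list = field(default_factory=list)
    version: int = 1
    updated: date = field(default_factory=date.today)

    def update_sequence(self, new_seq):
        # Rule: if the nucleotides change, the versioned identifier changes.
        if new_seq != self.sequence:
            self.sequence = new_seq
            self.version += 1
        self.updated = date.today()

    def add_feature(self, feature):
        # Rule: annotation-only changes update the date stamp, not the version.
        self.features.append(feature)
        self.updated = date.today()

    @property
    def versioned_id(self):
        return f"{self.accession}.{self.version}"

rec = SequenceRecord("U12345", "ATGAAACCC")
rec.add_feature("CDS 1..9")
print(rec.versioned_id)      # U12345.1 : features changed, version did not
rec.update_sequence("ATGAAACCG")
print(rec.versioned_id)      # U12345.2 : sequence changed, version bumped
```

Writing the rule down as code is exactly the kind of up-front data-model decision the lecture is talking about: anyone querying the database can rely on "same versioned ID means same nucleotides."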
Does anybody know what ORCID is? Yes, it's an identifier which is unique to a researcher. Each of us has an ORCID ID, and if you don't know yours, you should: you can go to the ORCID website and look it up, or register and get one. Soon all journals will require it anyway; many already do when you submit a paper, certainly the PLOS family and BMC, many of the open-access journals. Ideally, this will percolate to things like PubMed, so you'll be able to query by ORCID ID for a given investigator. That will be much better, because some names, same last name and first initial, occur very frequently in databases but represent different people; ORCID protects against that. Other metadata: publication year, book title, and so forth. The data itself would be, for example, a DNA sequence file, or a COSMIC record. Does anybody know what COSMIC is? It's a somatic variant database; I'll talk about it a bit later. Or protein-protein interaction data. Do you know what that is? We don't cover it in this course. IntAct, at the EBI, is probably the biggest interaction database. That's a database where you say protein A interacts with protein B. It's usually protein-protein, but they also have protein-DNA, protein-RNA, protein-small molecule. So this entity interacts with that entity, and the record is "A binds to B". You may have more metadata saying which part of A binds to which part of B, or that A binds to B only when B is phosphorylated, and so forth. So an entry in a protein-protein interaction database is another example of a record, as are titles of books, or the books themselves. A storage system could be a box on your shelf. That's a storage system.
Oracle and MySQL are commercial and open-source relational database systems; storage could also be binary files or text files on a file system, or bookshelves. A query system could be a list you look at: you have a list of everything on your bookshelf, and when you're looking for a book, you just look down the list. A catalog is a query system. An index file is a query system. SQL, the Structured Query Language, is a query system. Elasticsearch, now used in cloud computing to search big data, is another query system. And grep, which is a UNIX tool, which you all practiced last week, right? Grep? Yes? OK, good, we'll test you on that later. No? You already forgot? Grep is a good one; we'll come back to it later. And then the information system: the information system is the whole enchilada. Google is an information system. The Library of Congress is an information system. Entrez at NCBI is an information system. Ensembl at the EBI. The UCSC Genome Browser, which is actually a browser for multiple organisms, is a whole information system with the parts talking to each other. And ICGC is also a big information system that we'll talk a bit about today. So the information system is the big envelope, the thing you label and often cite, but underneath there are many layers that are key to understand as well. NCBI itself gives you a system to submit data, to download data, to learn, with a whole teaching section, a development and tools section, lots of analysis tools, and research: NCBI has a core group of people who do research, so their research output is available too. So, we talked about DNA sequences and how they're shared.
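To tie the layers together, here is a tiny relational example using SQLite (Python's built-in relational database): the file is the storage system, SQL is the query system, and together they make a miniature information system. The table and field names, and the records, are invented for illustration:

```python
import sqlite3

# Storage system: an in-memory SQLite database with one table of records.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        accession TEXT PRIMARY KEY,
        organism  TEXT,
        length    INTEGER
    )
""")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [("U00001", "Homo sapiens", 2430),
     ("U00002", "Mus musculus", 1188),
     ("U00003", "Homo sapiens", 951)],
)

# Query system: SQL. Note that the engine only finds what sits in the field
# it searches -- which is exactly how records "disappear" when metadata is
# entered into the wrong field.
rows = conn.execute(
    "SELECT accession FROM records WHERE organism = ? ORDER BY accession",
    ("Homo sapiens",),
).fetchall()
print(rows)  # [('U00001',), ('U00003',)]
```

If "Homo sapiens" had been typed into the length column of one record, this query would silently miss it, the failure mode described earlier.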
The nucleotide sequence databases are part of the International Nucleotide Sequence Database Collaboration, where three centers, the Europeans, the Americans, and the Japanese, share the content of their respective databases. There are three points of entry into one database: sequences submitted through Europe, Japan, or the U.S. (NCBI, at NIH in Bethesda) end up, via GenBank, the ENA (European Nucleotide Archive), or DDBJ (DNA Data Bank of Japan), in one repository. Each center issues accession numbers for the records submitted to it, but a GenBank accession number, say, will also be in the ENA database and will be queryable with the same accession number; it doesn't get a European accession number because it already got one from the U.S. These three databases sync up every night, so within a day they're the same database as each other. It's a really good example of sharing the load of submission, updates, and maintenance of the archive and the whole repository across three different countries. The other benefit, which I don't think was considered at the time, is that you basically have a help desk in three different time zones. Whenever you're awake, you can call somebody for help; if you're awake in the middle of the night in North America, hopefully you speak good Japanese and can call the Japanese staff. They all have email too, of course, but there is basically help around the world for this resource. NCBI, of course, has many other things besides sequence data, and if you go to its global query page and enter the query all[filter], with the filter keyword in square brackets, it gives you the number of records in each resource, which gives you an idea of how big each one is.
If you look at previous lecture notes, I have this slide every year and I update it every year, so you can see how the resources have grown over the years if you want to do that exercise. NCBI offers PubMed and PMC; dbGaP and ClinVar for health data; WGS and RefSeq, as examples, for genomic data; sequences and 3D structures in the protein world; and in the chemical world, PubChem and BioSystems. That's just a very quick overview of the types of things they do. Now I'm going to switch a little and talk about file formats, specifically for DNA sequences, and I'm not going to go deep: Jared, for example, is going to go into the details of the BAM file format tomorrow, so I'm not going to do that today, and others later in the week may cover VCF files, so you'll get the various file formats you have to work with throughout the week. On the UCSC website, where the UCSC Genome Browser lives, there's actually a really good table with definitions of many of the various file formats used in the DNA and RNA, the nucleotide, world, and I invite you to go look at that if you're interested. Now, the databases: I split them into two classes.
I've done that for a few years now: two categories, one primary or archival, the other secondary or curated, although some databases fit on both sides. GenBank is not really curated, although some curation goes on; curation being, as Trevor was describing, the handwork that has to be done on records and that is hard to automate. Archival resources include GenBank; UniProt, a protein resource; PubMed, which is all the abstracts; PubMed Central, which has the full text of open-access articles; IntAct, the protein-protein interaction database; the ICGC data; and EGA, which is the equivalent of dbGaP on the European side. Then there are the secondary databases, where some curation takes place. RefSeq, which was invented at NCBI, is one. The issue with GenBank records is that they're archival and belong to the user, the submitter, who submitted the record. NCBI wanted a system where they could update records, and they couldn't update GenBank records, because those belong to the submitters, but the records are part of an open-access database. Take, for example, beta-gal, a very common bacterial gene which has probably been sequenced 10,000 times, so there are 10,000 entries of beta-gal in GenBank. What we, at the time, wanted to do with RefSeq was to have one copy of everything: one copy of the E. coli genome and all the genes therein, one copy of the human genome and all the genes therein, and so on for every organism, mouse and so forth. RefSeq is the reference sequence of each organism in a database. So how does NCBI get that data? Basically, they go through the existing GenBank content and take the best...
What they think is the best record, they annotate in the best possible way: they curate it, and they curate it in a standard way across all organisms, so everything is annotated the same way. So for the mRNAs, the RefSeq sequences at NCBI are a copy of every mRNA for which a record exists; each one references an existing record that was produced by some other group. They've taken records from other groups and built a uniformly annotated data set from existing material in GenBank, and that's RefSeq. Similarly, Gene is a curation of all the genes in the various organisms, and Taxonomy is a curation of all the different organisms themselves. There's also UniProt, and OMIM, Online Mendelian Inheritance in Man, which is basically the phenotype data associated with the human genes, the genetic diseases associated with the various genes, and so forth. Then there are the MODs, the model organism databases, and ICGC. Why are the MODs so important? Because that's where all the experiments have been done: zebrafish, C. elegans, yeast, and so forth. A lot of the inference, the understanding of what genes do, of function, has been done in the model organisms, and then there's homology between the model organism genes and the human genes. By projection, not always correctly, and sometimes needing validation some other way, but often...
we speculate on what the function of a gene is in humans when we know what it does in other organisms. That's how function determined through the MODs helps us understand what a gene does in humans. Then there's NAR, the journal Nucleic Acids Research, which has a yearly database issue. It used to publish the top databases every year, largely the same ones each time, with a few new ones added over time. But that issue got so big they had to split it in two: a few articles are repeated every year, but most are new for that year, and any database that's in the "in group" gets republished every second year. So if you want to look at all the top databases in the world, you have to look at two issues' worth of NAR, this year's and last year's. Every January there's a new issue: NCBI, EBI, DDBJ, and a few others get included every year, but all the other ones every second year. For example, this January 2017 issue and the January 2018 one together will cover the current landscape of databases, and there are databases on every topic: a lot of cancer stuff of course, but also nucleotides, proteins, protein interactions, 3D structures, various types of RNA, and so forth. Now I'm going to spend a couple of slides on the GenBank flat file. I used to be in charge of GenBank when I was at NCBI. It's funny: I was at a bioinformatics conference many, many years ago where they asked, how many people have parsed GenBank, have written tools to parse GenBank? And the whole room raised their hands. The thing is, the GenBank flat file is written in a format not meant to be parsed: it's meant to be read by humans, not by computers. But of course there are many people who have done this
activity over the years, to interpret it or to keep their own copy of GenBank with whatever tools they have. That said, whenever you have a gene, an RNA, or a protein, it will be represented in GenBank format, and it's useful to understand how this database is organized. There's a header part, which has information that affects the whole record: the title of the record, any publication, the organism, things like that. Then there are features, which are location-specific within the record: there's a protein encoded from here to there, there's an exon from here to there, that kind of information. And then there's the actual DNA sequence. What I'm showing here is the format for a nucleotide record; there's also a GenPept format, which is basically the protein version of this, but it's the same idea. The thing to remember about GenBank flat files is that they're meant for humans to read. It's actually very easy to read a GenBank file, and not very easy for a machine to read one, because there are no delimiters; a bunch of things are missing for a machine to parse it, and reading it properly almost requires natural language processing. GenBank is where people submitted sequences; it's still an active place people submit to, but less so on the human side, because human sequencing projects are now mostly whole genomes and exomes, and that's a whole different pipeline that doesn't end up in GenBank; it ends up in other places. So GenBank is less used by human biologists, but everybody else is still heavily dependent on it. And as I mentioned, it's the place RefSeq records are taken from, and gene information is also taken out of GenBank. Now I'm going to
quickly go over the next few slides on the accession number space. Accession numbers in GenBank are either 1 letter plus 5 digits or 2 letters plus 6 digits, and now, for whole-genome records, 4 letters plus 2 digits plus 6 or more digits, basically because they're running out of space: they're using up all the letters, so they have to add letters, and so forth. Proteins have a smaller space, since there are fewer proteins: 1 letter plus 5 digits and 3 letters plus 5 digits only. RefSeq also has its own space, but with a different nomenclature: two letters, an underscore, and then a number; I'll talk a little about that. Now, for about 10 years, accessions have all had a version number, and the version number means that version 1 versus version 2 of the same accession number is a different nucleotide sequence. Everything else will be the same, except the time stamp, which gets updated as well, but it means the sequence is different. It could be a one-nucleotide difference or a megabase difference; version 1 versus version 2 doesn't say anything about that, it just tells you it's a different sequence under the same accession number. In a paper you'll see an accession number reported, perhaps without the version number (it should have one, but it may not); when you look it up, you'll see version 1, version 2, version 34, whatever. The version changes when the sequence changes; that's all you need to know. RefSeq, as I mentioned, also uses different letter prefixes: NC_ is usually a chromosome, NM_ is usually an mRNA, NP_ is usually a protein, and so forth. They also have versions, so you'll have something like NP_
34 by 6 dot 1 dot 2 dot 3 means it's the first, second or third version of that protein when you do a query with just the accession number the database always reports you the most the latest version so that would be the highest number version often it's dot 1 because most this may come as a shock to you but most records in GenBank don't get updated I know it's a bit of a shocker but when they do then you know likewise for RefSeq if they get updated and they do then there's an increase in the version number there is actually some models so some of them come from gene prediction and those are quite not rare but they're not as common for certain organism and that's what the X means the WGS so actually people Mark referred to that following the Bermuda people would submit sequences without doing much work they would do a first pass assembly they would get some content then they would submit it to the database and they would do that sometimes as was mentioned on a 24 hour basis and so you get a lot of churning a lot of IDs and a lot of things happening and so that's what they invented a new sort of ID new space for whole genome sequencing and that's not as common for human as it is for all the other organisms and there's hundreds of thousands of those so there's a lot of things in GenBank and other so fast day files so yes yes yes so there's all sorts of reason I used to have slides in my deck where I actually I was responsible for a version number change so I discovered a mistake in a record and I emailed it to the database and they fixed the mistake which meant removing some nucleotides from the record and then it increased so there's an update most of the time it's a submitter who's found a mistake and they change they change fix it although it's not very common they may fuse two records into one and then so you take one of the of the previously existing numbers and you update the version of that one and then you put the other record as a secondary number so that 
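The accession shapes just described can be captured with a couple of regular expressions. This is an illustrative sketch, not the full GenBank/RefSeq grammar, which has more record types than I've listed here:

```python
import re

# Patterns sketched from the formats described above: 1 letter + 5 digits,
# 2 letters + 6 digits, and 4 letters + 2 digits + 6 or more digits for WGS.
GENBANK = re.compile(r"^[A-Z]\d{5}$|^[A-Z]{2}\d{6}$|^[A-Z]{4}\d{2}\d{6,}$")

# RefSeq: a two-letter prefix (NC_, NM_, NP_, XM_, ...), an underscore,
# digits, and an optional ".version" suffix.
REFSEQ = re.compile(r"^(?P<prefix>[A-Z]{2})_(?P<number>\d+)(?:\.(?P<version>\d+))?$")

def parse_refseq(acc):
    """Split a RefSeq accession like 'NM_000546.6' into (prefix, number, version).

    Returns None if the string doesn't look like a RefSeq accession;
    version is None when the accession is cited without one.
    """
    m = REFSEQ.match(acc)
    if not m:
        return None
    return m.group("prefix"), m.group("number"), m.group("version")
```

So `parse_refseq("NM_000546.6")` gives you the prefix, the number, and the version separately, and an accession cited without its version still parses, with version left empty.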
That way you don't lose track of anything, and that would be important if they knew the two records were actually one gene, with one half of the gene in one piece of DNA and the other half in another, and they decided to merge them. But the most common reason, though less so recently because people have gotten much better at catching these things, is contamination, especially primers or vector at the ends of sequences that weren't properly snipped away. They'll clean those up; sometimes it's 10 nucleotides, sometimes 100, sometimes 500, so the record shrinks by that much, and the version has to change along with it. Okay?

So FASTA is a common file format that comes from the FASTA program, which is related to BLAST; it's a fast sequence alignment search tool. The format used by the FASTA program became a standard: it was so simple that everybody loved it, and it became the standard input format for a lot of tools. This week we'll hear about FASTQ and other file formats that are derivatives of FASTA, so I don't want to spend too much time on it, but basically you have a greater-than sign, a line of description, and then the sequence. You can repeat that: greater-than sign, identifier, then the data, and so forth. The description line can have various structures; NCBI does it one way, and that has actually evolved over time, it doesn't use GI numbers anymore. But a FASTA file at its core is a greater-than sign and then nucleotide or protein sequence, not both together, so you can have a FASTA file of proteins or a FASTA file of nucleotides, mRNA, DNA and so forth.

This is one of the core articles I mentioned from the NAR database issue; this one is from NCBI.
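To make the format concrete, here's a minimal FASTA reader. Real pipelines would use an established parser (Biopython's, for example), so treat this as a sketch of the format rather than production code:

```python
def read_fasta(lines):
    """Minimal FASTA reader: yields (description, sequence) pairs.

    A record starts at a '>' line; its sequence may span multiple lines
    and is joined into one string. Works on any iterable of lines,
    so you can pass an open file handle or a list of strings.
    """
    desc, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if desc is not None:          # emit the previous record
                yield desc, "".join(seq)
            desc, seq = line[1:], []      # description is everything after '>'
        elif line:
            seq.append(line)
    if desc is not None:                  # emit the final record
        yield desc, "".join(seq)
```

For example, feeding it the lines `>seq1 demo`, `ACGT`, `TTGA`, `>seq2`, `GGGC` yields two records, with the multi-line sequence of the first joined into `ACGTTTGA`.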
They talk about all their various resources.

There are a number of genome browsers out there; the three popular ones, I would say, are UCSC, Ensembl and NCBI, and they're in order of preference: UCSC is the most commonly used one, Ensembl maybe ten times less, and NCBI ten times less again, in that order. And then there's a new one, because of course we had to build one too: ICGC. The ICGC also has a genome browser; it reuses a lot of the features from the other ones, it didn't totally reinvent everything. This is information about SOD1, a gene involved in and often mutated in cancer. The best place to look for gene information is probably NCBI Entrez Gene: there's a gene page with all the information about a given gene, its structure, all the various transcripts, the proteins, which protein interacts with what, and so forth. So if you're interested in a given gene, on the ICGC browser page there's a link to the Entrez Gene page; that's just an example of how you can get more information about your gene of interest.

Another important concept when you're looking at databases, a very important piece of metadata, and again something Mark referred to, is the reference genome. There's a group whose job is to standardize the reference genome, and for humans it's, well, controversial is a big word, I wouldn't say controversial, but it's a debated issue as to what the best reference is. (Ten minutes? Okay, I'm going to speed up very fast now, but I like talking, as you may have realized.) This one is important because you need to know which reference you're looking at. The challenge, of course, is that we're all different, so we each have a different genome, and the original reference genome was actually made by assembling multiple individuals' genomes together. So the reference genome does not exist in real life; it's not a real person's genome, and that's a bit of a challenge. But there's a website and a group that maintains the reference genomes, with a page for the human reference, and if you go to the UCSC browser there's a listing of all the various reference genome versions they've used. We're going to be using hg38, whose other name is GRCh38, and that one is already five years old. What you may notice in this table is that initially there were frequent updates, and now the updates are further and further apart. The main reason is that the reference we're using is pretty good, and changing reference is a big, big job: realigning everything to a new reference is a big challenge. I think Jared's going to talk about this tomorrow, but we're really quite stuck with our reference genomes, in a way.

Historical perspective on human genome data: I'm going to skip this slide; it covers some of the things Mark talked about. A number of large-scale studies were done in cancer, and sequencing lots of genomes turned out to be very useful. Out of that we learned that there's a lot of heterogeneity in every cancer sample, a lot of abnormalities, not just at the nucleotide level but rearrangements too, and that the way you prepare your samples and your libraries for sequencing matters a lot. So out of all that, in 2007 we decided to start the ICGC, the International Cancer Genome Consortium, at the same time the US was doing TCGA, The Cancer Genome Atlas. TCGA is part of the ICGC, but as I mentioned earlier there are different access rules. The mission is similar: we wanted to sequence genomes at large scale, 500 tumors per tumor type, across 50 different tumor types. That's 25,000 tumors, matched with 25,000 normal DNAs, so 50,000 genomes, done over about ten years, nine right now, so it's going to end.
It's ending soon; we're not quite at 25,000 but we're almost there. And there needs to be information about the project, the patient, the tumor, the samples, all of that metadata, plus analysis and interpretation. This is worldwide: there were 18 countries and lots of different projects, some countries doing more than one; the US, UK, France and China are examples of countries that have done multiple, and we did about three different projects in Canada. There's a map of all these, and your typical growth curves of data over time.

The ICGC home page has a project page for each of the tumor types that were done; you click on one and get details about the project. There's information about the data portal and DACO, the Data Access Compliance Office, with login and so forth. There's the DCC, the Data Coordination Centre, where all the data lives, and you can do simple queries, ask for a gene and things like that, or go to specific parts of the portal to help you find things. There are summaries of all the genes, and on the left side there's what we call faceted search, or facets: you click and it right away narrows your query to the data sets you want, and then you can download or study those specific data sets further. There are project entity pages, so there'll be breast cancer and the various types of breast cancer, pancreatic, liver and so forth; each project has its entity page. There are gene entity pages, so each gene has its own page with a link to the browser and to all the various gene resources. Reactome, which we're going to talk about at the end of this workshop, is also integrated into the ICGC. And each mutation has a single identifier, so you can uniquely identify a mutation, and that's kept from release to release; that's part of the data model at the ICGC.
And I mentioned there's a genome browser as well, and ways to access and download the data. I'm going fast because I'm running out of time and I still have a few slides, but there are also ways to download large data sets, and I'll show you in a few seconds how to download them to your cloud infrastructure. For example, if you want all the open-access mutation data in a single file for all of the cancer projects, that's available on the portal as well. And there's some Unixy, greppy type stuff you can do, and I invite you to do on your own: download a file and count, for example, how many mutations of a given type are in it.

As I mentioned, the ICGC portal is a merger of the ICGC and the TCGA data: it merges all the open data, but not the controlled-access data, which, as Mark referred to earlier, is all the identifiable human data. TCGA is part of ICGC; they have different tumor types, different definitions of what is controlled access versus open access, and that's kind of important, plus different data access rules and different geography, with various countries under different jurisdictions. So ICGC has rules about what is open and closed, and TCGA has mostly the same rules, but there's one big difference, and it comes from the NIH. Mutations themselves are not identifiable; they're independent of the germline. Of course, if you report a mutation at a germline position, that becomes an identifying statement, because you've just said this position has been modified from where there used to be a germline variant, so those are in a gray zone. But if mutations come from exome data, they're considered open access, whether they come from TCGA or ICGC. If mutations come from whole-genome data, from TCGA they're considered controlled access, but from ICGC they're open access. So why is the same thing called open in one place and controlled in the other?
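Going back to that Unixy, greppy exercise for a second, the same counting could be sketched in Python. The column names here are hypothetical; check the header of the release file you actually download:

```python
import csv
from collections import Counter

def count_by_column(path, column):
    """Tally rows of a tab-separated mutation export by one column,
    e.g. a project code or a consequence type.

    Roughly the Python equivalent of `cut -f N file | sort | uniq -c`.
    Column names vary by portal release, so inspect the header first.
    """
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        return Counter(row[column] for row in reader)
```

Calling `count_by_column("mutations.tsv", "project_code")` would then give you the number of mutation rows per project, and swapping in another column name gives you counts by consequence type, chromosome, and so on.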
Basically, the NIH is somewhat paranoid and doesn't trust people doing sequencing and their ability to call a mutation: they think there's too much leakage of germline variants in a whole-genome experiment versus an exome experiment, and therefore that a whole-genome mutation file contains enough germline variants to make it controlled access. You got all that? Yeah? Okay, you can ask me later. So there's a file there that has all the somatic variants, but it only has variants from TCGA exomes, ICGC exomes, and ICGC whole genomes; it does not have variants from TCGA whole genomes.

Similarities, we talked about that earlier, so we'll skip over it. You have to destroy the data after your period of access; you're supposed to destroy it. Again, the whole concept of destroying what's on your laptop, and the ICGC police being able to come over and verify that you've actually done it, is the audit that was referred to. I don't think it's ever been done, but technically it's allowed to be done.

COSMIC, which we talked about, is a repository of all the somatic mutations from this project and many other projects as well. So COSMIC holds somatic mutations from around the world, not just whole genomes, not just whole exomes, but targeted panels and so forth; it's all at COSMIC, and there are files there you can go search and look at.

Quickly, on how to access the data: as was mentioned, there's a website. You apply, you register, you fill out the form, you press the "am I ready?" button, and then all the fields turn red because you didn't fill them out properly, and then you do it properly and it's all green, and you get a PDF. You sign this PDF, and it also gets signed by somebody at your institution who is able to fire you should you not obey the rules set by the ICGC, like the data-sharing restrictions and no re-identifying, and so forth.
All those rules you agreed to, you said you would follow, and you get it signed by the person at your institution who can fire you, so you make sure you don't break them. Then we use OpenID, so you can register, and once you're in the system you have access to the data. There's a moratorium with respect to publication: the idea is that a certain time after the data was generated, it becomes open by default. It's a bit of a complicated rule, so what we do is keep a list of which data sets are open and which are not, updated at every release. And there are two repositories, as I mentioned: dbGaP for TCGA, and EGA for everybody else, so the non-US data lives in Europe. I know I'm speeding through, so you'll have to speed through too; there's lots of documentation.

Quickly, I'm going to tell you about PCAWG, because that's going to come back later this week. It's basically a pan-cancer analysis of whole genomes: we took all the ICGC data sets that had whole genomes, which is only a fraction, about 10% of the 25,000, so about 2,800 were whole genomes, and we did a full analysis on those, similar to what had been done by TCGA for exomes. These are the leaders of that project: Peter, Gad, Jan, Lincoln and Josh. We looked at about 2,800 tumor-normal pairs; some of them had RNA, some had epigenomic data, and we had 16 working groups. A lot of the papers, this slide shows 28 but it's more like 42 now, are in the publication or pre-publication stage, so many of them are on bioRxiv, which is something I pushed for and that happened, so I'm really happy to promote it here. It will be the first pan-cancer analysis of whole genomes with integration of RNA, simple somatic mutations, copy number variations, methylation and germline variants, and all the papers that come out of it are all going to be open-access
papers. All the methods are going to be in Docker, so you'll be able to run them with Docker, and you'll learn about Docker later this week. They're also available on multiple cloud infrastructures, so maybe Amazon's not your thing, maybe you prefer the Collaboratory, maybe you prefer another cloud; there are several options for many of them. The PCAWG working groups have all looked at various things, and they're all in the process of publishing papers. Just as an example, I was in the technical working group, and we developed the pipelines for the alignment, so all the genomes were aligned the same way. It turns out that if you align everything one standard way, it's different than letting each group align things their own way, so we did it all one standard way for the PCAWG project. It took a long time; a lot of it was moving files around different clouds across the planet and running various tools. It took forever, really. Dockstore maintains all our Docker containers, and we'll talk more about that this week.

Here's information on how to get the data and where some of it lives. The TCGA data now lives at the GDC, the Genomic Data Commons, at the NIH; actually it's not at the NIH, it's in Chicago, but it's a trusted partner of the NIH. The non-TCGA ICGC data lives at the Cancer Genome Collaboratory, and we'll visit that later this week as well; there you have all sorts of tools. And there are ways from the ICGC to find out where the data is: lists of which portal holds which data set.

So the challenge I mentioned earlier today is about getting eyeballs on the data. You want to make things as open as possible, but you also want to respect the privacy and confidentiality of the patients. A sequence is identifying of a patient, be it GWAS or exome or whole genome, so you have to be careful about how that data is shared. But if you make a system that makes it too hard for students, for people, to look at the data, then you don't get anybody looking at it, and you've wasted a lot of money. So I think we need to support and encourage a culture of sharing as openly as possible, keeping in line with the various rules, and the consortia have done a great job of that. As I mentioned, access to data is essential; making data FAIR, findable, accessible, interoperable and reusable, is very hard work but essential; and the last message is to share the things that are shareable, so people can see how great you are. Very important message. These are all the sites for the ICGC data sets.

And this is the last slide: I'm actually nominated for a Benjamin Franklin Award, and you should go vote for me. This slide is not in your printed slide deck, but it will be on the GitHub page. If you're on Twitter, just go to my Twitter page, @bffo, and you'll see a link to the Benjamin Franklin Award, to go vote for me or for somebody else, but please vote. Thank you. Any questions? Okay, we're going to move on; you can ask your questions during the coffee break, I'm here.

So we're going to move on to the IGV section. We have 20 minutes for 50 slides, but don't worry, there's a hands-on tutorial for this where you can ask more questions. How many of you have used IGV? About half, so this should be a recap for some of you and brand new for the rest. IGV stands for the Integrative Genomics Viewer. It's a desktop application, so you don't need to be on the cloud or anywhere special to use it; it runs on your own computer, and you can view basically any sort of genomics data with it: epigenomics, microarrays, next-gen sequencing alignments, RNA sequencing, pretty much anything out there you can put on it and see in a visual way. So you can explore your large data sets, you can integrate data sets and clinical data, and you can also automate tasks, so you
can use batch files to have things run while you're sleeping, which is really awesome; that's the great thing about bioinformatics, it can happen while you're not at work. You can get data into IGV from pretty much anywhere: FTP servers, Amazon web services, local files, TCGA, GenomeSpace; anywhere you can pull a file from, you can get it into IGV. We're going to go through some basics of IGV just to get you started; there are more things you can do with IGV that I'm not going to cover today, but this will give you a good grounding to build on.

Step one is to launch IGV. You need to download it from the Broad Institute; there are some buttons on the side, and you need to select the version of IGV you'd like, because it comes in different flavors. Pick the one that matches your needs as well as the memory you have on your computer: if you pick one that's too large, it's not going to run on your computer, and it's not going to be a good day. Once you've downloaded it, installed it, and opened it, you can select your genome from the drop-down menu; if you don't see your genome there, you can add the genome you want, and then you can load data on top of that genome. For this it is very, very, very important that the genome you choose in IGV matches the genome you aligned your data to, because if they don't match, you're not going to see very much, or at least not the thing you're interested in. You can load data from a server; in this example we're using File, Load from Server, and choosing the tutorial files there. It will load up the files you've chosen, and the screen layout will show you what chromosomes you have, the data you've loaded, and the different reference annotations at the bottom, which you can see here.
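Coming back to those batch files for a second: IGV batch scripts are plain text files of documented commands (new, genome, load, goto, snapshotDirectory, snapshot, exit), and you can generate one from a script. A sketch, where the genome id, BAM path and region are made-up examples:

```python
def igv_batch(genome, bam, regions, outdir="snapshots"):
    """Build an IGV batch script that loads one BAM and takes a
    snapshot of each region in `regions`.

    Commands follow IGV's documented batch syntax; the inputs here are
    illustrative, not real files. Run the result with igv -b script.txt.
    """
    lines = [
        "new",                          # clear the current session
        f"genome {genome}",             # must match the reference you aligned to
        f"load {bam}",
        f"snapshotDirectory {outdir}",  # where the PNGs will be written
    ]
    for region in regions:
        lines.append(f"goto {region}")
        # ':' is not filename-friendly, so swap it for '_'
        lines.append(f"snapshot {region.replace(':', '_')}.png")
    lines.append("exit")
    return "\n".join(lines)
```

You'd write the returned string to a file and point IGV at it with its batch option, and it will click through your regions and save an image for each one while you're asleep.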
These are called tracks; track is fairly common terminology for genomics. Then you have your menus at the top and your toolbar. As for file formats, you can load basically any next-gen sequencing file format, any sequencing-type format you can think of. We'll be using BAM files, and that's pretty much all we'll use for this.

Okay, so, to view your alignment: once you've chosen your reference sequence and loaded your BAM file, and you also need a BAM index file, you can view your reads. At first you're not really seeing anything; you need to zoom in to see your reads. As you zoom in you'll see these little hatch marks, and as you zoom in more and more they get bigger and bigger. But how far do you need to zoom in to see the alignments? That's a setting you set yourself, in the options. Higher values require more memory, so if you're using a MacBook Air, do not choose a high value. For low-coverage files it's okay to use the higher values, because there's less data to load, and for very deep coverage files, say circulating tumor cells or tumor DNA where you have 25,000x coverage, you'd want a very low value, because otherwise you can't see anything and it would take up enormous memory on your computer.

Okay, so once you zoom in, your reads will look like this: you have your reference genome at the bottom, with the different colors representing your T's, C's, G's and A's, and then you have your reads, where the different colors on the reads are your mismatches. The gray bars are exact matches to the reference, and the colors are the mismatches, though you can change that view. You can always look at structural variation and single-nucleotide variants this way, so there are some important metrics for SNVs.
The first is coverage: if you have basically two reads covering a spot and one read says there's a mismatch and the other one doesn't, which one do you believe? Then there's the amount of support: how many reads actually carry the mismatch? If you have 50 reads covering a spot and one of those reads has the mismatch, you don't have support for that mismatch. Then strand bias and PCR artifacts: is the mismatch only on reads going in one direction, say only your forward reads? If that's happening, that's a problem, and it might be a PCR artifact. Where are your mismatches: are they only on poorly mapped reads? And base qualities: are the mismatches at positions with poor base quality? For structural variants, the metrics are again coverage, then insert size, and then read-pair orientation, and we'll cover those in a minute.

Okay, so again we have our reference genome at the bottom, and we've zoomed way in. At the top we have two colors: these here are C's, and the red ones here are T's, and we have lots of support and lots of coverage for this, so we would accept it as a good-quality SNV. You can also color the reads by which direction they're going: in this case we have the reverse reads in blue and the forward ones in red, and we can see in this one that the mismatched C's are only on reads going in one direction, so that would be a problem.

If you're doing paired-end sequencing with Illumina, your pairs can give you information about structural events: are they too far apart, are they too close together, and also, are they on different chromosomes? That's a big one. You can change the alignment colors in IGV to show you this information: does the pair match the inferred insert size, or is the pair orientation weird? Are most of you familiar with paired-end sequencing? Anyone not familiar? It's okay to put your hand up. Okay, so I can go over this really quickly.
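Those evidence checks, depth, support, and strand balance, can be sketched as a toy triage function. The thresholds here are made up for illustration and are not from any published caller:

```python
def snv_evidence(ref_fwd, ref_rev, alt_fwd, alt_rev, min_support=3):
    """Toy triage of an SNV call from strand-split read counts.

    ref_fwd/ref_rev: reads matching the reference, by strand.
    alt_fwd/alt_rev: reads carrying the mismatch, by strand.
    min_support is an illustrative threshold, not a recommendation.
    """
    depth = ref_fwd + ref_rev + alt_fwd + alt_rev
    alt = alt_fwd + alt_rev
    vaf = alt / depth if depth else 0.0   # variant allele fraction
    # Crude strand-bias flag: all mismatch evidence on one strand.
    strand_biased = alt > 0 and (alt_fwd == 0 or alt_rev == 0)
    return {
        "depth": depth,
        "vaf": round(vaf, 3),
        "supported": alt >= min_support,
        "strand_biased": strand_biased,
    }
```

So a site with 20 forward and 22 reverse reference reads plus 5 forward and 4 reverse mismatch reads looks supported and strand-balanced, while 6 mismatches that are all on the forward strand trip the strand-bias flag, exactly the pattern you'd squint at in IGV.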
You have your DNA or cDNA, you fragment it, and you have an expected fragment size, but it's a distribution: some fragments are too small, some are too big, but you pick the stuff in the middle. From each fragment you get your forward and reverse reads, and the distance from the tail of one to the tail of the other is your insert size. When you align to the reference it works the same way: tail to tail is your inferred insert size. So your inferred insert size comes from the reads you've aligned, not so much from what you chose for sequencing, because the program can't go back and read your mind about what you chose.

Your inferred insert size can help you find deletions and insertions: if there's a deletion in your sample, the pairs will map too far apart on the reference and the inferred insert size will be larger than expected, and if there's an insertion they'll map too close together and it will be smaller. For interchromosomal rearrangements you're not really going to get an inferred insert size, because the reads are far too far apart; how would you even measure the insert size from chromosome 1 to, say, chromosome 8?

So what is the effect of a deletion on inferred insert size? Take our reference genome and our subject, and delete some stuff in the middle of the subject, so the stuff in the middle is gone and this bit is smaller than the bit up there. We expect a read from here and a read from there, but what we get is a read here and a read there, without the big gap in the middle. The subject fragment is obviously smaller than its span on the reference, so the inferred insert size is going to be much greater than expected, because the inferred insert size comes from all the reads we've aligned, while the actual fragment is much smaller than that inferred size. To color these in IGV, you right-click on your alignments, right-click in this space, and a menu pops up.
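The inferred-insert-size logic can be sketched numerically. The expected size and tolerance below are made-up numbers; in practice the expected distribution comes from your own library:

```python
def inferred_insert(leftmost_start, rightmost_end):
    """Inferred insert size on the reference: the outer distance from
    the start of the leftmost read to the end of its mate."""
    return rightmost_end - leftmost_start

def flag_pair(insert, expected=350, tolerance=150):
    """Classify one pair by its inferred insert size.

    A deletion in the sample stretches pairs out on the reference
    (insert too large); an insertion squeezes them (insert too small).
    expected/tolerance are illustrative numbers, not calibrated values.
    """
    if insert > expected + tolerance:
        return "possible deletion"
    if insert < expected - tolerance:
        return "possible insertion"
    return "normal"
```

So a pair spanning a deleted region might come back with an inferred insert of 2,000 instead of ~350 and get flagged as a possible deletion, which is exactly what the red pairs in the IGV coloring are showing you.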
Choose "color alignments by insert size" and the reads change from gray to different colors, so we can see them here: these little red boxes have their mates where they're not supposed to be, and you can see the drop in coverage. You have lots and lots of coverage here and lots and lots of coverage there, but in the middle bit you don't have as much coverage as you would expect. If the insert size is smaller than expected the pair is blue, larger than expected it's red, and pairs that map to different chromosomes get their own funky color schemes that you'd have to look up. For rearrangements where you have material from one chromosome on a different chromosome, in the normal it looks something like this, and in your tumor sample you'd have all these little bizarre colors telling you that one end aligns to chromosome 1, and if we look all the way over on chromosome 6, we find the other ends there.

You can also look at your read-pair orientations to detect inversions, duplications, translocations, or some other complex, really weird rearrangement that can happen in cancer. For an inversion: this is our reference genome again, with point A and point B at the ends where we expect them, and in our subject this piece has been flipped so that B comes before A. Now when we get a read pair here in our subject, they would face inward like that: the first read will be where we expect it, and the second read of that pair will actually be at the other end, and the same happens with A. So this is what we get: we expect this read to be where it's supposed to be, and the other one goes all the way over here, and you can see that the reads over the A section all point in the same direction, and the reads over the B section all point in the same direction, and that's not what they're supposed to do; they're supposed to point towards each other. So you have the left-side pairs and the right-side pairs, and we can find these again the same way.
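The inversion signature just described, mates pointing the same way instead of towards each other, can be sketched as a small classifier. Real SV callers work from BAM flags and mapping qualities, so this is only illustrative:

```python
def pair_orientation(strand1, strand2, pos1, pos2):
    """Classify a read pair's orientation from strands ('+'/'-') and positions.

    For a standard Illumina FR library, the leftmost read should be on
    the forward strand and its mate on the reverse strand, pointing
    towards each other. Same-strand pairs are the inversion signature.
    Simplified sketch: ignores flags, mapping quality, and chromosomes.
    """
    # Order the strands so `left` belongs to the leftmost read.
    left, right = (strand1, strand2) if pos1 <= pos2 else (strand2, strand1)
    if left == right:
        return "same-strand (inversion signal)"
    if left == "+" and right == "-":
        return "expected FR"
    return "RF (duplication/other signal)"
```

So a pair at positions 100 and 500 on strands `+`/`-` is the expected inward-facing orientation, while `+`/`+` or `-`/`-` pairs, which is what the A-section and B-section reads above are doing, come back flagged as an inversion signal.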
Right-click on the alignments, choose "color alignments by pair orientation", and the colors change to what you see here: these are all unexpected pairs, and again you can see the drops in coverage where they happen; drops in coverage are usually a bad thing. Okay, this last slide just gives a color code for what's happening in all these really weird cases, and with that I've reached the end.