My name is Michelle Brazos. I work here at the Ontario Institute for Cancer Research. Okay, in this module we're going to do databases and visualization tools, and then we're going to conclude by getting you up on the Amazon cloud, so you can spend the rest of the week actually doing the cancer genomics workshop. Okay, this is what Francis would like to disclaim here. He doesn't profit from any of this information. However, he is a former employee of NCBI. I think he was there for six years, so there's a lot of NCBI content in here. He is a current employee of OICR, and he makes no apologies for that, because he's going to put that content in here, too. These are his communication mechanisms. They're not mine, but if you have any questions, feel free to ask him about his content. He's quite active, and he'll get back to you quite quickly. So what are we going to cover? We're going to talk about databases. Databases in general in bioinformatics is more of a refresher on how databases are organized and structured, and then we'll specifically have a chat about cancer databases. We're going to switch to someone who knows IGV better than I do to talk about visualization. We'll also talk a little bit about visualization within ICGC, the International Cancer Genome Consortium, and then we're going to conclude with logging into AWS and getting you set up for the workshop. So why do we have bioinformatics? Anybody want to wager a guess? I mean, it's not an old field, per se. It's definitely a new field. Certainly when I was studying in undergraduate courses, bioinformatics wasn't even a course. It wasn't even a topic in a course; it was pretty much nothing. And now here I am working in an informatics department which is 130 people big, while the lab group is only 30 people big.
So there's been a complete reversal, and that's probably because of all the genomics and proteomics technologies out there just pumping data onto the internet, and all that data we have access to. So it is a young field, but it did start in the late 60s or early 70s with the publication of the Atlas of Protein Sequence and Structure by Margaret Dayhoff. She just wanted to bring protein sequences together, and she published them in a paper book. Certainly that doesn't seem feasible when you talk about genomes, but very quickly it became evident that being able to search across those sequences, protein or genomic sequences as the case is now, and do comparisons and ask questions of that sequence would be of very significant value. And BLAST came about because of that need to ask questions of sequence, and GenBank obviously came about as a place to store that sequence in something other than paper format. Francis thinks of bioinformatics the same way that wet lab people think about their wet lab experiments, and you can do experiments in bioinformatics. That's the point of having all this data. And it works the same way. You have reagents: he calls the sequence information, the databases where you're getting this information, your reagents. You are going to do some experiment; you're going to follow some protocol. You're going to align your protein, align your nucleotide to a protein, et cetera. So you're doing some search against that sequence, that's your protocol, and you get some alignment on which you need to make an interpretation. So in Francis's sphere, this is a bioinformatics experiment. You still have to do all the same things. You're going to have your reagents, you're going to have your methods, and you're going to have your controls, and you need to ask the appropriate questions and design your experiment in an appropriate manner. Certainly this is really relevant in the cancer space. You find a variant. Is your variant normal? Is your variant a real variant?
Is your variant contributing to cancer progression and development? We should be doing experiments with bioinformatics. Francis is a great proponent of think-pair-share, so we're going to take a moment to do one. His request is: how do you define bioinformatics/computational biology? Look to your neighbor, look to the person behind you, and we'll take one minute and think-pair-share what your definition of bioinformatics/computational biology might be. Please. Okay, folks. Anybody want to shout out their definition of bioinformatics/computational biology? Anybody? Anybody? Yes. So we were saying how bioinformatics is basically anything where you're looking at lots of data that's biological in nature, whereas computational biology is more when you're applying algorithms and trying to sort that data. Sure. I think there's no one definition of bioinformatics. Anybody else? When I'm teaching in high school classes, I break down the word: "bio," it's got to do with biology; "informatics," lots of information; and bioinformatics is like the automation of something. There's no one definition. The one offered by Francis is that bioinformatics is about integrating biological themes together with the help of computer tools and biological databases, and gaining new knowledge about the system under study. I think the key is the last part: the objective is obtaining new knowledge. There is a reason we're doing bioinformatics, particularly in cancer research, and that is to gain new knowledge about the tumor under investigation. You're going to see these symbols a whole bunch in the bioinformatics space. Bioinformatics is hugely based upon openness: open access, open source, open data, open training materials. You can't do bioinformatics without that openness; it would be ridiculous to develop an algorithm such as BLAST if there were no data to run BLAST against.
So the whole crux of bioinformatics is really dependent upon everything being open. And so I would encourage you, if your labs are not already doing it, to partake in that openness. The newest thing coming up is open publication. Who's published in bioRxiv or in another open access fashion? Yeah, a few hands. So this is coming, this sort of open publication, such that you don't have to log in and wait for a publication to become available, et cetera. It's frustrating when you're in PubMed and you hit a journal that doesn't have that, right? Anyways, that's a little bit of a tangent. Okay, so going back to our bioinformatics experiment, the reagent being the place where all of that sequence information is located: databases. A database is an organized array of information. It's where we're putting all of that sequence data. And if it's placed in there well, you should be able to get it back out. There are instances where you can't get it back out, but most people won't be using those databases. The bonus of a database is the overarching layer that lets you query the database and view the data. So if we look at the structure of a database, you have your data and its inherent metadata. Anybody want to wager a guess, without looking at your notes, what that metadata might be? What might metadata refer to? The structure and framework in which data is represented. Yeah, and the tags on those various kinds of data. So date stamps, the little details around a particular bit of data: this variant was first identified on such and such a date. Sometimes this metadata is actually quite useful and it needs to be cross-matched against other databases. So the metadata is used to link those things together.
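The layered picture described here, data plus its metadata sitting in a back-end system with a query layer on top, can be sketched with a toy example. This is purely illustrative: SQLite stands in for the MySQL/Oracle back end, and the record and accession are invented, not from any real database.

```python
import sqlite3

# A toy "back end" (the bookshelf): SQLite standing in for MySQL/Oracle.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sequences (
        accession  TEXT,  -- identifier for the data
        sequence   TEXT,  -- the data itself
        organism   TEXT,  -- metadata: a tag describing the data
        date_added TEXT   -- metadata: a date stamp
    )
""")
# One made-up record with its metadata.
conn.execute(
    "INSERT INTO sequences VALUES (?, ?, ?, ?)",
    ("XX_000001", "ATGGCG", "Homo sapiens", "2016-05-16"),
)

# The "query layer": ask a question of the data via its metadata.
rows = conn.execute(
    "SELECT accession FROM sequences WHERE organism = ?", ("Homo sapiens",)
).fetchall()
print(rows)
```

The point is only the separation of concerns: the storage system is invisible to the user, while the query layer (here, SQL; in practice, a web interface or the command line) is what you actually interact with.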
Then you have your data, and this is probably what we're most familiar with. Particularly if you've gone to NCBI and GenBank, you know the file, or if you've gone to COSMIC, you know the COSMIC record. By analogy with a library, a record would be the title of a book. All this data and its inherent metadata is stored within a system, and the system is not visible to us, but it is the back end of the database. It is the MySQL, it is the Oracle; in the analogy, it's the bookshelf. It is the thing that houses all of this data and allows the next level to make it queryable, so that you can ask questions of this data. You're going to be doing a lot of this in the workshop. We're going to be doing grepping and searches from the command line, ways you can interact with this database and its query capability. For non-command-line people, the next bit is probably the most important: the presentation layer of the system, the interface between you and the database through which you can ask your question if you're not asking it from a command-line point of view. This is probably what most of us are familiar with. Any questions about the structure of databases? What does NCBI have to offer? It's a way to submit that data, learn about that data, analyze that data, download, develop, do research, et cetera, with that data. It's sort of a one-stop shop for all of the available data internationally. I'm talking about NCBI, but if you're coming from Europe or from Asia and you're more familiar with EBI or DDBJ, those are the equivalent spaces to NCBI in North America. This representation here shows that all three of those databases share their information every 24 hours. So whether you're looking through EBI or through DDBJ, all of that information is being shared in the open fashion that bioinformatics encourages, such that everybody is talking about the same thing.
We need to be talking about the same thing. You can't do research if you're not talking about the same thing. There's now going to be a chunk of slides related to NCBI and how NCBI is structured, its files, et cetera. This is because Francis worked at NCBI, he worked in GenBank, and this is where his specialty is from. So I'm going to do my best to do it justice. NCBI is so much more than just GenBank. If I had kids doing science fair projects, I'd be sending them here, because you can look at books, chemicals... it's a great source of lots of information: literature, health, genomics, proteins, chemicals. And for any of those things, there's an associated database housed within NCBI. But there are equivalent databases in EBI and DDBJ, so it is not specific to NCBI. All the information is shared out. Some places have particular strengths; EBI maybe has a bit more strength in the chemical space than NCBI, but they're still all sharing their information. This is the landing page for NCBI, if you haven't been there recently. And the formats used by NCBI: actually, if I'm going to stress anything you're going to learn from the Cancer Genomics Workshop, it's that you're going to learn a lot of file format conversion. Getting from one file format to the next file format to the next, and finally ending up in the file format that makes sense to you, such that you can carry forward with your research. Certainly in the sequence space, we're starting with FASTA files. Everybody is familiar with the FASTA file, I'm quite sure. From a sequencing point of view, there is the FASTQ file, and the SAM/BAM file, which is the alignment file. So we're actually going to work through the FASTQ/SAM/BAM files over the next couple of days, getting to your VCF files and then looking at functional annotation on those files. So there's a hierarchy of file formats, and I might alter my bioinformatics definition and say that it is a lot of file format conversion.
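The starting point of that hierarchy, the FASTA file, is simple enough to parse by hand. Here's a minimal sketch; the records and identifiers are invented for illustration, and a real pipeline would use an established parser rather than this.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into (header, sequence) pairs."""
    records = []
    header, seq_lines = None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            # A new header line closes out the previous record, if any.
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line[1:], []
        else:
            # Sequence may be wrapped over several lines; join them.
            seq_lines.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq_lines)))
    return records

fasta = """>seq1 example record
ATGGCGTACGT
TTGACA
>seq2 another record
CCCGGG"""
for header, seq in parse_fasta(fasta):
    print(header, len(seq))
```

Each downstream format in the hierarchy (FASTQ, SAM/BAM, VCF) adds layers on top of this basic idea: quality scores, alignments, and then called variants.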
But that's what I sometimes think bioinformatics is all about. Certainly there are layers of databases. From a primary research perspective, you may only be working in the archival databases, the databases that pretty much just store all the data, with the associated metadata, their annotations. I think that a more valuable space, certainly when you're trying to make biological interpretations from your data, is the secondary data, the highly curated data, which has human oversight on it. That is, somebody has gone in and said that these are all the reference sequences and this is the correct annotation for this particular gene. The secondary databases are really, really, really valuable. Every year, I don't know if you're aware, there are actually two publications that come out in Nucleic Acids Research that are worth at least browsing the titles of. One is the database issue, which comes out in January, and the other, coming out in July, so shortly, is the web tools issue. It's just a really neat way to pay attention to what new data has come out and what new tools to compute on that data are coming out within a year. I used to manage the integration of these tools into the links directory, and it's just really interesting, from a title-and-abstract perspective, to see what new tools are needed or necessary and what new databases are coming out. So the January 2016 database issue is obviously already out, and upcoming in July 2016 will be the web tools issue. Okay, that was a tangent. The tangent was probably there because Francis is an editor of the database issue. I didn't make these slides or order them, so I have to go with whatever the next slide is. So bear with me.
So we're going back now to GenBank, and we're talking about the file formats. We were saying that one of the biggest things you're going to do in bioinformatics is converting file formats. The flat file is the initial file created within GenBank. I don't know if you've spent any time looking through this file; you probably scroll right down to the end, or you just say, show me, output the file in FASTA. But there's a lot of information. It's divided into three parts: the header, the features, and the sequence. Mostly you're scrolling down to the sequence, but there's a lot of information within this file. It's the human-readable version of submitted data. Accession numbers: each file within GenBank has an associated accession number, and it's actually quite useful to know what the accession number is, because you can gain a lot of information from the number itself. The format within GenBank differs depending on the sequence input. You could have a 1+5 format, meaning one letter and five numbers, or a 2+6 format; in the case of whole genome sequence, you would have 4+2+6. Proteins and RefSeq files have similar but different formats. And everything is versioned. So if a single nucleotide changes from the original file, you'll get the same accession number, but a different version. When you do a BLAST search and it pulls up all of these particular hits, you want to make sure that you're actually on the most current version. So if there's more than one version, you're going to choose version 3 instead of version 1, unless you were pre-computing on version 1. Just pay attention to the versions. The other useful bit of information in accession numbers is the coding up front: NC_, NG_, NM_, NR_. It'll tell you what the corresponding sequence is. So NM_ is a messenger RNA sequence file, and NR_ is a non-coding RNA file. And this differs.
So you can very quickly see what's what: if you have a whole list of hits and you're only looking for RNA, you're going to click on the NR_ file, as opposed to the NC_ file. NCBI is running out of numbers, so they're actually expanding their accession numbers. I don't know what they're going to do when they run out of numerical digits; maybe they'll increase the letter digits as well. I have no idea. Interesting that we're running out of unique numbers. Assemblies within NCBI also have interesting annotation models. As I was saying, you can read a lot from these accession numbers; the model (computationally predicted) record files have an X in front of them, so that's a good indication that it's a model file. Whole genome sequence accessions are special cases. I'm not sure what Francis meant to say here, other than pay attention if it has an NZ_ prefix. Okay, this is the FASTA format, which you should all be familiar with; you've probably seen it quite a few times. This one is quite a good FASTA file because it actually has the information up here in the header: the accession number and the reference. So we know that this is an NP_ record, so it's in the protein space, with the accession number and the version number, and then some descriptor of the sequence. So it's p53, isoform 9, and the organism that it's from. Sometimes that information is missing and it's just an arrow; then it's really hard to tell what the sequence is. NCBI publishes regularly on updates to its resources, as does EBI. I haven't actually seen a DDBJ publication recently, but DDBJ similarly provides updates on all of its resources. From a biological perspective, I think as Trevor also discussed, having a visual presentation of that data, and overlapping that data with other data in the same space, is actually really valuable.
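The prefix-and-version logic described above is mechanical enough to sketch in code. The prefix meanings below follow what's said in the talk (NM_ for mRNA, NR_ for non-coding RNA, X-prefixed for model records); the list is partial, and the example accession is invented, not a real record.

```python
# Partial list of RefSeq accession prefixes, as discussed in the talk.
REFSEQ_PREFIXES = {
    "NC_": "genomic (chromosome)",
    "NG_": "genomic region",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "model (predicted) mRNA",
    "XP_": "model (predicted) protein",
}

def describe_accession(acc):
    """Split a versioned accession like 'NP_000001.3' into
    (base accession, version, molecule type)."""
    base, _, version = acc.partition(".")
    kind = REFSEQ_PREFIXES.get(base[:3], "unknown")
    return base, version, kind

# Hypothetical accession, just to show the mechanics.
print(describe_accession("NP_000001.3"))
```

This is also why version numbers matter when comparing BLAST hits: two results with the same base accession but different versions are different sequences.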
So most of you probably spend your time in the UCSC Genome Browser. Yes? No? What is your favorite genome browser? Anybody? Anybody other than UCSC? Sorry? IGV. Oh, great. Look, when we get to IGV, you can come on up front. Yeah, so we're going to spend the rest of the week actually using IGV, but at some point you cross back to the database browsers: UCSC, Ensembl, the NCBI Map Viewer. I'm not a big fan of the NCBI Map Viewer; I understand UCSC a little bit better. A new browser coming out, or already out, in the cancer space is the ICGC browser. So the International Cancer Genome Consortium database, which houses all the tumor data for the ICGC project, has a browser, and it's modeled after the UCSC Genome Browser, so it has those layers and those tracks. And that's a really useful space to see where your variants are and what your variants are doing. This is an example of TP53 and all the various tracks for TP53 mutation in that browser. Okay, the Genome Reference Consortium, GRC. This is the human reference. These guys are constantly updating the annotation of the human reference genome. So we're on version GRCh37, but GRCh38 is already available, and at some point we'll switch over. Or maybe you have already... is anybody computing on GRCh38 already? Possibly. Yeah, most are still computing on GRCh37, but the next version is available. So a release of the human genome happens, and then all the tools and browsers, et cetera, spend some time catching up and re-computing before everybody moves over to the next annotation. It's iterative, and it's well documented what changes in each of the human versions. Oops, sorry, going the wrong way. Historically, with human genome data, we didn't start with the whole genome; obviously we started with expressed sequence tag (EST) sequences. We moved to mapping and GWAS studies, et cetera.
Then we did eventually come up with the full human genome, in parallel with all this high-throughput sequencing technology coming out. You could start to think about doing tumor projects. So TCGA had a pilot project; the 1000 Genomes Project was happening. The pilot project for TCGA was successful and they expanded out. And then the ICGC project came on board. The ICGC project started October 2, 2007. That was my first day at work, so it was a really interesting day, because ICGC was just getting announced. In a moment we'll talk about the differences between TCGA and ICGC; there are similarities and differences. But the cancer space didn't start with these big international projects. It obviously started with much smaller scale projects, looking at 18,000 genes, or looking at 518 genes but across more tumors, and then moving into sequencing data and doing full-scale tumor sequencing. From these smaller projects, a couple of things proved critical to the current international projects. We learned that there is a lot of heterogeneity within a tumor and across tumors, that there's a high rate of abnormalities, and that there need to be standard approaches to laboratory protocols if you're going to start to combine all of this data. So those were some of the lessons learned from those earlier projects that fed into the bigger international projects, such that they got off, I think, on a much better footing than would otherwise have been possible. So the International Cancer Genome Consortium, as I said, started my first day of work. Its goal was to collect 500 tumor/normal pairs from each of 50 different tumor types. That works out to 25,000 pairs; it's quite huge, an enormous undertaking. And the intent was to be comprehensive on that sequence: to look at the genome, the transcriptome, the methylome, and incorporate the clinical data.
And underscoring all of that was to make it available internationally. If it's an international project and Japan is contributing and Spain is contributing, then they're each contributing their part, but they want to compute across everything. So that set up the rationale: pool our resources and do it all together, as opposed to doing it individually. It makes for a better end result. You're not just doing 10 pancreatic tumors; you might have 500. And if everything is done in a standard, uniform format, then you can be more confident that the sequence that was done in Spain is comparable to the sequence that was done in Vancouver, et cetera. So this is the rationale for the ICGC. And these are the kinds of things that were put into the ICGC project: what tumor are you looking at, information about the sample collection, information about how the sample was extracted, information about the sequencing, about the analysis of that sequence, and then information about the interpretation. We'll go through some of those. As of December 2015, there are 85 projects across 18 countries; I'm going to say jurisdictions, because some countries have two locations. And we're covering 42 of the 50 cancer types that were set out. Currently, there are 15,000 donors. Much of our research couldn't be done without those donors, so those donors are to be applauded for their participation in the study. ICGC undergoes data releases twice a year, and we just had a release, with another release coming up. This is the growth of the data within the ICGC data space. Here's the URL for ICGC, highlighting all of its current projects. Canada participates in at least two. Heather, maybe correct me if I'm wrong: I know we do pancreatic and we do breast. I'm not sure if we do any other projects from the Canadian space, but you can look that up. Prostate as well? Okay. So there's that.
Within the ICGC website, you can navigate to any of the parts of ICGC. You can navigate to the data portal; to DACO, which is the Data Access Compliance Office, where you can get information and log in; and to the DCC, the Data Coordination Center. That DCC is actually situated here in this building: OICR is the host for the Data Coordination Center, and the teams upstairs have put a lot of work into development of the portal. And this is their portal. Yes, so we just had a release on May 16, that's how recently there was a release: release 21. So the data portal: there is a lab exercise, and we'll move into that this afternoon, but it's a really, really instructive data portal. You can keyword search. You can also do faceted searching. So you can select what tumor type you're interested in, you can select the project by country, you can select the particular variants you might be interested in, or the tumor type; sorry, this would be organ type, I guess. This is a display of the somatic mutation rate across selected projects. Projects over on the left-hand side have a lower frequency of mutations and projects on the right-hand side have a higher frequency of mutations. It's just a really interesting way to see how different tumors have different mutation rates. They always display on the first page the top 20 most mutated genes. Of course, TP53 is the first one there, but it's sometimes interesting to look at some of the other mutations, and you can identify whether a gene is highly mutated in the tumor of interest to you. The way ICGC... I'm just going to omit the word ICGC and just say the DCC. The way the DCC is organized is by entities. So you can look at projects, which is like what is Spain doing or what is Canada doing. In this case, we're looking at breast cancer from TCGA. It also has gene entity pages, so you could look at a particular gene of interest.
So say you're doing a study on CML and you're interested in the BRCA mutation, whatever you might be interested in. It has pathway entity pages, so similar to what Trevor was showing, you can see where all of your somatic mutations might fall in a particular pathway. It has mutation entity pages, so if you want to look at one particular variant, you can do that. And of course, as I was saying, it also has the genomic viewing page, so you can go right down to the sequence level, the gene mutation level, and see how your data renders from a visual perspective. There are lots of other ways to explore data within the DCC. As I was saying before, there are the entity pages, so you have donors, genes, mutations. But there's also an advanced search option; this is the icon for advanced search. You can drill down to a particular donor and find all of the information on that donor. It's really interesting to also see the statistics and other information for your donor. In the facets over here on the left, you can select, say, all males between the ages of 35 and 50, non-smokers who have lung cancer, from China. You can drill down in that way to ask very interesting questions. I think it's quite useful. The other icon is data analysis. You can do deeper analyses such as enrichment analysis and looking at phenotypes. The enrichment analysis is done, but some of the other deeper analyses are just coming on board. It is a data repository, so that means that if you have done the search you are interested in and you've come up with all of the donors associated with that search, you can output your data. So you can output all of the data that you would be interested in. This is his schematic of output: you select what searches you had done, you can download and output the data file right to your desktop, and then, of course, you can view it, evaluate it from the command line, et cetera.
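Once an export like that is sitting on your desktop, the command-line evaluation mentioned above might look something like this. To be clear, the file contents and column names here are invented for illustration; the real DCC export format has its own columns, so treat this only as a sketch of the pattern (read a tab-separated file, tally something of interest).

```python
import csv
import io
from collections import Counter

# Stand-in for a downloaded tab-separated mutation export.
# These columns (gene, mutation, donor_id) are hypothetical.
tsv = """gene\tmutation\tdonor_id
TP53\tC>T\tDO001
TP53\tG>A\tDO002
KRAS\tG>T\tDO003
"""

reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
counts = Counter(row["gene"] for row in reader)

# Tally how often each gene appears across donors in the export.
for gene, n in counts.most_common():
    print(gene, n)
```

With a real file you would pass `open("export.tsv")` instead of the `io.StringIO` stand-in; the same few lines replace a chain of `cut | sort | uniq -c` on the command line.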
So we'll leave this for another moment in time; later in the week, you're going to come to all of that. Okay, so ICGC versus TCGA. They're the same, but they're different. ICGC houses the TCGA open data, but beyond the open data part, it actually splits. The TCGA BAM/FASTQ files are stored with TCGA, and all the rest of the international participants have their BAM/FASTQ files in ICGC and within EGA. From a Venn diagram point of view, this is what it looks like: ICGC does contain the TCGA open data, but beyond the open data, there are differences. The differences come, and I think it's in a couple of slides, from the definition of what is open. The U.S. has a different definition of what is considered open versus what the rest of the international community has agreed would be open data within the cancer space. So you just want to pay attention to those differences and understand them, and if you need the data within TCGA, then you have to go to TCGA to drill down to get that data. So there are open access data sets within ICGC and controlled access data sets within ICGC. The DCC portal that I just demonstrated contains open data; this is free for anybody to use. What you don't get is the controlled data; you need privileged access to that. You can get privileged access, and we'll talk about that in a minute, but there are things that are not available in the open portal. This is what is on the right side here; this is what is under controlled access, and it's under controlled access because it's identifiable data. So the ICGC open data contains somatic variants from exome or whole genome sequencing.
In comparison, TCGA and the US, which define open data differently, keep somatic variants from scrubbed exome sequencing in the portal, but most of the somatic variants from whole genome sequencing they consider to be identifiable data, and these are under controlled access. So you have to be aware that in the DCC portal you're not going to see the somatic variants from TCGA data if they're from whole genome sequencing. Okay, so as I was saying, the biggest difference is how they define open access. At the NIH, simple somatic mutations from exome sequencing experiments are open access, but from whole genome sequencing they are controlled access, and you won't find them in the ICGC data portal. So if you want that data, you have to go to the TCGA portal. Okay, is that clear? Okay. There is, though, an agreement on who counts as a controlled access person. The ICGC and TCGA have agreed that you have to keep all of the computer systems where the controlled access data resides up to date with security protocols and software patches, et cetera, so that they cannot be hacked. You must protect that controlled access data and only give access to people who have been authorized to gain access to it, and this authorization and access must be monitored. The groups also agree that when your access is up, all of the data is destroyed, and that you use secure transfer protocols when downloading or sharing that data, as well as encrypting it. Any of the variants discovered within the ICGC and TCGA projects are still put into COSMIC without the identifying information, so you can still find those novel variants within COSMIC. COSMIC being, of course... everybody's heard of COSMIC, right? The Catalogue Of Somatic Mutations In Cancer. If not, we're going to be using it quite a lot in the workshop, and you can download all of the COSMIC data from the COSMIC website if you need to.
So, underlying ICGC, how do you get controlled access to all of this data? There's a whole bunch of different groups within ICGC doing various things. There are project groups and there are working groups. For example, a working group might develop the protocol for sequencing on an Illumina machine: how do you do DNA extraction, how do you make your library, what fragment size do you use, how do you sequence? So everybody's following the same protocols. But the group that we're going to talk about is the DACO group, the Data Access Compliance Office. If you would like access to the controlled access data, this is what you need to do. You need to identify yourself and fill out the form. It's a whole bunch of paperwork that covers the information you want access to, the IT you have for keeping the data secure, et cetera. And you must agree to the rules of accessing controlled data. You sign it, your institution signs it, and you send it back. The person who signs it from your institution, though, is a person who can fire you should you break any of those rules. So it's under that person's authority: if it's reported that you did something unapproved with that information, like sharing it with a colleague who didn't have DACO access, then the DACO office can notify your institution. The regulations are there for a reason. That being said, a huge number of people have access to the data, because it's really useful to be able to compute on this data. So this is the login page if you have access. This is Francis logging in, filling out his information as if he were applying for DACO controlled access. And currently, if you did not have controlled access and you logged in, this is what you would see: you would have nothing; everything is red. If you have DACO approval, then everything turns green, and this gives an indication of the kinds of things you can gain access to.
The application is really not very long, and it's available online. Once you have approval, you can log in with OpenID, Google ID, et cetera, so it's linked in with other ID systems. Login is at the top of the portal, and you'll see a different version of the portal then. All of the data within the portal undergoes a publication moratorium, and it differs depending on when the data was submitted and how much of that data has been submitted, but you should be aware of the moratorium on publication. The reason there is a moratorium is that the group who generated the data, like Canada, let's say, doing pancreatic data and submitting it to the ICGC, has first rights to publish on that pancreatic data before a group in Italy that wanted to publish on pancreatic data. It seems scientifically fair that that kind of moratorium exists. It differs depending on the data submitted. So the yellow bars are the submission time points. You have two years before the moratorium is removed if you're submitting a small amount of data. If you're submitting large amounts of data, like 100 submissions, your time point is actually shorter. If you do one submission and then a big bulk submission, you get one year after that, so it falls back to this model here. If you do more than one and then you do 100, well, you've already reached 100, but at the two-year point from this submission, your time is up and the moratorium is released. So there is a moratorium, and it depends on the data, but all of that is clearly marked within the controlled data. And if you want more information on the publication guidelines, those, of course, are available. ICGC puts its raw data within EBI's EGA. So if you're looking for the raw data, the raw controlled data is within EGA. TCGA stores its data in the Cancer Genomics Hub, CGHub, I believe.
And that's where you would gain access to the data. So again, to reiterate: from the ICGC website you can get open data; if you're looking for controlled-access data, that data is behind a login and also stored within EGA. There is lots and lots of documentation within ICGC on all of these bits, and the URLs for everything that I have talked about. Something that we have not talked about, which is just as important, is a study that is just wrapping up right now. Actually, I suspect that it will conclude, and be announced, while you are in this workshop. It's a really critical evaluation. Some earlier publications have already come out from TCGA on a pan-cancer analysis, but this is a pan-cancer analysis now using all of the ICGC data. I believe they did 2,000 tumors, or thereabouts. What they've done is they selected tumor-normal pairs from across a whole bunch of different tumor types within the ICGC, and they are computing them all in the same way. So they're doing the same alignment, the same variant-calling tools, the same quality-control filters, et cetera, so all of the data has been analyzed in similar pipelines. In fact, there were three pipelines; the calls from those pipelines were merged, and the overlap between the pipelines, all of that data, was validated. So it's a huge amount of effort. I think it's a couple of years that they've been working on this, but it should be finishing this week and becoming available. As well, there is the International Cancer Genome Consortium Collaboratory in the cloud. It's one thing to have all of this ICGC data available, but if you can't compute on it all in one space, what do you do? You don't want to download all of that to your laptop, and you don't have the resources anyway. So the Collaboratory is a place to go where you can compute, and all of the data from ICGC is sitting there. So that's what ICGC is about. That is the end of my part, and we're going to switch to IGV at this point in time.
But before we switch, are there any questions on the information that I presented on behalf of Francis? You're all familiar, you knew all that anyway, right? Yes. I think it's really interesting stuff. There are two lab exercises that we will endeavor to do after the lunch break. One is on IGV; that's the one we're going to start with, because you're going to spend the rest of the week doing IGV. But there is also an exercise on ICGC and learning to do some analyses with it. It's a really, really cool portal. Especially if you're like myself: I have a biological background and the command line doesn't come as naturally, so being able to drill down using facets is really, really helpful. Any questions? So this pan-cancer project that you mentioned, that's ICGC? ICGC, yeah. Yes, that's correct. Also, the slide said 2,800 whole genomes were also gathered. Yes, thank you. Is that an addition, you're suggesting? No. So 2,800 tumor-normal pairs from the submitted data within ICGC were selected to participate in this pan-cancer whole-genome analysis. So a whole bunch of tumor-normal whole-genome sequences from various projects within ICGC were selected. Basically it's a subset of ICGC? Yes. Could you restate that? Very good question. I believe the pan-cancer analysis data is going to be unrestricted. I think it's going to be open, but I'm going to double-check: I'm going to go right upstairs and ask the person in charge of the project. I think it's going to be open data, and what I'm actually going to ask her is whether any TCGA whole genomes were in that 2,800, because I suspect that they weren't, in order to allow the somatic calls to be open. But I'm going to double-check that and I'll come back. Good question. Anything else?
Okay, let's move on. We'll do the lecture for IGV, then we'll break for lunch, and then we'll come back after lunch and do the lab exercises, okay? Okay. Hi everyone. So I'm France; I'm covering these slides as well, but hopefully this will go smoothly. I should note that most of the examples on the slides are from the IGV tutorial from the Broad. IGV actually has a very good tutorial if you want to refer to it. So I will go through a few basics of IGV to give you an idea of what we can do with it, and the tutorial will be this afternoon so you can practice. First, IGV is a genome browser. There are actually quite a few different ones you can use; this one is heavily used by the community, particularly the cancer research community. You can load different types of data, as you can see listed here. Actually, Trevor already showed a screenshot of IGV in his lecture this morning. It can be epigenomic data, microarray data, high-throughput sequencing data, expression data, copy-number data; those are all different formats that will be accepted, and you'll be able to visualize them. What can we do with IGV? We can explore large genomic data sets through an interface that is intuitive and easy to use. You can integrate different kinds of data as well, some clinical data and some genomic data, for example. You can use your own data, so have it on your computer or your server and load it. You can use data that is accessed remotely, or even on the cloud. This is a good feature, because you might want to see some TCGA data but you don't want to download everything locally; you can access it remotely. You can also automate some tasks. If you want to take screenshots of a particular set of new variants you've called, you can actually write a script to make this automatic.
It will go to the right location, take a screenshot, and save it for you. This is quite an advanced usage, but it's possible. As for the data sources you can use: as I just said, local files, data from an HTTP server, data from the cloud, and other data repositories, so local or remote. The basics of IGV: first you launch it, you select a reference genome, you load the data you want to look at, and then you navigate through it. For whole-genome sequencing data, we can look at SNVs or we can look at structural variants, as examples. That's the standard thing you would do with IGV. If you go to the IGV webpage, here you can see the way you would go through to access it. You can launch it directly from the website, or you can register and download the IGV application to run it locally on your laptop. This is the type of screen you're going to get when you open IGV. First, you select the genome you want to work with, hg18 or hg19; now it's mainly hg19, but you still have the option. Then you load your data, either a file you have on your laptop or a file from a URL. You can also load some public data; here, as an example, the tutorial uses a public data set. This is the screen layout, the basic default one. You have different parts on this page. At the top you have the menu, as in a standard application. You have the toolbar, which you can play with, for example to go back to the initial setting, or to turn off the pop-up information that can become quite annoying, so that's probably something you're going to turn off by default; we're going to see that in the tutorial. You have the genome ruler track. Here you can see we're actually looking at the whole human genome: chromosomes 1 through 22, plus X and Y. Then we're going to be able to zoom in. Every single row below is a track, so every sample corresponds to a track.
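The screenshot automation mentioned above uses IGV's batch-script feature (commands like `new`, `genome`, `load`, `goto`, `snapshot`). Here is a minimal sketch that generates such a script in Python; the BAM path, snapshot directory, and variant positions are hypothetical placeholders, and the `igv_batch` helper is mine, not part of IGV:

```python
# Generate an IGV batch script that jumps to each variant and saves a snapshot.
# Paths and positions below are hypothetical placeholders for illustration.
variants = [("chr1", 115258747), ("chr17", 7577120)]

def igv_batch(bam_path, genome, out_dir, variants, window=100):
    """Return the text of an IGV batch script for a list of (chrom, pos)."""
    lines = [
        "new",                              # reset the session
        f"genome {genome}",                 # e.g. hg19
        f"load {bam_path}",                 # alignment track
        f"snapshotDirectory {out_dir}",     # where PNGs are written
    ]
    for chrom, pos in variants:
        lines.append(f"goto {chrom}:{pos - window}-{pos + window}")
        lines.append("sort base")           # pile alternate bases at the top
        lines.append(f"snapshot {chrom}_{pos}.png")
    return "\n".join(lines)

script = igv_batch("/data/sample.bam", "hg19", "/data/snapshots", variants)
print(script)
```

You would write `script` to a file and run it with IGV's batch mode (for example `igv.sh -b variants.bat`), and IGV walks through each position and saves a snapshot, exactly the kind of automation described in the lecture.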
Since we've zoomed completely out, you don't see much; you just see coverage. You have the track name on the left, and here you have the attribute column information. You can load some clinical data; for example, you can specify the sex of the patient, whether it's a metastasis or not, or a classification of the type of tumor you're looking at. This can be color-coded here, and you can sort the tracks by those attributes as well; that helps you organize the tracks in a particular manner. At the bottom is the genome feature track. This is where the gene description is; if you zoom in, you will see the name of the gene and then the locations of the exons, the introns, and so on. What file formats can you use? Basically, the format of your input is going to define how your track looks. You don't have to give the settings, because IGV recognizes that it's, say, a BAM file, and it shows what you want to see for a BAM file. You can sort tracks in different ways. These are examples of the file formats IGV is able to handle; actually there are more than that, and there is a list under the link at the bottom. The classic one for sequencing alignments is the BAM file; we use it most of the time, but we use all of them. Say we want to view some alignments. That's the whole-genome view, which is what you start with, and then you don't see much, and there is actually an indication that you need to zoom in to see the alignments. You would do that by typing a particular position here, or just zooming with this bar. When you zoom in, how far do you need to go to actually see the alignments? Well, it depends on the BAM file you've loaded; usually it's 30 kb. It depends on what your coverage is, so how many reads you have at a particular location. If your coverage is very high, you would need a lot of memory, so you're going to have to zoom in more to be able to see it.
If you have low coverage, that's easier, but we always prefer higher coverage, so it's just a trade-off. I cannot give you a number; it's just a function of the data you're going to load, but you're going to experience it, and it's not very hard to just look at it. So when we zoom in, we can actually go deeper and deeper, and that's the same kind of view you saw in Trevor's lecture. All the gray bars are the reads that are mapped, and mapped properly, so they are gray. If the color becomes more transparent, even going to white (I don't see an example of a bad read here), it means the quality of the read mapping is poor, so you wouldn't trust it. If you have a kind of background gray, it's good. And all the colored bars you can see, like the blue ones, are the mismatches compared to the reference. So if you have a column of mismatches like here, pretty consistent at the same position, it's likely to be a SNP or an SNV, depending on your data. So you can visually spot what might be more interesting in your data, and then you can zoom in further and see more; we're going to do that this afternoon in the tutorial. So, SNVs and structural variants: that's what we can mainly look at when you have whole-genome sequencing. What are the important metrics we want to look at to evaluate whether it's an SNV? First, the coverage; then how often the alternate base is actually present, the amount of support. Next, artifacts: do all the reads that carry this mutation sit on the same strand, or are they equally distributed between the strands? If they are all on the same strand, that's not a good sign; it probably means it's an artifact. The base quality as well: the stronger the color, the better the quality. So, as you will see on the next slide, if it's a dark red, that means the quality was very good on that base.
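The metrics just listed (depth, alternate-allele fraction, strand balance) can be sketched as a tiny numeric check. This is a rough illustration of the same judgment you make by eye in IGV; the function name, counts, and thresholds are all made up for the example and do not come from any variant caller:

```python
# Judge a candidate SNV from simple pileup counts, mirroring the visual
# checks described above: depth, alternate-allele fraction, strand balance.
# All thresholds here are illustrative assumptions, not caller defaults.

def assess_snv(ref_fwd, ref_rev, alt_fwd, alt_rev,
               min_depth=10, min_alt_fraction=0.2):
    depth = ref_fwd + ref_rev + alt_fwd + alt_rev
    alt = alt_fwd + alt_rev
    alt_fraction = alt / depth if depth else 0.0
    # Alt reads confined to one strand suggest a sequencing artifact.
    one_strand_only = alt > 0 and (alt_fwd == 0 or alt_rev == 0)
    ok = (depth >= min_depth
          and alt_fraction >= min_alt_fraction
          and not one_strand_only)
    return {"depth": depth, "alt_fraction": round(alt_fraction, 2),
            "one_strand_only": one_strand_only, "plausible": ok}

# A roughly 60/40 site with alt reads on both strands looks heterozygous:
print(assess_snv(ref_fwd=12, ref_rev=12, alt_fwd=8, alt_rev=8))
# The same alt fraction, but confined to one strand, looks like an artifact:
print(assess_snv(ref_fwd=12, ref_rev=12, alt_fwd=16, alt_rev=0))
```

The second call fails only the strand check, which is exactly the situation shown later on the slide where all the alternate C's sit on one strand.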
If it's a light red for the T, that means it was not good quality. So that's what you're going to look at for a particular SNV, to judge for yourself, after you've called it with some tool, whether you trust it or not. In terms of structural variants, we're going to look at the coverage, the inferred insert size (I will give an example after; an unexpected insert size is a good indication that a structural variant happened), and the read-pair orientation; we have some slides showing how this works. So here is an example of a SNP, or an SNV: you can see that there is an alternative to the reference allele, a T, and at the top you have the coverage track (I just want to show it with my mouse here), which gives you the proportion of T to C; you can see it's basically 60 to 40 percent, so that would be a heterozygous SNP, for example. And you can sort all the reads by base, and you would have all the T's at the top. This is an example of another one, and here we colored the reads by strand, so the red ones are the forward strand and the blue ones are the reverse strand. You can see that all the alternative bases, here the C's, are on one particular strand; none of them are on the red reads. So that's not a good indication of a trustworthy SNV. Next, we can look at some structural events. The read pairs can give you a good indication of whether a structural event happened, like a duplication, a deletion, a translocation, or an inversion. For example, we can check the inferred insert size and the pair orientation; those are the two things you can use to detect a structural event visually. So, when you do paired-end sequencing, you take your DNA, you fragment it, and then you size-select, usually at around 350 base pairs; of course there is a distribution around that size, so it's around 350 base pairs with some tails. Then you add some adapters, and then you
will start sequencing following the two arrows. Those are the paired reads, the standard paired-end sequencing reads you would get. So you're expecting a particular insert size when the experiment goes well and when you don't have any structural variant; quite often it's 350 base pairs, though it depends on your experiment, and you will know it when you have the data, but that's a standard insert. So when you align your paired reads, you can actually infer the insert size and check how far apart they are. If they are a lot further apart, or a lot closer, something happened in your genome compared to the reference. So the insert size can be used to detect deletions and insertions. Interchromosomal rearrangements also happen: sometimes one read of a pair maps to one chromosome and the other read to another chromosome, so probably a translocation happened, assuming both map well. That's an indication to help you detect those. So if we take the example of a deletion, what is the effect of a deletion on the inferred insert size? Let's say we have the reference genome, and in your subject you lose the part in the middle, which is in red. The two flanking pieces are going to come together, as you see in the animation, and then you do the normal library prep and sequencing with the adapters; you have a normal insert size in your sample, which would be this. But when you map those two reads back to the reference genome, they land a lot further apart, so the inferred insert size is larger than the expected value, as you can see. In IGV you can actually color your reads by insert size, and it will appear like this: the pairs with a larger-than-expected insert are colored in red. And you can see that in this region, for this particular example, you have fewer reads covering the middle, and it stands out on the right and on the left; you have these red pairs at the extremities, which mark the limits of the deletion.
But you still have reads in the middle, so it's not a homozygous deletion; it's probably a heterozygous deletion, which makes sense. Another explanation would be that it's a tumor sample and not all cells in your sample have the deletion, even if in those cells it's homozygous: you would still have some reads from the cells that don't have it, and none from the ones that do. So that's another way of interpreting the data; it depends on what your sample is. So, the color code: pairs with a smaller-than-expected insert size are blue, the larger ones are red, and when the mates map to different chromosomes, the pair is colored according to the chromosome number. You don't have to remember that, of course, but if you see a multicolored screen, you will know it's because mates are mapping to different chromosomes. And IGV allows you to split your screen in two: if you have some reads here, for example on chromosome 1, and the mates map to chromosome 6, you can use a setting in IGV to have half of your screen looking at the chromosome 1 side of the pairs and the other half at chromosome 6. That's a useful setup to explore those structural events. Here's the color of the mates on chromosome 6: what is brown in one is blue in six. The other thing we can look at is the pair orientation. The orientation of the pairs can reveal duplications, translocations, and complex rearrangements. The orientation is defined in terms of the read strands, left versus right, and the order of the first read versus the second. Let's go through an example. If we want to look at an inversion, you have the basic reference genome and the section we're looking at, and in your sample you have an inversion where the segment A-to-B becomes B-to-A. So what happens when you sequence it? You will have a fragment around B with your pair oriented like this, as a pair would look, but when you map it to the reference genome, the left part of the pair will map normally before A, but the
right part of the pair will map towards B in the other orientation, which is not what you were expecting. And if we look at a read pair that would be close to A in your sample, the sequencing would happen like this: the right part of the pair maps normally to your reference genome, close to B, on the right, and the left part of the read pair maps in the wrong orientation, close to A. So basically you end up with reads in the same orientation, like this or like this, and further apart than you were expecting. That's the typical type of read pair we would look for with an inversion: you were expecting inward-facing reads, and they're actually in the same orientation. The ones that both point in the same direction starting from the left side are called the left-side pairs, and the others are the right-side pairs, and they have a particular color code: left-side pairs are the lighter blue, I believe, and right-side pairs are the darker blue. That's the coloring that IGV is going to use. So in IGV you can color by pair orientation, that's in the menu, and you might get a screen that looks like this, with your left-side pairs and right-side pairs on both sides of the boundary of your inversion. You can notice as well the drop in coverage at the breakpoints of your inversion, on the coverage track here. So that would be an inversion. And that's the convention for the different colors and the different types of read orientation, other than what you would have on a standard, regular genome: you have left-right, which is the standard one; then left-left, both going in the same orientation; the right-right; and the right-left, the last one, the green one, which we looked at for translocations, or actually duplications. So that's the standard read-pair coloring. And that's the acknowledgements slide. Do you have any questions?
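The insert-size and orientation logic above can be sketched as a small classifier. This is a rough sketch of the conventions described in the talk, not any real SV caller: the function name is mine, the expected insert and tolerance are made-up illustrative values, and orientations are encoded as "LR" (normal inward-facing), "LL"/"RR" (same orientation, inversion-like), and "RL" (everted, duplication-like):

```python
# Classify a read pair the way you would interpret IGV's coloring:
# by chromosome, inferred insert size, and relative orientation.
# Thresholds and labels are illustrative assumptions only.

def classify_pair(chrom1, chrom2, insert_size, orientation,
                  expected=350, tolerance=150):
    if chrom1 != chrom2:
        # Mates on different chromosomes: IGV colors by mate chromosome.
        return "interchromosomal (possible translocation)"
    if orientation in ("LL", "RR"):
        # Both reads facing the same way: the inversion signature.
        return "same-orientation pair (possible inversion)"
    if orientation == "RL":
        # Outward-facing (everted) pair: tandem-duplication signature.
        return "everted pair (possible duplication)"
    # Normal inward-facing pair: judge by inferred insert size alone.
    if insert_size > expected + tolerance:
        return "insert larger than expected (possible deletion)"
    if insert_size < expected - tolerance:
        return "insert smaller than expected (possible insertion)"
    return "normal pair"

print(classify_pair("chr1", "chr1", 360, "LR"))   # normal pair
print(classify_pair("chr1", "chr1", 2400, "LR"))  # deletion-like
print(classify_pair("chr1", "chr1", 355, "LL"))   # inversion-like
print(classify_pair("chr1", "chr6", 0, "LR"))     # translocation-like
```

As in the lecture, a single odd pair means little; you would only believe an event supported by several pairs with the same signature at the same locus.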
Can you go back to the slide with the deletions? Yeah, I think the next one, the one that has the reads. This one. The reads are the small blocks, right? So can you explain exactly where you see the deletion? There are fewer blocks in the middle, and you have these red ones; those red ones should be closer together than they are. What the red shows is that the insert, the space between your reads, is actually larger than expected. Okay, and where do all the reads in the middle come from? So it's a heterozygous deletion, so they're from the normal copy. It's what I was saying: it could be a tumor sample with some heterogeneity, so you would have some cells that have the deletion and some cells that don't, and you get reads from both. If you lost it on both chromosomes in every cell, so a homozygous deletion in a really pure, homogeneous sample, then you would have no reads there at all. And if you go to the rearrangement, two slides back, right: so here, in the first line, the red read and the blue read are supposed to be on the same chromosome. Basically here you have two patterns: you look at the pairs, and part maps to chromosome 1 and part to chromosome 2. You see it because you have pairs that link these two regions. You can actually click on one to look at it, and you will see where the mate is; there is a menu, we will do that, and you can click on a particular read and it will tell you that its mate maps at this particular position on this chromosome. So it's too far? Yeah, but here, I invite you to actually open this, and you can see that you have several supporting read pairs that are on chromosome 1 and
that pair to chromosome 2. If you only had one, you wouldn't trust it; it doesn't mean very much on its own. But here you have a higher-than-expected number of pairs that map across the two chromosomes. Another question, on rearrangements and deletions: if a read pair spans the boundary of a big chunk, like your inversion, will we see the reads being colored? So, you know which reads form a pair: when you do the sequencing, from your BAM file you know that this read is read one of the pair, so you expect the two to be close together, and when they're not, the pair gets colored. It's colored when, from your BAM file, you know that this sequence and this sequence are a pair and they're not where you were expecting. Is that the question? And do the pairs show up on the same line? Yes, they show on the same line. Well, here we've collapsed it, so you have several read pairs per line; with your coverage being very high, if you only had one pair per line, the view would just be way too tall, so we collapse it and you have several read pairs on the same line. But as you will see this afternoon, you can click on a particular read, like on a particular gray box, and it will tell you exactly where its mate mapped, so you can find exactly where the mate is; you can color it as well, okay. I have feedback on the pan-cancer whole-genome analysis, and it's not the answer I expected. The data is going to be both open and controlled, because the data set does include TCGA data, which is masked. So the TCGA somatic mutations will be masked, and only with controlled access could you gain that information. So it actually follows the same split model, yeah. And the germline? Yes, of course, the germline is the identifiable data. Okay, so we're concluded for the lecture.