So if you're not here for the cancer data and its analysis practical workshop, you're in the wrong room, so now it's time to leave. Otherwise, I will get started. As Michelle indicated, all our material is, or will be, on the bioinformatics.ca website. For all the material there, and we have this at the front of all our presentations, we use a Creative Commons license, which basically says that you can take this material, you can share it with your colleagues, and you can use it in your own teaching material if you need to. It's got a share-alike clause, which means that if you share it you have to say where you got it from, which lecture and so forth, and there's sort of a viral component to it as well, in that if you use our slides in a talk of your own, you have to share your slides too. So that's a little catch, but it sort of encourages people to share their slides. In addition to that, I'm overzealous sometimes, I think, but I like to share and I like to encourage sharing, and I like people to tweet, blog, whatever they like to do; the slides are all available for that. So the slides are, as I mentioned, on bioinformatics.ca. 
I also put them on SlideShare as well, and have for more than a year. So today I'm going to talk to you about databases, the databases in which we find cancer genomic information. There are many of these out there, so we're not going to cover all of them; we're actually only going to cover a very small fraction of them. As Junjun mentioned, we're also part of the International Cancer Genome Consortium, ICGC, which is one of the many acronyms you'll hear about today, so we're going to have a heavy focus on ICGC resources. ICGC includes the TCGA, another acronym, The Cancer Genome Atlas, which is an NIH-funded project. We sometimes talk about ICGC and TCGA, and it's sometimes unclear which is which and what it means and so forth, and I hope to clarify that for you today. You have my contact information here. The schedule is that I'm going to give you a bit of an introduction; I'm going to do a little plug and advertisement for the bioinformatics.ca Canadian Bioinformatics Workshops series; then a few databases; then how to get permission to access some of this data, and the privacy and ethics issues related to accessing human data. Briefly, for a few slides only, I'll also talk about COSMIC, which is a very important database used in cancer research. It actually comes out of the Sanger Institute here and is also integrated with the ICGC data portal. So bioinformatics.ca is a website hosted at the OICR, which is the home page for the Canadian Bioinformatics Workshops series. We also post bioinformatics jobs, we also post profiles of Canadian bioinformaticians, a links directory and so forth. There are quite a few resources, but the most important thing about this website is the fact that it hosts the workshops, and you can apply to the workshops and so forth. One of the flagship workshops that we have is Bioinformatics for Cancer Genomics, and this is a five-day workshop. 
We offer it once a year, and the details of this workshop are here. Next year we're planning to host this workshop and a few other ones. The slide is a bit cut off, but we will probably have up to 12 workshops next year. So we have Cancer Genomics, which I just mentioned; we have High-Throughput Biology, which is basically a next-gen sequencing workshop; we have Introduction to R and Exploratory Analysis of Data with R; we have an RNA-seq workshop; we have Pathways; we have Statistics; Analysis of Metagenomics; and the three bottom ones are new workshops we're going to introduce next year: Bioinformatics for Big Data, so learning how to use cloud computing to work with the various data types, but mostly next-gen sequencing data; Epigenomic Data Analysis; and Quantitative Genomics. These are all two- or three-day workshops, except for the cancer one, which I told you was five days, and they're offered mostly in the summer, but we'll probably expand a bit outside of that time period. And not only will this workshop have its material online, but all workshops have their material online. So for each of the workshops that took place last year, you can go online and get the PowerPoint or the PDF files, or there's a movie file. Like Michelle mentioned, we record the voice over the PowerPoint. 
And so you have the lecturers' voices in a movie format that you can download to your iPod and listen to or watch while you're going home on the train. If you have any questions and you want to email us about this course, it's course_info@bioinformatics.ca. The website is there, and there's a mailing list where you'll get announcements about new workshops. It's a very low-volume mailing list, where you get a few announcements about the new workshops being scheduled, the material being online and so forth. So this is sort of my opportunity to share with you that part of the reason why we make the material available online is that open access, open data and open source are really essential for science, and so this is us demonstrating that and applying it to what we do. Being open about the science we do is not only something we think we should do; it's a responsibility and an obligation, and something that comes with the privilege of getting publicly funded to do this work. Obviously the workshop series is subsidized by our institution; we get sponsors, we collect fees for the workshops and so forth, but it's not run as a for-profit organization. It's really run to break even, so that we can pay for all the expenses that we have to pay for. So I'm going to start off this databases of cancer data lecture by showing you a few slides from this book, The Emperor of All Maladies. How many people have read this book? Well, half the class, great. I think it's a great book. I really enjoyed reading it. There are some great quotes in it: cancer therapy is like beating a dog with a stick to get rid of its fleas. 
So that's not very good, and obviously understanding cancer, and understanding what is not functioning properly, is key to the work we do. We think that by understanding cancer at the genomic level, understanding the mutations and the pathways that are involved in cancer, we'll have a better understanding of which therapies are best to apply. And obviously there's a whole series of pathways and biological functions that need to be understood to be able to intervene in ways that make advances in cancer therapeutics. Another quote from the book, from Bert Vogelstein, is that the revolution in cancer research can be summed up in a single sentence: cancer is, in essence, a genetic disease. We sort of tweaked it a little bit at OICR: cancer is a disease of the genome. I think everybody here probably agrees with and understands this. But of course a big challenge in treating cancer is that every tumor is different and every cancer patient is different, even with the so-called same pathology. And so even if you see two patients with... whoops, that wasn't supposed to happen. Even if you see two patients with pancreatic cancer, for example, you'll see different mutation profiles, you'll see different phenotypes and so forth. So trying to understand what's common between patients with cancer of the same type, and what's unique about them, is also critical. And so one of the things that is important to be able to do at the beginning is to know what is already known about the various tumor types that you're interested in. This is why we have databases: to store information and have it retrievable, so that we can understand what's already known. The first thing you do when you want to look up a database of interest is you go to Wikipedia and you look up cancer databases or cancer genome databases, and this is one of 
the second or third pages that comes up. If you look at that page, you will see a number of databases that are available, and we're only going to talk about three of them here today: the ones that have the red circle next to them, which you don't see on there, but it's on the handout. So that's TCGA, ICGC and COSMIC, as I mentioned before. I have here some publications that you can look up. If you just replace the integer there, the PubMed ID, in the URL at the bottom with that ID, then you'll get these papers: from TCGA, The Cancer Genome Atlas; ICGC, which is the International Cancer Genome Consortium; or COSMIC, the database of somatic mutations. There's also a paper about data access, which I'll spend some time talking about as well. The other thing I should mention is that everything I'm presenting to you today will probably have changed next year, or six months from now, or what have you. For example, TCGA: we're actually working with colleagues in the US to change the whole TCGA data portal and how TCGA data is going to be made available. So if you came and took our cancer genomics course next year, you'd get the new version, and that's probably true every year for many of these resources. That said, the TCGA will always exist in one form or another. There may be a different user interface to understand and view it, but the data will remain the same, because the TCGA 
project itself is basically almost finished, in that they've generated the data. The data will be available in a number of different portals and different views, but there's not going to be much more new data. It's funded by the National Cancer Institute within the NIH, and initially it was also funded by the NHGRI, the National Human Genome Research Institute. So those are two different institutes within the NIH, the National Institutes of Health. NIH is made up of 27 institutes and centers, and the NCI is the largest one of all; it's the best-funded one. So there was a pilot project for the TCGA, and then a full project, which is just finishing now. And I mentioned the ICGC: when the ICGC, the International Cancer Genome Consortium, started, TCGA joined forces and became part of the ICGC. So TCGA is part of the ICGC. That said, the TCGA data is available, there's information on there, and they have a wiki that they maintain. The way the TCGA project was developed over time was to divide up the various tasks, so that not every center would have to do QC, or quality control, on their samples, or sequencing and so forth. They identified several centers that would do these specialized tasks and therefore have centralized, concentrated efforts for different tasks across the US. So the TCGA data includes sequence reads, the raw data, which is held at the Cancer Genomics Hub. There's analyzed data, variation calls and so forth, available on the cancer genome portal, and integrated also, as I mentioned, with the ICGC, and later today you'll be looking at some of this data in our labs. So the flow of information goes from these various places collecting tumors and doing QC. The big challenge with the TCGA and the ICGC is always to get high-quality samples, and to get patients that have the specific tumor type you're interested 
in, and to have them early enough in their diagnosis that they have not been treated yet, and to have samples of high enough quality. One of the first things that the TCGA pilot discovered is that many of the tumors that were biopsied were actually of quite low quality, and that meant that any downstream analysis, extraction of DNA, RNA and proteins and so forth, was sort of moot if the tumor you started with was of low quality. So they caught themselves working with samples that were not of the best quality, but because it's centralized they were able to change their SOPs, their standard operating procedures, and update and improve the workflows to make things better. The other thing that the TCGA did, and that we are also doing at the ICGC, is to have a data coordinating center. This is actually something the NIH has learned from funding many genome projects over the last 20 years, basically: to have a centralized place that hosts the data, standardizes the way the data is presented and so forth. That's been key for TCGA, and it's been key for ICGC as well. So the data portal itself provides a platform to search, download and analyze TCGA data sets. They have two data access tiers, and you'll find that this comes around in other data portals as well. You have what we refer to as open data and controlled data. Open data is data that can be freely shared; you don't even have to identify yourself, you can just download it, manipulate it and do whatever you want with it. Controlled-access data is all the data that is deemed re-identifying, where you could re-identify, should you have the right 
expertise, the patient. So if you have a few SNPs, for example from a blood sample from a patient, you could re-identify them within this database, because you're looking at raw data from their genome. And that applies to any data type that's considered re-identifiable. Here we do not have access to names, home addresses and things like that, but we do have access to their genomic information, and the genomic information itself is good enough to re-identify the person, should you have some other DNA to match it with. So if we have access to raw genomic information, not somatic mutation data but germline variants, so inherited variants, then that is deemed identifiable and that is deemed controlled access. That's an important thing to keep in mind as we work with this data. So this is the data browser, which is actually not a data browser I use very much, because I mostly use the ICGC one. You'll see that the ICGC has some of the TCGA information, and we'll explain that as we go along. That said, if you have controlled access from TCGA... Sorry, Michelle. Oops, I have 15 minutes left? 10 minutes left. 11:45. Okay, I'm going to speed up. Okay, the ICGC. So icgc.org, that's our home page. It's an international project. 
So the idea of the ICGC is to collect 25,000 tumors from 50 different tumor types. So it's 500 tumors per tumor type, which is 25,000 tumors, which is 50,000 genomes. It's a 10-year project, and right now we're in year seven of this project. So you'd think we're about three quarters of the way done, but we've accumulated only about half the data, so in the next quarter we'll finish the other half. But that's sort of true for most projects, and if you actually look at this curve of data accumulation, you get a good idea of how fast it goes: it was slow at the beginning, and as we go along it gets faster. On the home page, in the top bar, you have information about the various panels that are available at the ICGC website, and you can browse each project. You can click, and you'll see, for example, pancreatic cancer, which is one of the projects we do at the OICR; it takes you to a page that describes the project and so forth. So this is more about who's involved, what the goal of the project is, which funding agencies and so forth. But that's not that interesting if you're looking for the data. 
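As an aside, the project-size figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: the function and variable names are made up for illustration, and the factor of two genomes per tumor is an assumption (one tumor genome plus one matched normal genome per donor), which is the usual way 25,000 tumors become 50,000 genomes.

```python
# Figures quoted in the talk. The factor of two genomes per tumor is an
# assumption (one tumor genome plus one matched normal genome per donor).
TUMOR_TYPES = 50
TUMORS_PER_TYPE = 500
GENOMES_PER_TUMOR = 2


def project_scale(tumor_types, tumors_per_type, genomes_per_tumor):
    """Return (total tumors, total genomes) for the planned collection."""
    tumors = tumor_types * tumors_per_type
    return tumors, tumors * genomes_per_tumor


def time_elapsed(current_year, total_years):
    """Fraction of the project timeline that has passed."""
    return current_year / total_years


tumors, genomes = project_scale(TUMOR_TYPES, TUMORS_PER_TYPE, GENOMES_PER_TUMOR)
print(f"planned tumors:  {tumors}")
print(f"planned genomes: {genomes}")
print(f"time elapsed:    {time_elapsed(7, 10):.0%}")
```

Year seven of ten is 70 percent of the timeline, which is the "about three quarters" mentioned above, against roughly half of the data collected.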
So if you're looking for the data, the place to go is the data coordinating center, the DCC site, and that's available, as shown to you here, from the home page. This will take you to dcc.icgc.org. The next lecture is going to be all about this, so I'm not going to go into too much detail, but you'll be spending some time there; this is the page where you'll be starting from. What I'm going to do is try to explain to you where the data that ends up in this place came from. So this is a rather complicated slide, but it basically points out the two main sources of data that the ICGC is comprised of. As I mentioned, TCGA is part of ICGC, so there's the TCGA part and the non-TCGA ICGC part, which is the other half, and that's all the other countries. So it's Asia, Europe, South America, yeah, and one African, one sort of Middle Eastern country. At the OICR we bring all of this together. We get all the open data from the TCGA and all the open data from ICGC, and we put that on a data portal. Then there's all the controlled-access data, for example germline variants. We cannot get those from the TCGA, because those are held behind the TCGA's own access control. The TCGA has dbGaP to control controlled access, and at the ICGC we have DACO, which is the Data Access Compliance Office. Because DACO users, which are ICGC users that are allowed to look at controlled-access data, do not have access to dbGaP, you have to get two permissions from two different groups to be able to access everything. And then... oops, sorry. Once you get access to both of them, then you can access the whole thing. So as I mentioned, TCGA is part of ICGC, but there are differences between the two. They have different tumor types. 
They have different geographic rules: different countries have different rules, so it's many countries versus one jurisdiction. TCGA is only one country and ICGC is many countries. And, this is where it gets a bit complicated, we actually have different definitions of what we call controlled data, and different access rules as well. So for ICGC, on the left-hand side is all the open-access data and on the right-hand side the controlled-access data. You'll notice at the bottom here, I think you can see it, yes, that all somatic variants from exome or whole genome are open. So a somatic mutation, wherever it comes from, is considered open data. In the TCGA, only somatic variants that come from exome data are considered open; somatic mutations that come from whole-genome data are considered controlled access. The reason is that the US, the NIH and so forth do not consider the whole-genome mutation-calling software to be perfect. I don't know why, but they don't think it's perfect. And so they think that in the whole-genome effort there are some germline variants that filter through by accident. There probably are some; finding a way to identify those will probably take us 10 years, and it won't matter anymore in 10 years, but that's a separate discussion, not for here. It's important to consider, though, that when we show you variants from a TCGA tumor on the ICGC data portal, we're only showing you variants that have come from exome sequencing, not from whole-genome sequencing. Otherwise, if it's from, let's say, an Australian tumor, the mutations there will be from either exome or whole genome. Right now the majority of the data is from exome. 
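The open versus controlled rules just described can be written down as a small decision function. This is a sketch of the rules as stated in this talk only, not an official ICGC or TCGA API; the function name, argument names and string labels are all hypothetical.

```python
def access_tier(source, variant_type, assay):
    """Classify a variant call as 'open' or 'controlled'.

    source:       'TCGA' or 'ICGC' (meaning a non-TCGA ICGC project)
    variant_type: 'somatic' or 'germline'
    assay:        'exome' or 'wgs' (whole-genome sequencing)

    Encodes only the rules described in the talk; labels are illustrative.
    """
    if source not in ("TCGA", "ICGC"):
        raise ValueError(f"unknown source: {source}")
    if variant_type not in ("somatic", "germline"):
        raise ValueError(f"unknown variant type: {variant_type}")
    if assay not in ("exome", "wgs"):
        raise ValueError(f"unknown assay: {assay}")

    # Germline (inherited) variants are re-identifiable: always controlled.
    if variant_type == "germline":
        return "controlled"
    # TCGA treats somatic calls from whole genomes as controlled access.
    if source == "TCGA" and assay == "wgs":
        return "controlled"
    # Every other somatic call is open data.
    return "open"


def shown_on_open_portal(source, variant_type, assay):
    """True if a call could appear in the open tier of a data portal."""
    return access_tier(source, variant_type, assay) == "open"


for src in ("ICGC", "TCGA"):
    for assay in ("exome", "wgs"):
        print(src, "somatic", assay, "->", access_tier(src, "somatic", assay))
```

The one row that differs between the two projects is somatic calls from whole genomes, which is why TCGA tumors on the ICGC portal only show exome-derived mutations.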
There's about, let's say, 10 to 15 percent that is from whole genome, but I think in the future more and more will come from whole genome, and so the difference between the two will become more apparent. There are a lot of similarities, though. All the ICGC and TCGA users agree to keep all computer systems secure, to protect controlled-access data, to monitor the usage of it and so forth. They've all also agreed to destroy it once they finish their work, to only use secure transfer protocols, to encrypt controlled-access data and so forth. This is one example of a file that we keep on the ICGC data portal. What we've done is we've archived all the mutations from all tumors and put them in the VCF format, which is the standard format for mutations. So this file has got all the somatic mutations that we have at the ICGC. That means it has all the whole-genome-derived mutations from non-TCGA samples, and for TCGA samples it doesn't include mutations that have come from whole genomes; we don't see those, because they're not open data. Okay. So this is the ICGC portal overview. One of the examples I'm going to quickly go over is how to get access to this controlled-access data. You basically identify yourself, and you fill out a form detailing the project, the contact people, the technology you'll be using to ensure that the data is kept secure, and the fact that you've read all the data agreements. Then all this gets put into a PDF. You sign the PDF, and you get it signed by an authority at your institution who's able to fire you should you break any of the rules outlined in this document. Then you send it off to the DACO office, which is a number of legal experts and bioethicist types that will review your application and then grant you access. So this is the DACO form. 
This is identifying yourself, and this is starting the process. You fill out a form; it's a bit like an income tax form. It's pretty long, lots of panels and so forth. I mentioned all the various categories, and recently, as of last week, we actually added a cloud storage section, because we all know now that doing a lot of this work in the cloud is different and requires different permissions. So if you're going to use clouds, you have to opt in and agree to an additional set of rules that you will be abiding by. If you don't want to use cloud computing to analyze your data, then you can opt out, and then you don't have to worry about those. But there are a number of groups, obviously in Europe, where the Safe Harbor rules have been challenged recently and so forth, so we had to adapt to those changes. These are all the documents, and you have to sign off on having read them. The two new documents we've added recently are the best practices on being secure in the cloud and the Global Alliance framework for responsible usage of genomic data. If you fill in the form and validate it without having signed off, then everything's red and something's wrong. But if you check off all the things properly, then the validation gives you a green light and allows you to print this PDF, which you can then sign and send off to the DACO office. On the website you'll find the list of all the DACO-approved projects. So basically you now have the DACO approval. You have to go through a similar process if you want to get access to TCGA data. They're totally independent of each other, so you have to do that one too, and there's a cloud component in that one as well. So quickly, at the end, I'm just going to talk to you a little bit about the COSMIC database, 
which is out of the Sanger Institute here in Hinxton: the Catalogue of Somatic Mutations in Cancer. Basically, the way COSMIC works is that it gets its data from places like the ICGC and from PubMed, so from reading the literature, and it looks for all the papers that talk about somatic mutations. COSMIC predates ICGC and TCGA, so they've been collecting somatic mutations for a long time. But you can anticipate that papers that collected such data before whole-genome or exome sequencing was being done were doing targeted sequencing, or would even look at one gene. So they'd, say, look at KRAS in 10,000 different individuals and look at where the mutation was, and that's the kind of information that goes into COSMIC. So today, if you go into COSMIC and say, show me all the mutations in KRAS, a lot of them will come from those types of papers, which looked specifically at, you know, 5,000 patients, and looked at that one gene and didn't look at any other gene. Or they looked at a gene panel, so they looked at 10 genes or 20 genes. It also includes data from ICGC/TCGA, but for TCGA it's only getting the data from exome sequencing, not from whole-genome sequencing. So, the numbers right now in ICGC: 
There are about 12,000 genomes represented, and about 2,000 of those are from whole genome and about 10,000 from exomes, so it gives you an idea of the ratio and of what you'd be missing from the TCGA side. So COSMIC, as I mentioned, is somatic mutations only, from various sources. It's a very rich website, with lots of different ways of looking at the data there. As I also mentioned, it's integrated into the ICGC portal, so from the ICGC portal you'll see a COSMIC link, and you can go from a gene of interest in the ICGC portal to the COSMIC equivalent. There are FAQs, frequently asked questions, available on their portal, and lots of different ways of looking at your various genes of interest. So just a few screenshots to show you all that, and I'm finishing off now. No, take any seat; you've only missed the best lecture. So in closing, remember that these sites have lots of documentation. Things are changing very quickly; like I mentioned, the things I told you about TCGA today will be different next year. We know they're going to be different, because we're actually building the next one, and so the changes are coming quickly. Don't be afraid to explore, don't be afraid to go click on things and find out what's there. And if you're interested in what you're hearing about today, we do a five-day workshop. 
So today we're doing a four-hour workshop; we spread this over five days, so you can imagine that we can do a lot more in five days. This is a different slide from the one you have in your slide deck, because I found some other people who have been involved with the workshop, and there are actually some more that I'm missing. The CBW has been running for 16 years, and these are really only people from the last seven or eight years. It's a really exciting and great group of people to work with. So thank you very much, and if there are any questions, I'm going to be here for the rest of the day as well. Okay, any questions, or do we move on to the next