Okay, so we're going to get started with our next session. First, the obligatory but very important open access slide. In case some of you don't know, I'm a big proponent of open access, and so this is something we did several years ago for the workshop, and it's only been beneficial for all of us. It's never been abused, except for one interesting thing: sometimes I'll go to a talk and I'll see one of my slides, from somebody I have no clue who they are or who they know that I know. It sort of went through several people and ended up in their slide deck somehow, without my name, and I don't say anything. But it does happen, and it's a very interesting thing. So the other slide I have says that you can do whatever you want with my talk, basically. I use Twitter, so I'm at BFF on Twitter, and if you want to comment about this workshop, the tag for the workshop, if you're a Twitter user, although you should not be using Twitter during my lecture, is #CBW2011. So this session is called Visualization. It's actually a poor title for it, in the sense that it's really an excuse for me to explain all the things that we want to see, but also the data types we want to work with and what they look like. There are many, many ways to visualize data. We cannot look at all of them. There's one that John actually mentioned this morning, Circos, which is becoming more and more popular, and which I should have put in my lecture. I didn't. Is somebody else doing it? No. Yes, yes. And some of the ones I was going to talk about, I'm actually not talking about, because I knew somebody else was going to cover them. There are some issues with respect to visualization which were a bit of a challenge for me, especially for gene expression, because it was going to be hard to do visualization of gene expression without having covered gene expression.
So actually Paul is going to cover that tomorrow. I will touch on it a little bit, and some of the tools I'm going to talk about he will be using, so it should all work out. Like I mentioned, you guys are sort of the guinea pigs for this new workshop, and at the end of the workshop we have a course evaluation, and it's a great opportunity for telling us, "Francis totally messed up. He should have done his lecture on that day, or after that, or whatever. He should have covered this and that." I will welcome that kind of feedback. Yes, good start, good start. Please interrupt me. Yes. You have it now? So, actually, an interesting observation here. These are three screen captures of the same gene, TP53, in three different browsers. Does anybody observe anything interesting? Noteworthy? Look at the fine print; it's fuzzy, almost on purpose. Nothing? So yes, the bottom one and the top one are taken from genome browsers, and the middle one is taken from a gene browser. So it's five prime to three prime in the gene browser, but on the genome it's the other way around, three prime to five prime. So this is something to always keep in mind: the orientation and some of the cues that the various tools provide are often very similar, but they are not always the same. I'll cover some of it in the lectures, but it's something to keep in mind. So I'm going to talk about what cancer data is and follow up on some of what John talked about. Where is this data? What can you do with it from a genomic point of view? That is what's going to be talked about in this module. I will not really deal with epigenomic information, although there is some at some of the sites I will talk about. I will not really cover transcriptomes or pathways, which are going to be covered by modules later in the week.
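The orientation point about the three TP53 screenshots can be made concrete with a small sketch. It assumes a gene annotated on the minus strand of the reference; the helper names and the coordinates are hypothetical illustration values, not real TP53 positions. A gene-centric view reads 5' to 3' along the gene, which for a minus-strand gene means reading the genome right to left and complementing the bases.

```python
# Illustrative sketch only: helper names and coordinates are made up.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the sequence as read 5'->3' on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def gene_to_genome(offset: int, start: int, end: int, strand: str) -> int:
    """Map a 0-based offset from the gene's 5' end to a 1-based genome position.

    start and end are 1-based inclusive genome coordinates with start <= end.
    """
    if strand == "+":
        return start + offset
    if strand == "-":
        # The gene's 5' end sits at the highest genome coordinate.
        return end - offset
    raise ValueError("strand must be '+' or '-'")


genome_view = "ATGGCC"                      # as drawn left to right in a genome browser
gene_view = reverse_complement(genome_view)  # as drawn in a gene-centric view
print(gene_view)
print(gene_to_genome(0, 1000, 2000, "-"))    # first base of a minus-strand gene
```

The same interval therefore has two legitimate pictures, and which one you are looking at depends on whether the tool is anchored on the genome or on the gene.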
Then, again, genomic coordinates and file formats and things to keep in mind. And I'll spend a bit of time on IGV, which is one of the genome viewers, and on UCSC. So this is a quote from a book I recently read, which, if you're interested in cancer biology, is a very, very good book. Anybody read this book? Yes, one, two. It's a great book, I thought. Did you think it was a great book, too? Yes, me too. It's basically the history of cancer, dating back several centuries, and so it's quite interesting, all the things we do. It got a Pulitzer Prize, it made the New York Times top 10, and so on and so forth. So it's done really, really well. It quotes Bert Vogelstein saying that the revolution in cancer research can be summarized in a single sentence: cancer is, in essence, a genetic disease. We've borrowed and modified this quote many times, John has and I have. It's like, cancer is a disease of the genome, and I've modified it further by saying cancer is probably a disease of pathways. So obviously you focus it on the kinds of things that interest you. But it points to the challenges of looking at this data. And so what do we want to do here? Why are we studying this? We're obviously interested in prevention, in diagnosis, and in treatment. There's not much we can do today. I'm not going to cover many things about any of these topics, really, but looking at cancer genomes is definitely not going to help the epidemiology or the healthy lifestyle type of analysis that we need to do. For diagnosis, looking for markers of multiple types that can be involved in various cancer types definitely helps, and it also helps us find subtypes that will require very different treatments.
Family history, physiology, assessment, histology, transcriptomic and genomic information: those are the kinds of things that we're going to be looking at, and that would fit under diagnosis. And of course there's treatment, finding which proteins, or proteins in general, which usually are in pathways, have been modified or need to be modified to restore normal physiological processing. Hanahan and Weinberg wrote a classic review about 10 years ago, The Hallmarks of Cancer, and they wrote a next-generation version of it earlier this year, where they very nicely summarize the expertise of the field, the various types of treatments that are available, and the various types of generalizations that one can make about cancer. I definitely invite you to have a look at this paper if you're interested in an overview of this field. Oh yeah, please put your laptop down. And I'm going to wait for the sound. So these are some of the things that John talked about earlier, but setting it in context: in the 80s and 90s we had human gene expression sequencing, ESTs, mRNA, and so forth. Then we had the human genome mapping and sequencing. Then we had population analysis and polymorphisms, GWAS studies, and so forth in the 90s and 2000s. Then there's this famous Homer paper that came out in PLoS Genetics, which I'll include in the wiki, which basically demonstrated that in a GWAS study where you have an affected group and a control group, if you had a few SNPs from an individual, you could figure out statistically the probability of them belonging to one or the other.
So if you had, let's say, a study on bipolar disorder and you surveyed several thousand people, and you had somebody's SNPs, like 100 SNPs from an individual, you could say it's more likely that they belong to the people that are affected by bipolar disorder versus the control group. And that data, the GWAS studies, used to be openly available to everybody, and then it just got clamped down. NIH went, I think, crazy the other way by making all this data super hard to get: not only was it under controlled access, but it was hard to get access to it. Now it's adjusted a bit more, but that kind of data is still under controlled access because it's deemed identifiable. So a GWAS and, of course, a genome sequence are deemed identifiable of an individual and therefore should remain accessible only to scientists who demonstrate that they need access to that data to do the specific research for which it was consented. So there's a whole partition between controlled access and open data that became quite big, and it's actually relatively recent, just a few years ago. The Cancer Genome Atlas pilot was initiated about 2006, I think it was, or 2007, just before the ICGC that John talked about, and at the same time there was also the 1000 Genomes Project. So after the one genome project, the first version of which came out in 2001, where we sequenced one genome over 10 years at a very high price, then with the advent of next-gen sequencing we were able to attack the 1000 Genomes Project, which is actually somewhere like 2,500 genomes. And it's generated a lot of new data, and the 1000 Genomes data was open because it came from cell lines that were open and available to everybody.
We talked about the International Cancer Genome Consortium, ICGC, which had what I would call a pilot and a slow startup, and now we're getting into the full phase, and the TCGA is also in full phase and is now part of the ICGC. I'll get back to that later. So the whole idea behind doing all this work, of course, is that we think that genomic variations lead to, or are responsible for, cancer. I mean, that's the hypothesis we're testing. We're saying changes in our genome are what cause cancer. And that's something we have to keep in mind. What happens, though, is that there are lots of changes in our genome which don't cause cancer. And so we have to figure out what's what. And there are all sorts of changes there. We're talking about somatic mutations, which are also referred to as single nucleotide variations, or SNVs, as opposed to SNPs, which are single nucleotide polymorphisms, which are genetically inherited changes. So SNVs are somatically acquired. Then small insertions and deletions, rearrangements, all of these things, which lead to this other area of research we'll study this week, copy number variations, also called copy number alterations or CNAs. And there are lots of other types of changes that are happening. This is a figure from a Nature paper from the Wellcome Trust Sanger Institute in the UK, by Erin Pleasance, who's now at the GSC in Vancouver, actually, where she did her PhD; she did her postdoc in the UK, did a few Nature papers, and then came back to Canada. This is one on a cell line where they've looked at all the mutations and classified all the different kinds. And you can see, if you look at the ones that affect coding sequences, that most of them are not missense. So a missense change is one that would affect a coding sequence, a CDS, the coding sequence of a gene. It's only a small percentage of those that are responsible for that.
And only a small percentage, zero, of those are small insertions and deletions affecting coding sequence again. And the same for rearrangements and so forth. So it's a really, these events we're looking for are quite rare. They're quite rare if they're affecting coding sequence. Coding sequence represents what percent of our genome? 1, 10, 100, 0.1. What percentage of our genome encodes coding sequences? Two. Do I hear any other numbers? Three. Zero point three. So we're in a 10-fold range so far. That's pretty good. Anybody else? John? So he's right in the middle. So people use one, one-and-a-half as a number usually for that. So one percent of a genome, and so when you're doing an exome, for example, we talked about exomes earlier today, you're sequencing about one percent of a genome. So it's three gigs, and one percent would be how many bases? One percent of three gigs? Thirty megs. Thirty megs, yeah, very good. Good, quick math there. So that sort of gives you an idea, but what's the assumption there? What assumption are we making with all these statements? Not you, Michelle. We're making the assumption that cancer is caused by coding sequences. Or the mutations or the alterations in our genome that affect, that cause cancer are in the coding sequences. Is that a bad assumption? Yes, it is a bad assumption. It is a good assumption. I mean, it's actually not a bad assumption. I mean, it's one what we would call the low-hanging fruit. I mean, most cases we know of actually are many cases that we do know of, but we don't know of all the cases, of course, are caused by modified proteins, fusion proteins, deletions, missing proteins, overexpressed proteins, all of that. And what sort of, actually, what line of evidence sort of pointed us there in the first place before next-gen sequencing? Well, cytogenetics pointed to sort of rearrangements. In some cases, not always. 
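The quick exome arithmetic from the room can be written down as a tiny sketch. The genome size and the one to one-and-a-half percent coding fraction are the round numbers quoted in the lecture; the function name is just an illustration.

```python
GENOME_SIZE_BP = 3_000_000_000   # ~3 gigabases for one (haploid) copy, as quoted above
CODING_FRACTION = 0.01           # "one, one-and-a-half" percent; 1% used here


def coding_bases(genome_size_bp: int, fraction: float) -> int:
    """Approximate number of coding bases for a given coding fraction."""
    return round(genome_size_bp * fraction)


exome_bp = coding_bases(GENOME_SIZE_BP, CODING_FRACTION)
print(f"{exome_bp / 1e6:.0f} Mb")   # 30 Mb
```

With the higher estimate of 1.5 percent the same function gives 45 Mb, so the "thirty megs" answer is the low end of the range people usually quote.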
But there are other lines of evidence, after cytogenetics and before next-gen sequencing; I'll give you a 50-year period, or a 100-year period, to look at. Gene transfer experiments? Gene what? Gene transfer experiments. Gene transfer. That's... Yes, yeah, those are old experiments. Good old experiments, yeah. But there are a lot of gene expression experiments, right? So we've looked at gene expression profiles in tumor versus normal, and we see gene expression is all over the place. It's very different in a tumor cell than it is in a normal cell. So we suspect that there's something in the expression of the genes that's associated with cancer, right? There are lots of papers about signatures of gene expression that are associated with different tumor types. That's one hint that there are obviously a lot of things affecting proteins, but one could easily argue, and many have, that you could modify five-prime and three-prime sequences and modify gene expression, right? So these would be non-coding sequences that are mutated and could affect gene expression. That's something, obviously, that the epigenetic type of analyses are looking at as well. So there's a whole arena there, too. So it's not just the protein sequence. It probably is many of the times, and, like I said, it's the low-hanging fruit, so it's the easiest thing to look at, and we should definitely look at it. But we should also keep in mind that there might be some other answers. And so where do I plant my flag? What do I do with all the rare events in a large genome? So it's actually three gigabases, but we have two copies of it. And then with the heterogeneity that John talked about earlier today, there's actually six n chromosomes, I mean six gigs, two n chromosomes, six gigs, six n, what am I talking about? Copy number variation. So two n chromosomes, six gigs, but cellularity, when we sample a tumor, could be 20, 30, 40 percent.
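To see why cellularity matters for finding rare events, here is a minimal sketch of a simple mixing model. The model is an assumption added for illustration and is not from the lecture: the sample is a mix of tumor and normal cells, reads are drawn in proportion to DNA content, and the function name and default copy numbers are hypothetical.

```python
def expected_variant_fraction(cellularity: float,
                              tumor_copies: int = 2,
                              mutant_copies: int = 1,
                              normal_copies: int = 2) -> float:
    """Expected fraction of reads carrying a somatic variant at one locus.

    cellularity   -- fraction of cells in the sample that are tumor cells (0..1)
    tumor_copies  -- copies of the locus in each tumor cell (2 = diploid, assumed)
    mutant_copies -- how many of those copies carry the variant
    normal_copies -- copies of the locus in each normal cell
    """
    if not 0.0 <= cellularity <= 1.0:
        raise ValueError("cellularity must be between 0 and 1")
    mutant = cellularity * mutant_copies
    total = cellularity * tumor_copies + (1.0 - cellularity) * normal_copies
    return mutant / total


# Heterozygous somatic variant in a diploid region, at the cellularities quoted above.
for c in (0.2, 0.3, 0.4):
    print(f"cellularity {c:.0%}: ~{expected_variant_fraction(c):.0%} of reads")
```

Under these assumptions a heterozygous somatic change in a sample with 20 percent tumor cells shows up in only about one read in ten, which is part of why low-cellularity tumors are harder to work with.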
But that varies, of course, from tumor type to tumor type, right? If you look at a CLL, it's pretty homogeneous. I mean, you can sort of enrich for your cells of interest quite easily. Pancreatic cancer is quite difficult, and there's everything in between. So we've got all these nucleotides that are moving and changing, deletions, insertions, a lot of single nucleotide variations. And then we have, with the first human genome, we have this coordinate system. So we now know where most, not all, most nucleotides live. And the world is actually, as far as genome browsers are concerned, it's sort of trilingual, right? Not everybody speaks the same language. Not all icons mean the same things. But at the very least, and it took them a few releases to figure this out, but at the very least, these three guys, the EBI folks, the NCBI folks, and the UCSC folks, actually all use the same coordinate system. So nucleotide one of chromosome one is the same for all three places. It's basically agreed upon. What the differences are, though, between the three places, and everybody else for that matter, is what do we decorate these nucleotides with? Which genes do we put on? Which transcripts do we put on? That will vary from one browser to another. So that's sort of a little challenging. There are some sort of clear, sort of, I don't want to say winners, sort of leaders in the sense that they, everybody sort of uses certain types of genes and put it as a subset. And the, actually, the folks at, it's hosted at NCBI, but it's actually a consortium of all the EBI, UCSC, and NCBI, and a couple of other groups that agree not only on the, this coordinate system, but they actually also agree on a core set of proteins, of genes, coordinates. And what they did is they said, okay, you show me all your genes, I'll show you all my genes, we'll overlap the sets, and everything that we agree upon is going to be the core set that we all agree upon. So how many genes do we have? 
Another great question. Let's see how far are they. So there was a contest, right, before the even genome. So in 2000, it actually started in 99, 98. And so at Cold Spring Harbor, we would meet there every year and we'd guess how many genes are there in the genome. And the further away you were from the finished product, the cheaper the bet was. So at first it was a dollar. So for a dollar you could put down your number of how many genes you thought they were. So in 96, 97, 98, that was a dollar. 98, 99, that was $5. And in 2000 it was $10. The year before the genome was finished, people were hearing rumors and so the cat was out, but if you wanted to bet a number, it was going to cost you $10. So the range of predictions, so I'll give you a ballpark figure, so you can tell me where it is, between, I think probably between like 24, 23,000 and 120,000. So what was the right number in the end? In 2001, when the papers came out. The first paper. 30? Yeah. Any other, do I hear higher? Do I hear lower? 20,000. Actually, yeah, so I think the number, if I, John, what was it when the papers came out? The two papers were 23, 24? Yeah, so it was about that. And the lowest one was about 25. And it was actually from Phil Green. He's the one that won that bet. He had bet the lowest number. I mean, nobody, everybody was around 50, 60, or a lot of, I had 30. I was, but I was way, I had a lot of people, I got influenced by a colleague. Yes? How much did he pay him? Probably about $100 or something like that. No? I thought Phil, from whose group? Lee Hood's group. Anyway, I thought it was, I thought, well, I heard a different story, I guess. This is why I have John here to make sure that my stories are straight. I did have a very big bottle of scotch. Oh, yeah. So what was your bet? I just bet people that had the closest to them. Back in the book it was, I bet, but I'm late. Oh, okay. Yeah. Anyway, so, but it's actually an added complexity. 
So, although we all have the same coordinate system, we now have different names for the same coordinate system. Over the years, as you see, from July 03, 04, 05, 06, the assemblies used to be called NCBI 34, 35, 36, and then they got renamed GRC, which is the Genome Reference Consortium, human 37. And hg16, 17, 18, and 19, that's the UCSC name. Okay. UCSC and IGV and most browsers do not deal with this little modification here, which is that we have GRCh37, and we have patch release one, patch release two, patch release three, four, five, and patch release six is coming out in September. So what's a patch release? It's a change at a specific locus that doesn't affect the coordinate system. So unless it patches right on top of your gene X, it won't move your gene. The coordinates for your gene will not change. A patch-level correction does not affect the position of any of the other genes. Okay, so that's a good thing. Because the big thing with any genome browser, when the coordinates do change, is that you have to recompute everything, and you can lift over from one copy to the other and so forth, based on the coordinates or on the actual positions of things that move, and that's a bit of a headache. But you have to keep this in mind, and if you go to this page and scroll down, you will see all the changes that were made for patch five on 37. It affects thousands of nucleotides, if not more, in each of these patch levels. If it hits your gene, then it's changing amino acids, it's changing the environment, transcription factor binding sites, it's affecting all those things. It turns out that there are still about 500 gaps in the human genome, and those are places where we don't know. We may know the distance, but we actually have no clue which nucleotides are there. And also, oh yeah, interestingly, a lot of patches, for example, affect chromosome six.
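The naming issue described above can be captured in a small lookup sketch. The pairs are the NCBI/GRC and UCSC names listed in this part of the talk; the function names and the `.pN` suffix handling are illustrative assumptions, reflecting the point that a patch release keeps the primary coordinate system unchanged.

```python
# Same assemblies, two naming schemes (NCBI/GRC name -> UCSC name).
UCSC_NAME = {
    "NCBI34": "hg16",
    "NCBI35": "hg17",
    "NCBI36": "hg18",
    "GRCh37": "hg19",
}


def ucsc_name(assembly: str) -> str:
    """Return the UCSC name for an NCBI/GRC assembly name.

    A patch suffix such as '.p5' is ignored here, on the assumption that
    patch releases do not change the primary coordinate system.
    """
    base = assembly.split(".p", 1)[0]
    return UCSC_NAME[base]


def same_coordinates(a: str, b: str) -> bool:
    """True if two assembly names refer to the same coordinate system."""
    return ucsc_name(a) == ucsc_name(b)


print(ucsc_name("GRCh37.p5"))                      # hg19
print(same_coordinates("GRCh37", "GRCh37.p5"))     # True
print(same_coordinates("NCBI36", "GRCh37"))        # False
```

Coordinates from two different major releases need a lift-over step; coordinates from two patch levels of the same release can be compared directly, although the sequence inside a patched region may differ.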
So what's on chromosome six? MHC. So MHC is impossible to sequence, it's really a lot of repeats and so forth. So that one is a really tough one to get right. And so there's a lot of patch, they're working on it all the time, and some folks are really interested in MHC and they want to get it right, and so they're working on it. But that's an example of a thing that gets patched up all the time, every release. So we talked about this, or John talked about this earlier, so cancer data, so this is genomic information, but it's around clinical data about patients, about structured clinical data about the treatment, about the tumor, and we are trying to, in the ICGC and other projects, we're trying to capture that information and map that to a specific genome, and so that we have phenotype-genotype association. I mean that's the big goal here. And so what's the best way to represent all this? Well, we've been all the genome browsers, most genome browsers, and there are exceptions, and we'll talk about some of those later in the week, but we're really thinking about a linear scale. So we have a string of letters that are decorated with things, with annotations, right? And so what are annotations? Supplementary information? Yeah, that's one thing, yeah. What else? Non-information. Non-information? Okay. It's actually a classic qualifying exam. I always ask students on a PhD qualifying exam what an annotation is in bioinformatics. Okay. It's actually, let me short-circuit this one. It's actually an interpretation. It's actually your interpretation of what that nucleotide, that gene, that it's your opinion. You have supportive evidence, or you have weak evidence, you have strong evidence, but it's an interpretation as to what this nucleotide, that gene, that region, that arm, that protein, that pathway, and so forth are doing. 
It's our insight from the evidence we gathered, and we want to decorate the gene, because the idea is that once we've discovered and figured out what it does, we can tell the world. So we annotate genes, and we annotate proteins, so that it's done once and doesn't have to be done over and over again. But ideally we have evidence, and I think Gary will talk later about evidence codes and things like that for pathways and Gene Ontology and so forth. So all of that is our interpretation of what things are. So for mutation data, for cancer data, there's this great database called COSMIC, the Catalogue of Somatic Mutations in Cancer. There are something like 5,500 papers that have been curated in COSMIC. They've covered about 4.5 million mutations, and about 19,000 genes, and we said there were about 20,000. So basically in COSMIC, every gene is mutated. Are all genes responsible for cancer? Probably not. But of those 19,000, there are about 6,800 that have more than 100 mutations. So there's a lot of them that have very few mutations, but a lot of them have a lot of mutations. Or sorry, a small number have a lot of mutations. Yeah. So that's correct. So they are single nucleotide variations. They're different from the control, which would be another tissue from the same individual. So it makes them different. It's a polymorphism. It's a variation. It makes them a variation. You're right. It's not a mutation. Although it's a mutation database, so they look at cancer cells and cell lines and tissues, and they try to find everything that's different from wild type. So in that way it's a mutation. It's different from wild type. It's not a mutation in the Mendelian sense of the word, in that it's not necessarily inherited, it may not get passed on, and you may not have a phenotype. Does a mutation have a phenotype? Is a mutation without a phenotype a mutation? That's another qualifying exam question. I would say no too.
From a Mendelian point of view. That's a very Mendelian sort of logic. That's how he was able to see the peas. The yellow peas had a yellow mutation because he could see them. It's not because he could see the nucleotide change, but he could see a phenotype. So if we see a polymorphism, in Mendelian terms, we don't see any phenotype. The problem is that we see a phenotype and then we see a lot of changes. So there's a disconnect there. Actually, I think that's my next slide, or the one after. I'll get back to that point. So Stratton, Campbell and Futreal had this review in Nature where they explained what they thought was happening, and basically they introduced the concept of passenger and driver mutations. So there are some driver mutations which actually are responsible for the cancer, and one of the consequences of the driver mutations is to cause other mutations, which are side effects, basically, of the tumor, of the cancer. The driver mutations confer growth advantages on the cells carrying them and have been positively selected for during evolution of the cancer, and they reside by definition in the subset of genes known as cancer genes. Passenger mutations do not confer growth advantages, but happened to be present in the ancestor of the cancer cell when it acquired one of these drivers. So they may get modified, but they're not causing the changes. So they're passengers. I think in a way this is oversimplifying things quite a lot, but again, it's a useful way to separate the piles that we have to deal with. This first, top pile is a very small pile, and we like small piles. The bottom pile is very big, we don't like big piles, so we'll get rid of the big pile, so that's good. So it's really a useful way of thinking about it. I have till when, 12:30? Okay. I'll just stop at 12:30 now. I'll go on after lunch.
So the data that we collected, so we talked about the clinical data, tumor pathology, the age, gender, treatment, survival, and all of that kind of data is under controlled access that we talked about, so that basically if you go to the ICGC website, there's three buttons at the top. There's the cancer genome projects, there's controlled access, application form, and there's the DCC, the data coordinating center. So the form that you have to go fill out asks you where you work, asks you who your boss is, who can fire you, if you don't do things right, and he or she has to sign off that they will fire you, if you don't do things right. And you're saying that you will only use this data for improving, actually some, not ICG, either for the benefit of medical research or for even more restrictive for the benefit of cancer research. One we could argue that a lot of things are for cancer research, although they may not use the word cancer. I like to use that. A very broad mind, a very broad minded individual. So, other things which are, we talked about earlier today, is the germline data. So the SNPs, which are able to identify you, like John said, your fingerprint. If I have your full SNP, I not only know you, but I could also identify your children, I could identify your ethnic background, I could identify a lot of things. How much cigarette you smoked, but that's a separate issue. So, those first two there, although gender is not, is actually in the first bullet, gender is actually part of the open data. So actually we are letting, we're very generous, and we can tell you, if it's a male or female. So that's open. But everything else is controlled access. Germline data, so if we do, so what does that entail, germline data? It's like all the raw data, right? Your BAM files, all your sequencing reads, that's all controlled access data. Right? So you can build a full genome, and you can get, even if you're bad at it, you can get a lot of SNPs. Okay? 
So everything else. Somatic mutations are not identifiable. Can I identify that it's me? I can probably tell that somebody has cancer by looking at their mutations, but I can't say who that individual is. And I could probably figure out who you are if I asked a neighbor, does your neighbor have cancer? The neighbor probably knows that you have cancer. But still, that's not deemed identifiable. So somatic mutations are not identifiable. Copy number variations: currently not identifiable, but could become so. Maybe if we do enough work on copy number variation, we'll be able to identify people by copy number variation. Currently it's not identifiable. RNA abundance and splicing: open, but not the RNA reads. A lot of things that we do to measure RNA require the reads, the raw data from the RNA. That's identifiable, so that's not available. But the summary of your RNA, how many transcripts for each gene you have, is not identifiable. So that's open. I can download that, no problem. The problem is people aren't making it available. But if it was available, I'd be able to download it. RNA sequence, as I mentioned, is controlled access. DNA methylation will also be open when people start generating that data. There's still a lot of technology development that's needed there. This is actually an older slide than the one John showed, but it's the same idea. ICGC: lots of people working together and generating data. It's nothing that anybody can do by themselves. So 25,000 tumors, 25,000 controls, 50 different tumor types, and so forth. And right now a bit more than 50% of the projects have been committed. So people have said they can generate the data. This slide John showed, and this is the page for Canada and Australia. Australia and Canada are working together on pancreatic cancer in this project. You have all the details, and I think John's name has been cut off.
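The open versus controlled split walked through above can be summarized as a small lookup sketch. The tiers follow what is said in this talk; the key names, the `Access` enum and the `can_download` helper are hypothetical labels for illustration, not official ICGC field names or an actual ICGC API.

```python
from enum import Enum


class Access(Enum):
    OPEN = "open"
    CONTROLLED = "controlled"


# Tiers as described in this talk; key names are illustrative only.
ACCESS_TIER = {
    "gender": Access.OPEN,
    "somatic_mutations": Access.OPEN,
    "copy_number_variations": Access.OPEN,      # "currently not identifiable"
    "rna_abundance_summary": Access.OPEN,
    "dna_methylation": Access.OPEN,             # planned, once data is generated
    "clinical_details": Access.CONTROLLED,      # pathology, age, treatment, survival
    "germline_snps": Access.CONTROLLED,
    "dna_sequencing_reads": Access.CONTROLLED,  # BAM files and other raw data
    "rna_sequencing_reads": Access.CONTROLLED,
}


def can_download(data_type: str, has_approval: bool = False) -> bool:
    """Open data is available to anyone; controlled data needs an approved application."""
    tier = ACCESS_TIER[data_type]
    return tier is Access.OPEN or has_approval


print(can_download("somatic_mutations"))                      # True
print(can_download("dna_sequencing_reads"))                   # False
print(can_download("dna_sequencing_reads", has_approval=True))  # True
```

The dividing line in every case is whether the data is deemed able to identify an individual, which is why summaries are open while the underlying reads are not.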
It's on that page, if you look at the bottom there. And he's there. So, getting the data. We talked about COSMIC. COSMIC only has mutations. It just tells you what the mutations are, which tumor type, did the person die from this (I think they have that), the publication, which PubMed ID, what diagnostics were done on the tumor, and so forth. But it's all made open. The ICGC homepage has a link to the ICGC data, which is the DCC. DCC stands for data coordinating center. That's where all the data for the ICGC lives. TCGA home. Yesterday I could not get to this page because of the hurricane. All the computers were down on the NIH campus yesterday. But they're back up this morning. And somehow they managed, I don't know if they had the data somewhere else, but the data was up. So that was available. For looking at data, there's the UCSC Cancer Genome Browser, which I was going to spend some time on today, but I decided not to, because it's basically a lot about microarray and gene expression visualization. So we're going to leave that until tomorrow. The Cancer Genome Workbench is basically a derivative of the UCSC browser for cancer genome data. We'll look at UCSC, so we'll get back to that. And the Integrative Genomics Viewer, IGV, is a tool from the Broad, and the more I look at it, the more overwhelmed I am. We could do a whole week on this tool, and we're going to do half an hour. The thing about IGV is that it initially started as a gene expression, microarray type viewer, then it morphed into a next-gen sequencing viewer, then into a pathway viewer and integrator, and it's still doing all of that. So it's definitely worth the time to look at it, but we may not have all that time this week. We'll have some. This is an updated version of the slide that John had. You have to put in the new one with the right web page in the middle.
So the idea behind the ICGC DCC, which is hosted here at OICR, is that it uses the BioMart engine, which allows a federation of the various data centers. Ideally the ultimate goal, because we're still at a young stage in this project, is to have every genome center that's generating data host most of the data locally and then, through the magic of the internet, federate all the data across all the various centers. We'll have a quick look at that later.
Why do we do it that way? The idea is that you're looking at large data sets. When the first data sets of the 1000 Genomes Project, which was a few hundred genomes at the time, were being transferred between the EBI and the NCBI, because they like to have copies of each other's data, it basically plugged the transatlantic internet. They occupied all the bandwidth between the two sites. So we have to get more internet, more bandwidth, but we also have to think about the best way to share these large data sets. And there are definitely lots of large data sets involved with the ICGC: not only the sequence data, but there are images, and there will be lots of other clinical data and a lot of summary data and so forth.
So one way is to centralize everything. Centralizing is actually easier in a way from an engineering point of view. You have everything in one place, so you can figure out that it all fits in the box. Except then you have the problem of everybody sending their stuff, so that's one issue. The second way is to have things federated, so that everybody has copies of subsets, but then through a single user interface you share everything. The thing is that you get the best performance this way, but then you get the best flexibility, and the most important flexibility of the federated model is that when you add other nodes to the model, then it doesn't get slower;
it actually just keeps doing basically well. With this first model, if you add more and more nodes, you have a single pipe going into this one place, and it will actually slow things down. So that's the way BioMart, the system which is the back end of the DCC, operates. And it turns out that BioMart is actually used by a bunch of other databases, so it becomes even easier to integrate ICGC data with other data sets: Reactome, Ensembl, HapMap and so forth. And there are lots of other tools, like Bioconductor and R and Galaxy, which also talk to BioMart, so that makes it easier to integrate with other tools as well.
So if you go to the ICGC data portal, dcc.icgc.org, you're faced with basically a summary of the data that's currently available and the ability to do simple queries by just typing the gene name in the search box there, or to look into quick, flexible or advanced searches. The more serious and courageous you become, going from quick to flexible to advanced, the more options and parameters you're allowed to use.
That said, as of a few months ago, and I think it hasn't changed much since July, there is actually very little data that has been submitted to ICGC or as part of the ICGC family, and the biggest subset of data is actually from TCGA. Initially TCGA was sort of an observer of the ICGC activity, and now they are full active members of the ICGC. TCGA, which is the American Cancer Genome Atlas, is currently overwhelming the ICGC. I mean, they have so much data, so much more of it than all of ICGC put together, that to say TCGA and ICGC is basically saying the same thing; it's basically TCGA. But what we're talking about right now is the open access data. We talked about controlled access data, like the reads, the clinical information and so forth, and that we still have to resolve. The reason we have to resolve that is that the ICGC and the TCGA, which are one but separate, are now using
different authorities to validate who's allowed to look at the data. So we have two different sets of lawyers, if you like, that each think they're right, saying, "We are deciding." "No, no, we are deciding." So it's a bit of an impasse right now that we have to resolve soon: if I'm allowed to look at one data set by one group of lawyers slash ethicists slash bioethicists, that should automatically give me access to the other set. It doesn't work that way right now, but there might be some sort of compromise. All the ICGC data is actually being held at the European archive, which has its own model for access, and all the TCGA data is held in dbGaP, which is the American NIH archive held at NCBI but policed by Building One at NIH. So right now it's not transparent. We're in transition. We're going to resolve it, but it's not perfect. Right now you have to apply in both places to look at complementary data sets. And you should be able, although I'm not sure you're allowed to do that, to actually mix your data once you get approval on both sides, and put it on one computer for analysis and so forth. The thing is, ICGC and TCGA are doing large-scale analyses of all the data sets as well, and of course it would be great if one place had everything, right? But currently there's room for lots of discoveries here for you guys. After this workshop, go home, apply for access at ICGC, apply for access at TCGA, put all the data on your hard disk, crunch it all up and write papers. Except there's an embargo period too, so that's another caveat.
So the idea is that we want to liberate the data. We want to make the data free for everybody to use, and this is irrespective of controlled access or open access. One caveat on the ICGC data, pretty much the same for everybody, is that all the data is freely available to all even though group X has not written a paper yet, up
until they write. Say they have submitted 100 genomes, and they're going to do 500, right? Once they have 100, from then on you count one more year. So until it's 100 plus a year with still no paper, you're not allowed to write a paper about that 100 set. The idea is that we'll make the data available, but the global analysis we're going to leave to the people that generated the data. So we're making the genomes available. If you want to go look for your gene, or you want to look at a small set of a few genomes to validate something, you can do that, and you can reference that data and so forth. It's available for you to look at. But if you want to do an analysis of the whole 100 genomes... oops, is that me? Sorry, ignore that. If you want to do an analysis of the whole set of genomes, you have to wait one year. If I put 100 genomes on the FTP site and a year later I still haven't done my paper, you say, "Okay, I'll write a paper." Hopefully I will have written the paper, or John will have written the paper. You're writing the paper now, aren't you, John?
Yeah, yeah. So we're not taking any chances. All this is documented, not perfectly, but we're working on that too. It's a new project: lots of new rules and very important things to be aware of. Okay, yeah, so there's the TCGA ovarian paper, for example. The other thing that can short-circuit the whole thing is if I write a paper. If I write a paper right away, then the data is free for everyone to work with, right? So you have to look at each data set: has the paper been published? How long has the data been there?
So the math is a bit more complicated. I thought I would make it simple; I'm going to add another complexity. There are many clocks. One clock is: I deposited data, say I deposited 10 genomes, or I deposited 100 genomes, and then a year later, if I haven't published, you can publish about my 100 genomes. Even though I now have 150, the first 100 have been there for a year, so you can talk about the first 100. So it's almost like two years to write a paper; I should have written a paper, but I haven't. The other clock is that if I put in 10 genomes and two years go by, and I've only put in 10 for two years, I'll get my hands slapped by the director of ICGC, but then after two years, if I haven't written a paper with those 10 or 99 genomes, then they're fair game too. There's actually a paper about this, about data release, with the explanation of this whole embargo period and why. It's important to free the data, but it's also important to respect the scientists that generate the data and give them the opportunity to write the whole paper.
And he has a question. Yes? There was a recent, some kind of landmark decision about gene patents in the States. Are there any problems here? Not that I'm aware of. So what was the gene patent about? All of the breast cancer. What are you talking about?
So it did affect... I think it's a genational in my space, but I think it affected... so some people are choosing to ignore it. Actually, I'm not a lawyer, so I'd better not answer that. Yes? So this is the number of data sets. So it would be, let's say, pancreatic cancer at OICR, that's us. At that point there was actually a whole slide, so there was one sample where we had structural rearrangement and one for single mutation. There's more than that now. Is it a genome plot? Yes, it would be a full genome. So some of the gene expression here is not necessarily RNA-seq; that could be Affymetrix, so whole-genome analysis of gene expression. Okay, so it's 12:33, so we're going to break for one hour.
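The embargo clocks described before the questions can be sketched as a small date calculation. This is only one reading of the rules as spoken in the talk (a paper by the data producers frees the data at once; otherwise one year after the first 100 genomes are in; otherwise two years after the first deposit), not the official ICGC policy text. The function names, the constants and the shape of the `deposits` input are all assumptions made for illustration.

```python
from datetime import date
from typing import Optional, Sequence, Tuple

# Assumed values, taken from the numbers mentioned in the talk.
THRESHOLD_GENOMES = 100        # size of the set that starts the one-year clock
THRESHOLD_WAIT_YEARS = 1       # wait after the threshold set is complete
FIRST_DEPOSIT_WAIT_YEARS = 2   # wait after the very first deposit


def add_years(day: date, years: int) -> date:
    """Shift a date by whole years, mapping 29 February to 28 February."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        return day.replace(year=day.year + years, day=28)


def embargo_end(
    deposits: Sequence[Tuple[date, int]],
    paper_published: Optional[date] = None,
) -> date:
    """Return the earliest date on which others may publish a global analysis.

    deposits: (deposit date, number of genomes deposited that day) pairs.
    paper_published: date of the producers' own paper, if there is one.
    """
    if not deposits:
        raise ValueError("at least one deposit is needed")

    ordered = sorted(deposits)

    # Clock 2: two years after the first deposit, whatever the count.
    candidates = [add_years(ordered[0][0], FIRST_DEPOSIT_WAIT_YEARS)]

    # Clock 1: one year after the cumulative count reaches the threshold.
    total = 0
    for day, count in ordered:
        total += count
        if total >= THRESHOLD_GENOMES:
            candidates.append(add_years(day, THRESHOLD_WAIT_YEARS))
            break

    # Short circuit: the producers' own paper frees the data immediately.
    if paper_published is not None:
        candidates.append(paper_published)

    return min(candidates)


# Hypothetical deposit history: 60 genomes, then 50 more.
history = [(date(2010, 1, 15), 60), (date(2010, 6, 1), 50)]
print(embargo_end(history))
```

With this made-up history the 100-genome mark is reached on 1 June 2010, so the one-year clock ends before the two-year clock does. Each data set still needs to be checked against the project's actual data release policy.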