I would have you believe I did some wet work in my life, when I was 19 or 20 or something like that. It was just before DNA sequencing really got into gear, and I was part of the generation, I think, that was just in that transition: most of my scientific life I've lived with the genome, and a little bit of my scientific life I lived without a genome. And I'm very aware that modern scientists growing up just take it for granted. And that's a good thing. It's really good that current researchers don't worry about it; they just go off and use it in all sorts of different ways. There are problems with various metaphors spanning all the different types of experiments, but for biologists the genome really ends up being a kind of ultimate biological index. It may not actually tell you precisely what's going on, but it's a place to say: here on the genome is this molecule, or this molecule is controlled by this point on the genome. You may not understand everything that's going on, but here it is.
You do that by mapping where the molecules come from, sometimes biochemically, and sometimes by mapping phenotypes and things like that, using forward and reverse genetics and also the population-style genetics that Nancy just mentioned. As we go from that sequence, we build up a series of things that we hang off the genome, and these really are using the genome as the index: so we have a protein-coding gene structure that we perhaps understand by the RNA that it produces, we have variants that are across the genome, we have other things that we can measure. I describe all of these things together as genome annotation, and in fact, over the last 15 years of my life, this is what I've done. I've been trying, in effect, to draw boxes on screens that say this bit of the genome is involved, we think, in this molecule or this process or this aspect. And there's a theme that goes through all of these different areas. There's the business of generating high-quality data with good data standards, and, rather critically, it's very obvious, but all of that data has to be available publicly. We integrate that data, and we use the fact that the genome provides this index to make things non-redundant. And then we annotate, and it has to be stressed that I think we're really still very, very much at the start of the process of understanding what all the different molecules that come from the genome do and how they play together. And I think this science will continue. It's really the science of molecular biology, not the science of genomics, and the science of molecular biology will go on for at least another century. I will still be doing this until I am old and grey; I'm getting there. And there will be future generations still doing this. And on the way, I think, we'll be making lots and lots of other discoveries about ourselves and about other animals that will help us understand, for example, disease.
So this is a point where you have the pleasure of looking backwards, and here are some rather dapper men who I know, so that's John and Bob. And this is the exciting bit here, the cutting that says this will be delivered five years ahead of schedule. Those were great times. And it was about at this point, when I was a young graduate student working with Richard Durbin, that there was this problem. And the problem was in this green stuff. There was this rather stately process going along in the genome project about producing very high quality data, and then suddenly everybody just let everything loose for a period. And people said, my gosh, we can't do it. We can't process the information, understand the information, at a matching rate. We need to understand what molecules, what genes, are part of these sequences at a faster rate. And so they basically had to shift a lot more into computers. And these are genuine slides that I presented back in 2001, and I remember, in fact, these things. If you can remember those days, the genome was in a bit of a mess. It was a bit of a headache. It used to drive the old-style single-gene geneticists up the wall, because there were like thousands of pieces on the floor and you were trying to put them back together again. And we had a variety of ways of sorting out how to make the draft genome sequence more consistent with gene structures, and actually predict gene structures inside of that. And it was also the start of, I think, more extensive or more thoughtful processes about delivering this information to a variety of scientists. And this is a very, very early screenshot of this system called Ensembl that I was involved in founding. Paul Flicek, my colleague, now runs this. And in fact, you can really see that parts of these pages have changed a lot, but other parts, which is this business of drawing boxes, have not changed a great deal.
And there's a wonderful piece of history here, because before we had these data sets, we actually didn't know how many protein-coding genes there were in the human genome. And it was an interesting period, because I grew up in a scientific community where you read a book and it said there are 100,000 protein-coding genes in the human genome, and you said, sure, that's what I hear, that's what I'm going to think is the right thing. And as this data came out, this was when chromosome 22 from the Sanger Institute came out, and chromosome 21, which came out from the consortium that Stylianos and the Japanese groups were both involved in. And it was very interesting. They're very different chromosomes; they have very different gene densities. And at the time, you can see the number of genes extrapolated from these two: 50,000 from chromosome 22 and 20,000 from chromosome 21. Now, it's really interesting, this number was considered to be really low. It was jaw-droppingly low. I can remember sitting in the room and having people going, no, it can't be that the human genome has only 50,000 protein-coding genes. And in Ensembl, we had at that point found confident evidence for 38,000 protein-coding genes; we'll come on to this in a moment. And so we were even more embarrassed; we were getting even more complaints about us. And the consensus opinion was that chromosome 21 was a very gene-poor chromosome, and that one simply couldn't do the naive piece of scaling up from chromosome 21. And I think I actually met many genomicists for the first time when I was at Cold Spring Harbor and presented, I think, that slide. And I held up a book and I said, we're going to bet on this. And this year, I'll charge you $1; next year, I'll charge you $5; and the year after that, I'll charge you $20, because our information will get better. And you buy a number, it's a sweepstake, and please bet.
And this is the distribution of bets over those three years. Now remember, at Cold Spring Harbor, these are genomicists. They all should know what they're doing, and they were at the top of their game. All right? The real number is on the left-hand side of this. Now, the conclusion I draw from this is that when you get a crowd of very intelligent people and they have no data, they don't make good estimates. When you give very intelligent people data, they make good estimates. And the true answer, by the way, is still closer to 21,000. And then you get a very long argument about small ORFs, and that argument can go on for an incredibly long time. But it's not going to go north of 22,000 or 23,000. And in fact, we were so confident about that in 2003, which was the end of this betting process, that we were able to split the money between the three people who bet the lowest. And the reason why is that, in fact, a Frenchman, Hugues Roest Crollius, had made the most outrageous statement back in 2001 that there were only 26,000 protein-coding genes. Everybody thought that was outlandish, and so he bet his number, which is great, and then two other people sneakily bet just under him. So I thought it was a bit unfair just to give it to the lowest bettor in that situation, because Hugues had really anchored the process at the bottom. So that was about genes. And this story of generating information, integrating it and annotating it, you can tell about variants as well, and other people, Nancy just now, have talked already about how bringing very old pieces of quantitative genetics back into the modern age is incredibly empowering. But I'm going to talk about this other thing, the other stuff in the human genome. And this is a variety of projects, again integration and annotation, and it's about this: outside of well-understood protein-coding genes, what are the things that are switching things on and off?
And again, lots and lots of elegant molecular biology on individual loci has said, of course there are things that are switching genes on and off in different places, but we don't have a sense, we can't draw boxes, we can't say: ah, this base, this switch, this cell type. So for the last 10 years, and thankfully, not thankfully, but I'm no longer quite so in bed with this, I spent a lot of my time on this project called ENCODE. So I'm going to give you an incredibly short whistle-stop tour of ENCODE. Just to get your head around ENCODE, there are three major axes: the different experiments that we did, the different cell types that we did those experiments in, and of course the genome. There were 164, I'm sorry you can't read that properly, 164 different types of experiments in the paper that came out last year, and most of them are different chromatin immunoprecipitation experiments, ChIP experiments, on different transcription factors. 182 different cell lines or tissues; there's quite a lot of primary tissue, but there are six key cell lines that we use. Three of them are classic biological laboratory workhorses: HeLa, HepG2 and K562. Three of them are normal karyotype cells. One is GM12878, the best-known genome on the planet, because she is the daughter in the 1000 Genomes trio, and that's a lymphoblastoid cell line. Another is H1 ESC, a stem cell line, and one of the reasons that was chosen is that it was one of the stem cell lines approved in the Bush era. And the third is primary cells from umbilical cords, HUVECs, and they do get a couple of doublings in the lab, but not many. And so the majority of these assays are done over these six cell lines, and then there are about 20 assays done over this whole range, and this would go up somewhere into the ceiling, and this would continue somewhere over here. And it's actually a modest-sized experiment these days.
Cancer projects are much bigger data-wise than ENCODE, but what ENCODE is, is complicated: it's very high-dimensional, because each experiment realistically is another dimension, and each cell line, or each experiment crossed with cell line, is another dimension on the genome. So it's a very high-dimensional data set to work inside of. And this is a consortium effort. I'm just one of the members of this consortium here, and there's a whole bunch of other investigators, I think some of whom are in the audience; 410 authors on the main paper, it's one of these crazy things. But I'm going to come back to these guys. These are the lead postdocs in the consortium, and there's a very simple statement we make. We say: we generated high-quality data, full stop, and then we move on. And in fact that simple sentence hides an incredible process of generating this data at scale. So if you want to know about running more ChIP-seq than you've ever seen before, Flo Pauli from HudsonAlpha is your person; if you want to know more about RNA extraction and separation, that's Carrie Davis at Cold Spring Harbor; if you want to know how to process an insane amount of histone modifications, that would be Chuck Epstein at the Broad. Now, I can't do justice to all of ENCODE here, and I'd encourage you to read the paper, but I'm going to take you on a little tour. The names here are going to change, by the way, so keep track of those names. There are really two ways of looking at this: one where we look at each experiment by itself, and one where we join experiments up. This is an example of one experiment: a DNase I experiment done at UW by John Stamatoyannopoulos's group, and you can see that we get enrichments across the genome.
This is a very old piece of biochemistry known from the 1970s, and we've taken that piece of biochemistry, or John has, and rather than using a classic Southern-blot-style approach, he now puts it onto a next-generation sequencer, and rather than reading off bands on a gel, you now read enrichments across the genome. You can see here that we have peaks; these are two biological replicates, and over here those two peaks look very credible, but look at here: here's one of the peaks on this side, and there's nothing here. So the question is, do you believe this thing here? Now this is a statistical problem. If you're a statistician in the audience, you'd say, that's a variance problem. If I can get a good grip on the mean and the variance, give me about 30 replicates and I'll nail it for you. And if you're a biologist, you'll say, you must be joking. Have you got any idea how complicated it is to grow up 10 million cells for each of these experiments? Plus, if we did 30 replicates for every experiment, I would divide my budget by 30, and we'd do 30-fold fewer assays. And so you could probably argue the statistician down to about five replicates, but most statisticians dig their heels in somewhere between four and seven. But thankfully, we had a very gifted non-parametric statistics group led by Peter Bickel, and we were able to offer a piece of biological intuition that that group then turned into hardcore non-parametric statistics. And the intuition is that we have a whole genome's worth of this. So when we're assessing this little case here, we're doing it in the context of what we've seen elsewhere. And so what we're plotting here is the log of signal for one of these experiments, for one replicate on the x-axis and for the other replicate on the y-axis. And at the top of the signal space, it's correlated: if you're high in replicate one, you're high in replicate two.
But as you go along, you get points like this: relatively high in one replicate, but really quite low in the other. And the intuition is that we have a signal portion which is correlated between the replicates and a noise portion which is not. And once you make that assumption, you can write down a piece of non-parametric statistics that will draw a confidence boundary between the noise and the signal, and we've coloured that here in red and black. Peter calls this the irreproducible discovery rate, and it's a piece of statistical innovation that's been driven by this piece of biological data. Now, from this, you can do a rather simple thing, which is just to sum up the number of bases covered by all the different experiments that we've done. And that total sum, when you do it over all experiments and all cell types, comes out at 80%. But this includes RNA and histone modifications, and we know that we surveyed RNAs and histone modifications involved in transcriptional processes. So I try to emphasise that you shouldn't be so surprised by this 80% number, because, in fact, the genome is full of a lot of genes and a lot of introns. However, I was myself much more surprised at these middle numbers here. For DNase I, we get 15% of the genome covered by DNase I hypersensitive sites, and 8% of the genome covered by transcription factor ChIP-seq peaks. Now, a sceptic might well say: in these cases you're only measuring an enrichment; there's actually a very small proportion of bases that are triggering that signal. But we can very often find those bases. For DNase I, there's an approach called DNase I footprinting; it's the same thing you did with Southerns, but you do it with next-generation sequencing. For ChIP-seq, you can find the bound motifs, which are the specific base pairs here.
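The replicate intuition behind IDR, a correlated signal component sitting on top of an uncorrelated noise component, can be sketched with a toy simulation. This is not Bickel's actual IDR estimator; the distributions, sizes and cutoffs below are illustrative assumptions chosen only to show that rank agreement between replicates is high at the top of the signal range and decays in the noise.

```python
import random

random.seed(0)

# Toy model of two replicates: "signal" peaks share a common underlying
# strength (so they are correlated), "noise" peaks are independent draws.
n_signal, n_noise = 300, 700
signal_base = [random.lognormvariate(3.0, 0.5) for _ in range(n_signal)]

def replicate():
    sig = [b * random.lognormvariate(0.0, 0.1) for b in signal_base]
    noise = [random.lognormvariate(1.0, 0.5) for _ in range(n_noise)]
    return sig + noise

rep1, rep2 = replicate(), replicate()

def top(values, k):
    """Indices of the k highest-signal peaks."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    return set(order[:k])

# Reproducibility: how often is a top-k peak in one replicate also
# top-k in the other? High near the top (signal), lower in the tail.
for k in (300, 600):
    frac = len(top(rep1, k) & top(rep2, k)) / k
    print(f"top-{k} overlap fraction: {frac:.2f}")
```

The real IDR method fits this two-component idea non-parametrically on the rank scale, rather than thresholding a simulation like this.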
Again, this is quite a high amount of the genome, and cumulatively with the exons, the coding and non-coding exons, it comes up to 9% of the genome. Of course, ENCODE didn't look at every transcription factor or every cell type; there are a lot of different cell types. What we could do, though, is estimate to what extent we have seen all the elements. So this is plotting how many DNase I elements we've seen with 1 cell line, 2, 3, 4, 5, all the way up to 60 or so, and it's non-redundant. So at the start we accumulate more unique elements, and you can see it starts bending, but it doesn't flatten out. Our most aggressive fit suggests that we have seen 50% of the elements so far. And we know that's going to be an overestimate of how well we're seeing elements, because we know that there are inaccessible cell types, and transcription factor classes that we aren't so good at assaying. So a lot of the genome, the vast majority of the genome, is close to a biochemical event. The second statistic, I think, is the more important one: a lot of it is really close even to one of these bound motifs or footprints. In other words, there are transcription factors, specific DNA-protein contacts, across all of the genome at some place or another. Now, we've got a long road ahead of us to understand how these things are used, and if they're used at all, in different mechanisms, in different diseases, in cancer, in different scenarios. But at least we have a catalogue. We have a place to start to ask all of these questions. And Nancy was talking about the number of GWAS hits that lie outside of coding regions. And although I wasn't here, I'm guessing that David was talking about the amount of selection on non-coding DNA in the stickleback. So there are many other stories about ENCODE which I can't do justice to here.
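The "non-redundant" bookkeeping behind both the 80% figure and this accumulation curve is, at bottom, an interval-union computation: overlapping elements from different experiments or cell lines must only be counted once. A minimal sketch, with made-up toy intervals:

```python
def union_coverage(intervals):
    """Total bases covered by a set of (start, end) intervals,
    counting overlapping regions only once (non-redundant)."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:   # new disjoint block
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                    # overlap: extend block
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered

# Toy example: three peaks, two of them overlapping.
peaks = [(100, 200), (150, 250), (400, 450)]
print(union_coverage(peaks))   # 200 bases: [100,250) plus [400,450)
```

Divide that union by the genome length and you get the coverage percentages quoted above; add cell lines one at a time and you get the saturation curve.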
So do, if you can, get Mike Snyder to give his talk; it's really wonderful, about transcription factor co-associations. Roderic Guigó on splicing-histone interactions. You can learn more than you ever wanted to know about RNA from Tom Gingeras. DNase I footprinting is a beautiful set of experiments by John Stamatoyannopoulos, and DNA methylation by Rick Myers. I want to just jump into one case. This is a case where we took a step back from the data, and now, rather than looking at it experiment by experiment, we tried to jointly analyse all of these experiments together. And we used two rather classical machine learning techniques: hidden Markov models and a dynamic Bayesian network. What in effect they do is colour the genome. The good thing about these techniques is that we don't tell the method how to colour the genome. We don't say, please learn about genomes. We just give it the data: please organise yourself however you see fit. And then we compare that machine learning to well-understood annotation. And in fact it's rather reassuring that the strongest signal is about promoters, about transcription start sites. There are some unexpected cases: there's a chromatin state that sits over three-prime ends, over poly-A addition sites; that was really unexpected. But I'll dive into these things, which I describe as reassuringly interesting. And these are colours which, when we compared them to annotation, were not close to genes. They were not at transcription start sites. But they did have activating chromatin marks, like H3K27 acetylation and H3K4 monomethylation, which is a classic enhancer signal. We thought that these were putative enhancers. And so we took a random sample of them. Remember that we had not trained at all on these things. And then we did both cell line experiments, we did mouse experiments, but we also did this, which is my favourite: fish. And this is a medaka fish. There's his head; his body is going around here.
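To make the "colouring the genome" idea concrete, here is a minimal hidden Markov model decoding, with two hypothetical states (background vs enhancer-like chromatin) and a single binary mark per genome bin. The probabilities are invented for illustration; the real segmentations (ChromHMM, Segway) learn many states over many marks, unsupervised, rather than using hand-set parameters like these.

```python
import math

# Toy two-state HMM "colouring" of genome bins: state 0 = background,
# state 1 = enhancer-like chromatin (hypothetical parameters).
states = (0, 1)
start_p = [0.9, 0.1]
trans_p = [[0.95, 0.05], [0.10, 0.90]]
# P(mark observed | state) for one binary mark (e.g. H3K4me1 presence).
emit_p = [[0.9, 0.1],   # background: mark mostly absent
          [0.2, 0.8]]   # enhancer-like: mark mostly present

def viterbi(obs):
    """Most likely state path for a sequence of 0/1 observations."""
    V = [[math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states]]
    back = []
    for o in obs[1:]:
        row, ptr = [], []
        for s in states:
            best_prev = max(states, key=lambda p: V[-1][p] + math.log(trans_p[p][s]))
            row.append(V[-1][best_prev] + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s][o]))
            ptr.append(best_prev)
        V.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

bins = [0, 0, 1, 1, 1, 0, 0, 0]   # mark presence along a stretch of genome
print(viterbi(bins))               # → [0, 0, 1, 1, 1, 0, 0, 0]
```

The point of the exercise: the model assigns the run of marked bins to its own state, i.e. it draws a coloured box on the genome, without ever being told what an enhancer is.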
This is his heart pumping blood around the yolk sac, and each one of these little dots is a blood cell; blood cells in fish are nucleated. And in this medaka fish we have a piece of human DNA, an element predicted by ENCODE to be an enhancer, coupled to green fluorescent protein. And so each one of these dots is a piece of human DNA driving green fluorescent protein, in this case in the red blood cells of this medaka fish. And we get a good result: about 50% of the time in these different assays we see a specific enhancer activity. I have not been tracking time. Okay, cool. So this, I think, is for me one of the nicest results that came out of ENCODE. And I think you've heard already from a variety of people about the incredible progress of genome-wide association. Even now, I was reading the index of Nature Genetics and seeing yet another four GWAS studies. And I think it's great. It's an industry, but every human disease deserves a good, thorough genome-wide association study to nail down all these regions. But as Nancy says, very often these hits are found in the non-coding parts of the genome. Now, what one assumes is happening in a genome-wide association is in fact a very simple statistical technique, where one is typing SNPs across the genome. But you don't have to type every SNP, because there's a correlation structure in the way DNA is inherited in populations, which means you only need to type one in 10 SNPs, or one in 100 SNPs, to effectively capture all the correlation. So there is a SNP that is reported, but one assumes that it's tagging a SNP that's in some functional region. And we have a catalogue of these reported SNPs, statistically associated with phenotypes. So, this plot: when I first saw this plot, I remember getting really, really excited, and then I got really, really sad, because I realised that there must be a mistake, and then I got excited again when I realised that there wasn't a mistake.
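The tagging idea rests on one quantity: the squared correlation (r²) between two SNPs' genotypes across a population. If r² is near 1, typing one SNP effectively types the other. A self-contained sketch with made-up genotype vectors (0/1/2 allele counts per individual):

```python
import statistics

def r_squared(snp_a, snp_b):
    """Squared Pearson correlation between two SNPs' allele counts
    (0/1/2 per individual) - the usual LD measure for tagging."""
    n = len(snp_a)
    ma, mb = statistics.fmean(snp_a), statistics.fmean(snp_b)
    cov = sum((a - ma) * (b - mb) for a, b in zip(snp_a, snp_b)) / n
    va = sum((a - ma) ** 2 for a in snp_a) / n
    vb = sum((b - mb) ** 2 for b in snp_b) / n
    return cov * cov / (va * vb)

tag    = [0, 0, 1, 1, 2, 2, 0, 1]
target = [0, 0, 1, 1, 2, 2, 0, 1]   # perfectly tagged
print(r_squared(tag, target))        # 1.0
```

This is exactly why a reported GWAS SNP need not be the functional one: any SNP in high r² with it would produce the same association signal.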
So let me talk you through the peak and trough of this. The excitement here is that these are different ENCODE annotations, DNase I hypersensitive sites or transcription factor binding sites, and this, in red, is the overlap with the GWAS catalogue organised by NHGRI, Teri Manolio and colleagues. And these two blue bars on the right-hand side are background SNPs, from the 1000 Genomes and from the 69 complete genomes. And you can see the red is way above the blue. And so you think: yes, ENCODE is annotating GWAS SNPs. And then you think: oh no, no, no. This is all wrong, we have a bug, because all those GWAS SNPs must be the tag SNPs. They shouldn't be the functional SNPs. We have a problem in our program. And so four different groups inside ENCODE went on about a nine-month bug hunt to explain why we were seeing this. And the first, and perhaps most curious and interesting, thing is when we look in particular at the older genotyping arrays, the very early Affymetrix and Illumina arrays. In fact, there's about a 1.2- or 1.3-fold enrichment for SNPs in functional regions. Now, I describe this a bit like Monopoly. If you've ever played Monopoly: you draw the Community Chest card and it says, bank error in your favour, please collect $10. This is design bias in your favour. There were more functional SNPs on your array than you previously thought. If we could teleport ourselves back to 2003 and say to the chip design groups, you know what, you can design a chip that has all the LD properties you want, and we're going to enrich it 1.3-fold for functional variants, they would have said, brilliant, yes, that's the chip I want. But they did that unwittingly. And understanding how they did it is kind of interesting. Some of it was deliberate: they did try to get SNPs closer to promoters.
That doesn't explain all the signal, and we think, or I think, that the multiple rounds of PCR optimisation were implicitly selecting for open chromatin, because open chromatin extracts better, and therefore the PCR works better, and therefore the SNPs inside open chromatin end up on your arrays more often. When we stir in as much as we can to explain away this difference, that's this matched null distribution here, we get close to this red bar. So this is distance to the promoter, allele frequencies, inside of introns, intergenic; we cut the genome up in five different ways, in five different directions, to deal with gene density or conservation density or SNP density. All of those cannot explain away this signal. So we end up saying there are a significant number of functional SNPs in this GWAS catalogue. You might say to yourself, well, that doesn't sound so interesting: I knew there must be some, I just didn't know which ones they were. This is not a very interesting fact. But the great thing is that we can now take that signal and break it down along two axes. The first axis here is ENCODE data types, different transcription factors in different cell types; that's in green, and it would keep going all the way around here. And then this is in fact DNase I, and this would go even further, but I'm just showing you a subset. And down the left-hand side are different phenotypes or diseases: height, SLE, Crohn's disease, and so on. I'm just going to zoom in here. So this is a subset of that table, which is itself a subset of the bigger thing. And at the bottom here is Crohn's disease. Now let me take you over here to Crohn's disease on this side. Out of the 20 loci associated with Crohn's disease that overlap with TFs, nine of them are active in T helper cells. Now, that nine out of 20 is highly significant. These are very rare data sets in the genome; the genome is a very big place.
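The "matched null" idea, comparing the GWAS set against background SNPs drawn from the same covariate bins rather than against the genome at large, can be sketched as follows. All the numbers, bin names and the 40% / 10% functional rates are invented for illustration; ENCODE's actual matching used covariates like distance to TSS, allele frequency, and gene/conservation/SNP density.

```python
import random
from collections import defaultdict

random.seed(0)

# A SNP here is a dict with covariate bins and a flag for whether it
# overlaps a functional element (all values are illustrative).
def make_snp(maf_bin, tss_bin, functional):
    return {"maf_bin": maf_bin, "tss_bin": tss_bin, "functional": functional}

def bin_key(snp):
    return (snp["maf_bin"], snp["tss_bin"])

def matched_null(test_snps, pool):
    """Draw one background SNP per test SNP from the same covariate bin,
    so the null shares the test set's frequency / promoter-distance mix."""
    by_bin = defaultdict(list)
    for snp in pool:
        by_bin[bin_key(snp)].append(snp)
    return [random.choice(by_bin[bin_key(s)]) for s in test_snps]

def functional_fraction(snps):
    return sum(s["functional"] for s in snps) / len(snps)

# Toy data: GWAS SNPs sit near promoters more often than random SNPs do.
gwas = [make_snp("common", "near", True) for _ in range(30)] + \
       [make_snp("common", "far", False) for _ in range(10)]
pool = [make_snp("common", "near", random.random() < 0.4) for _ in range(500)] + \
       [make_snp("common", "far", random.random() < 0.1) for _ in range(500)]

null = matched_null(gwas, pool)
print(functional_fraction(gwas), round(functional_fraction(null), 2))
```

If the GWAS set still overlaps functional elements more than a null matched this way, as it did, the excess cannot be explained by those covariates.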
And you can't argue when it goes over a chi-squared threshold. But this is not surprising: Crohn's disease, we've known, is an autoimmune disease, an awful gut disease, and we've known it involves T helper cells for a long time. However, this one here is much more surprising: Crohn's disease associated, in this case, with GATA2, a transcription factor. This makes some sense; the GATA2 transcription factor is involved in blood differentiation. But Crohn's disease biologists didn't have the GATA family on their list of things to look at for the molecular aetiology of this disease. And each one of these green and red squares is a hypothesis that links a particular transcription factor to a particular disease, or a particular disease to a particular cell type. Many of them, I think, are previously known and understood, but quite a few are novel and interesting. And this, by the way, is a classic locus I just illustrated. It's actually in a gene desert, a notorious region because of the different immune-related SNPs there. And here's a SNP that lines up bang in this region, that's active in T helper cells and has a GATA2 motif. Now, I won't go into this. We did ENCODE not for Nature papers, not for me to give presentations, not for anything else, but for the rest of the community to use it. And there are all sorts of different ways to use it. For the raw data, there's this wonderful, great, very Californian way of looking at the genome here, and I think it's beautiful. But there is also this sophisticated, even more beautiful European way of looking at the genome. And if you like your Pinots Californian-style, I understand that; I like that too. But occasionally I like to go back to the old country and have a classy Bordeaux or something like that. And if you want to feel the true richness of a European experience, come here.
All joking aside, there is friendly competition with our wonderful colleagues at UCSC about trying to deliver this complexity of information to you. And if you haven't met this thing, it can seriously make your life easier. This is called the Variant Effect Predictor, and these regulatory elements are now integrated into it. If you just present this tool with the SNPs that you have generated, the variants that come from your exomes, you do not have to keep up to date with all this regulatory information; Ensembl will do the heavy lifting for you. So, very quickly, I just wanted to mention one thing, because it is about DNA. My colleague Nick Goldman is a mathematician, and, this is Brits and pubs, I don't quite know how it happened, but we were over a beer, and Nick said: at some point, all the data we're going to store is going to be DNA sequence, and people are thinking about very cost-effective, low-electricity digital storage devices. And we said to ourselves, wait a second, we know a zero-electricity nano digital storage device. It's called DNA. And so, in fact, we created a scheme for using DNA as a hard disk. I won't go into the details here, but one gram's worth of DNA can robustly store two petabytes of information. You could fit the world's information, in zettabytes, on this stage in DNA sequence, with all the redundancy required to make that work. And we did a little bit of economic modelling, and the take-home message is that if you wanted to store something for over 500 years, or over a thousand years, we believe it is cost-effective now to store it as DNA. So if there is somebody here from the Smithsonian, or the Library of Congress, who would like the information of the Library of Congress robustly stored for a thousand years, I have a solution for you. Do read about it. We had a lot of fun doing this, and it's sort of just this side of crazy, is the way I describe it.
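One of the details being skipped over is how you write bits into DNA without producing runs of the same base, which sequencers misread. A simplified sketch of the rotating-code idea from the Goldman scheme: write each byte in base 3, then map each trit to whichever of the three bases differs from the previous one. The real scheme also adds Huffman compression, addressing and fourfold overlapping redundancy, none of which is shown here.

```python
# For each previous base, the three bases that may legally follow it;
# a trit (0, 1 or 2) selects one, so no base ever repeats.
NEXT = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}

def to_trits(data: bytes):
    for byte in data:
        for _ in range(6):        # 3^6 = 729 > 255, so 6 trits per byte
            yield byte % 3
            byte //= 3

def encode(data: bytes) -> str:
    prev, out = "A", []
    for trit in to_trits(data):
        prev = NEXT[prev][trit]   # pick a base different from the last
        out.append(prev)
    return "".join(out)

def decode(dna: str) -> bytes:
    prev, trits = "A", []
    for base in dna:
        trits.append(NEXT[prev].index(base))
        prev = base
    out = bytearray()
    for i in range(0, len(trits), 6):
        byte = 0
        for t in reversed(trits[i:i + 6]):
            byte = byte * 3 + t
        out.append(byte)
    return bytes(out)

msg = b"DNA"
dna = encode(msg)
assert decode(dna) == msg
assert all(a != b for a, b in zip(dna, dna[1:]))   # no repeated base
print(dna)
```

Round-tripping any byte string through encode/decode recovers it exactly, and the output sequence never contains a homopolymer by construction.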
So, I've talked mainly about ENCODE, and these are the 410 authors of ENCODE. My name is in here somewhere. But in fact, there are a couple of people that really should be picked out. Ian Dunham worked with me at the EBI, and Anshul Kundaje worked with a variety of people, at Stanford with Mike and with Serafim. And let me just re-stress the data production leads. I think people like me get a disproportionate amount of credit for this project; these people do not. So if anybody says, I was a data production person on ENCODE, say thank you very much; I hear they don't get thanked enough. And this is the group: Shelley, Patrick, Carrie, Francis, Chuck, Seth, Jen, Vichy, Raj, Janet, Ryan, Stephen, Bumkey, Flo, Kate, Peter, Alexis, Marta, Noam, Jeremy, Lingon and Nathan. And it was all funded by NHGRI. So thank you very much. Questions? Nope. Okay, so we'll take a break now, and we will start again at 3.25.