Move to our next speaker: Sean Eddy of the Howard Hughes Medical Institute's Janelia Farm is here to talk about reading genomes bit by bit. Sean. So it's an honor to be here, and it's been a great ride throughout this project. I entered graduate school when Renato Dulbecco wrote that article about the human genome project and why we should do it, and I've been able to see this from the beginning. I want to convey to you some of the excitement that I feel as we get to look at all of this stuff and immerse ourselves in the A's, C's, G's and T's, and I also want to talk a little bit about some of the biology besides medicine and besides the human genome. I'm also going to talk about some geeky things about computers and software, but in the biology parts I'm going to try to explain two things. I'm going to try to explain why we sequenced not only the human genome but a bunch of other things, including lots of flies and lots of worms (not just melanogaster but all around melanogaster), and also things like why we sequenced this particular single-celled pond protozoan and other things like it; I'll have a little bit about that. Where I'm going to start, though, is at the end of the day, when we look at a DNA sequence. At least if you're a computational geek, you look at the sequence, you see a bunch of A's, C's, G's and T's, and you say: this is a symbolic sequence. We know a lot about cracking codes by both experiment and statistical analysis. Figuring out the meaning of apparently impenetrable languages is something humans have done for a long time, and it's a great intellectual thing to do. Puzzles are based on this; it gets very addictive. One of the great examples of this is described in a book called The Decipherment of Linear B, by John Chadwick, writing about his collaborator Michael Ventris, an architect. This is an apropos tale.
It is 1953 in Cambridge. One code, the structure of DNA, was famously cracked in the pub, and another code was cracked at about the same time by Ventris. This is Linear B, a script that had been found on Mycenaean clay tablets and was known only as symbols. It's one of the few examples where we knew nothing about the language; it had to be cracked by direct statistical attack. That attack was comparative sequence analysis: the idea was to look for statistical regularities between different tablets, and as far as I'm concerned this is one of the early successes of comparative sequence analysis. In this case Linear B turns out to be an alternative script for Greek, so the problem becomes somewhat easier than the one we're faced with. But we have a lot of examples, and it's still fairly mind-blowing to me to go to our disks. What you're seeing on the left, unreadably, is a listing of the top level of the disks at Janelia Farm where we keep our genomes. It's like some pre-Victorian phylogenetic tree, because you can sort of see algae, amoebae, amphibians, archaea, bacteria, in an order that percolates up and down depending on what my laboratory and I and my wife are working on at any given time. And it's remarkable to me that sitting in these directories is the actual source code to all these wonderful creatures we see. There's a lot of talk about how much data we're talking about. It is a fair amount of data, but it's not, at least at the moment, completely mind-blowing. For my lab, that disk I just showed you is about 450 gigabytes, and we have most of the genome sequences that are available. If we were in the business of manipulating human data, if we were taking raw images off an Illumina machine, that's a lot of data per human genome, and it's difficult to store, difficult to ship down the internet. But once it gets to an assembly, it's not so bad to store per year, and it's not so bad to transmit down the internet.
And as Eric just mentioned, there's proof of principle that we can store human genomes as differences. So we know, sort of intellectually, sort of in principle, that we can handle the genome data for the next couple of years, but it's non-trivial. The 1000 Genomes Project has generated five terabytes in its pilot project. That's not a big deal; we have a petabyte of spinning disk at Janelia, and we could store that if we needed to. The NCBI Short Read Archive is starting to fall over; it's going to be approaching a petabyte soon. But these are volumes of data that right now are not too different from, you know, your iTunes collection. Or, I've got this up: this is sort of like how the Vikings used to drink beer from the skulls of their enemies. My coffee coaster is the Celera genome sequence. When we look at this sequence, this is one of the few genome sequences that fits on a slide. This is ΦX174, sequenced by Sanger and Coulson back in the 70s. This sequence, this little bacterial virus, is actually quite interpretable, because of course there are statistical regularities. There's an ATG start. There's one of the three stop codons. There's an open reading frame. We can walk through a genome like this because, like most bacterial and phage genomes, it tends to be packed with open reading frames, and we can do a pretty good job of interpreting it. There's interesting stuff going on in the gaps, and sometimes there's an interesting little RNA; I'm quite interested in little RNAs that have functions, as Eric was also talking about. Once we get into the human genome and the big vertebrate genomes, we have a bit of a problem. Only one, two, three percent of it is coding, depending on who you're talking to. And one of the most important signals we can use is sequence conservation: we line up a bunch of genomes and we say this set of bases is essentially the same across some clade of evolution.
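The "line up a bunch of genomes and ask which bases are essentially the same" idea can be reduced to a toy per-column conservation track. This is only a sketch of the concept: real tools like phastCons fit phylogenetic models, while this just scores each alignment column by the fraction of sequences agreeing with the most common base.

```python
# Toy conservation track: for each column of a gap-free alignment,
# report the fraction of sequences matching the majority base.
# Real conservation scoring (e.g. phastCons) uses phylogenetic models;
# this illustrates only the basic intuition.
def conservation_track(alignment):
    ncol = len(alignment[0])
    track = []
    for i in range(ncol):
        column = [seq[i] for seq in alignment]
        most_common = max(set(column), key=column.count)
        track.append(column.count(most_common) / len(column))
    return track
```

Plotted along a genome, a track like this spikes where columns are identical across the clade, which is the crude version of the blue conservation plots described next.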
Really key in our ability to do this kind of analysis with the human genome was the development of genome browsers by Jim Kent and David Haussler, and the Ensembl browser by Ewan Birney; the ability to align large quantities of genome sequence, which has been done by Webb Miller's group, among others; and then the ability to take those genome alignments and calculate some simple statistic of how conserved the various bits are, generating these plots that you can sort of see in blue. They spike, of course, on the exons, in this case of a single gene, the human p53 gene, and they also spike on some other areas of non-coding conservation. Adam Siepel's program, phastCons, underlies a lot of the calculations people use for getting a quick look at this conservation. Those little spikes outside exons represent, well, not unexpected sequence: we expected there to be regulatory sequence. We know there's transcriptional regulatory stuff; we know there are enhancers, promoters, what have you. But now we can use the conservation to really tell us where that stuff is likely to be. And that's really the answer to the first biological question I raised: why did we sequence so many flies? Why did we sequence so many worms? You can sit down and do a power calculation. What I'm really trying to do is count differences: here's a region of DNA, I've got the human, I've got the mouse. I'm expecting about 40% difference between those two sequences just by neutral drift, by the evolutionary clock. So if it's 100 nucleotides, I expect 40 differences. I only see 10. Should I be surprised? 10 is a small number, 40 is a small number.
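That "expect 40 differences, see 10, should I be surprised?" question is, at its simplest, a binomial tail probability. The sketch below, with the numbers from the example above, treats each of 100 sites as an independent coin flip with neutral substitution probability 0.4; real power calculations are phylogenetic, so this is just the back-of-envelope version.

```python
# Back-of-envelope surprise test: under neutral drift we expect
# X ~ Binomial(n=100, p=0.4) substitutions; we observed only 10.
# A tiny P(X <= 10) means the region is far more conserved than
# neutral expectation. Illustrative only; real methods model phylogeny.
from math import comb

def binom_tail_le(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

p_conserved = binom_tail_le(10, 100, 0.4)
```

With these numbers the tail probability is vanishingly small, so yes: seeing only 10 differences where 40 are expected is very surprising, which is exactly why a conserved 100-nucleotide window is detectable at all.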
You can do a power calculation of, for a given amount of conservation, how far you have to drive the two distributions apart before you can reliably tell that you've got a conserved piece of DNA, given the size of the human genome, which is quite big, so you're gonna have false positives. Greg Cooper, Arend Sidow, and others, followed by me, have done those kinds of simulations and power studies. And if you wanna drive the resolution down to single nucleotides, or five or ten nucleotides, you can pretty quickly convince yourself that for typical distances between vertebrate genomes, like 0.2 to 0.4 substitutions per site, you're gonna need tens or even hundreds of genomes lined up. So it's not so much that we're interested in the platypus per se; it's one of the 100 genomes or the 1000 genomes that we're gonna line up against the human to figure out what's functional in the human. And that's a very crude calculation: it's conserved, it must be doing something interesting. What you can also do is look at the pattern of conservation, and now you can do much more interesting things. For instance, if you're dealing with a coding region, obviously, there's a pattern of conservation that tends to respect the triplet periodicity of the genetic code. So you can take a region that seems to be conserved. In this case, this is a poster child for a long intergenic non-coding RNA, a gene called SRA1, a gene that when it was first cloned had a truncated cDNA that went to this point, and you'll notice the ATG start codon there. Then you color in where all the mutations are: third-position changes red, first-position changes green, second-position changes blue. And you can see most of the changes here are in red. They're respecting the frame. There are two insertions in the picture that I pulled out, one of six nucleotides and one of three, again respecting frame.
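The coloring-by-codon-position idea reduces to a simple count: given two aligned, in-frame coding sequences, tally mismatches at each codon position and see whether they pile up in the third position. A minimal sketch (gap-free alignment assumed, toy sequences, not SRA1 data):

```python
# Crude frame test: count mismatches between two aligned in-frame
# sequences by codon position (0 = first, 1 = second, 2 = third).
# Coding conservation concentrates changes in the third position,
# the "red" changes in the coloring scheme described above.
def mismatches_by_codon_position(a, b):
    counts = [0, 0, 0]
    for i, (x, y) in enumerate(zip(a.upper(), b.upper())):
        if x != y:
            counts[i % 3] += 1
    return counts
```

If the third-position count dominates, the region is behaving like protein-coding sequence rather than like a neutrally evolving or structure-constrained one.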
And when you do this over the entire aligned region of the SRA1 "non-coding" RNA, it's pretty clear that this is a coding gene, of 232 amino acids in the mouse and 236 amino acids in the human. Now, when the first paper came out, this was 1999, they didn't have as much data as we've got now. And if we went deeper into the SRA1 story it gets more and more murky, because it does look like the RNA has some function independent of the coding region, but that's a different story. This ability to recognize coding regions and discriminate them from other conservation is something that's now at the heart of a lot of computational gene-finding methods that are trying to harness the vertebrate sequences we have available, or the Drosophila sequences, or the Caenorhabditis sequences; work that was really pioneered in bacteria by Jonathan Badger, a graduate student in Gary Olsen's lab, but is now fundamental to the field. And it's not just coding regions that impose their own evolutionary constraints on sequence. Once you have that idea, you can say: I expect transcription factors, because of the way they contact DNA, to show particular patterns of conserved bases, with the middle not so conserved, because of the way they reach across and leave a little gap in their conserved binding sites. Or, in the case of my laboratory, interested in RNAs, you can say there should also be a constraint imposed by RNA secondary structure, at least for structured RNAs. So I could imagine making a statistical test: you give me an aligned pair of sequences, and I test whether the pattern of changes, here showing four changes, is randomly distributed, every column independent, or whether the four positions are respecting frame.
They tend to be in the third position, or they tend to be preserving Watson-Crick base pairs in a correlated fashion, which is a feature we see in lots of structured RNAs. You can formalize this, and I'll talk in a few minutes about how we formalize it. There's now a sort of Lego box of tools that we use computationally to build this kind of statistical test for sequence analysis. And you can turn this kind of approach into an RNA gene finder, something that will look for structural RNAs in conserved regions of whatever genome you're looking at. It's been very successful in bacteria; signal-to-noise makes it very problematic in the bigger genomes. But there's been a lot of great work from Ivo Hofacker and Jakob Pedersen and other people in taking this basic approach, which was developed by my wife Elena, and actually getting it to scale to the big genomes. One of the things I love about this field is that you can find little subtle effects, sort of the way a hacker will say: if I heard you typing on a keyboard, the pattern of spaces between your pauses is informative about what you're typing, so I can figure out your password if I've got a microphone close to you. We can do the same kind of thing for genome sequences. We can look for very subtle patterns that evolution is putting on the sequence, and we can detect them. This is just one of many examples I could have pulled, and this trick only really works in bacterial genomes; I wish it worked elsewhere. In the graph, I'm sweeping across four million bases of E. coli and doing a very simple thing. I'm gonna count the number of Gs I see, and I'm gonna count the number of Cs I see, as I just go across the top strand, the Watson strand, and I'm gonna plot the excess of Gs. What you get is that plot. That's very non-random. It's not much.
There's an excess of about 20,000 Gs as you go across, and then that excess goes away and starts coming back. Turns out that's the terminus of replication and that's the origin of replication. And remember that E. coli replicates bidirectionally from the origin. It's not actually understood why this is, but one of the models is that the lagging strand in replication is more solvent-exposed and more prone to deamination of C, because that's a water-driven deamination. So you get a depletion of C on the lagging strand, and that pattern shows up. If you believed that, you'd expect it to also show up for transcription. It happens that in E. coli the transcription direction also tends to respect the replication direction, which is probably part of why the signal is so clean in E. coli. Phil Green, Arian Smit and others have tried to turn this idea into an RNA gene finder. It doesn't work very well with single genomes or pairs of genomes, but it might be that there's enough signal in the mutational biases of transcribed regions that we can use this to find things that have not yet shown up in RNA-seq or epigenome experiments: things where evolution knows that the thing is being transcribed at some point, in some single cell, that would be difficult to measure experimentally. Now, I also showed you a little picture of a pond ciliate, so let me get to that part. Once you're looking for little patterns, you don't have to look just for patterns in all organisms, patterns that you think E. coli shares with humans, shares with the worm. You can say: here's this weirdo creature in some weirdo evolutionary niche that has a weirdo evolutionary pressure. I'm interested in RNAs. I can take advantage of this weird little creature to find RNAs, and then use homology search to get out of the weird little creature and find homologs in other organisms. So here's an example of the kind of thing you can do once you have tons of genome sequences lying on your disk.
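The G-minus-C walk just described is easy to make concrete: accumulate (#G − #C) along the top strand, and in many bacteria the extremes of that cumulative curve fall at the replication origin and terminus. A minimal sketch, with a toy sequence rather than the real E. coli chromosome:

```python
# Cumulative GC skew: walk the Watson strand accumulating (#G - #C).
# In many bacterial genomes the curve's minimum and maximum mark the
# replication origin and terminus, because of strand-asymmetric
# C deamination on the lagging strand, as described above.
def cumulative_gc_skew(seq):
    skew, out = 0, [0]
    for base in seq.upper():
        if base == "G":
            skew += 1
        elif base == "C":
            skew -= 1
        out.append(skew)
    return out

def skew_extremes(seq):
    """Positions of min and max cumulative skew (candidate ori/ter)."""
    s = cumulative_gc_skew(seq)
    return s.index(min(s)), s.index(max(s))
```

In practice one plots the cumulative curve, exactly as in the slide: the switch points, not the absolute values, carry the signal.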
A great sort of observational computational biologist, Jean Lobry, published a paper where he says: I looked at a bunch of bacterial genomes and I noticed something interesting. If you plot the optimal growth temperature of the organism versus the GC content of the genome, you'd sort of expect that the higher the temperature, the more GC-rich the genome would get. But it doesn't happen that way. Bacteria and archaea that grow at high temperatures have other evolutionary adaptations to high temperature besides just strengthening the hydrogen bonds in their DNA. They hold their DNA together by making a reverse gyrase that overwinds the DNA, burns ATP and puts positive supercoils into the DNA, and other tricks to stabilize their DNA. But Lobry looked at the GC content of structural RNAs. A structural RNA gets transcribed as a single strand and folds up, and to stabilize that structure at high temperature, evolution does drive up the GC content. So the GC content of the genome varies more or less randomly with temperature, but the GC content of the structural RNAs is tightly correlated with it. So if you're an RNA guy, you go: oh, okay. What I'm gonna do is reach into the database and find me the most AT-biased genome that still grows at a super-high temperature. The most extreme one you can find is Pyrococcus furiosus. It was isolated off Vulcano Island, Italy; I actually know some of the people who have the benefit of being able to go in their scuba gear and collect on the Italian islands. This thing grows at 98 degrees C, with a 60% AT genome. It is not something you would think of as an experimental organism: it dies in trace oxygen, it's a strict anaerobe, it grows on elemental sulfur, it generates hydrogen sulfide, and nobody likes you on the floor you're working on. It does have the advantage that because its normal growth temperature is 98 C, you don't need a minus-70 freezer to store your samples. You just put it on the bench, and it thinks that's minus 70.
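Given that observation, finding structural RNA candidates in such an extremely AT-rich genome reduces to a sliding-window GC scan: RNA genes stand out as GC-rich islands against the AT background. A minimal sketch; the window size and threshold here are illustrative choices, not the published parameters:

```python
# Sliding-window GC scan: in a strongly AT-biased hyperthermophile
# genome, structural RNA genes show up as GC-rich islands. Report
# windows whose GC fraction exceeds a threshold well above the
# genomic background. Window/threshold values are illustrative.
def gc_rich_windows(seq, window=50, threshold=0.6):
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - window + 1):
        frac = sum(1 for b in seq[i:i + window] if b in "GC") / window
        if frac >= threshold:
            hits.append((i, frac))
    return hits
```

Overlapping hits would then be merged into candidate loci and inspected; this is the spirit of the window-counting script described next.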
But what you can do with its genome: the easy part is looking at its genome sequence, because that had already been done by others, not us, and we could just reach into the databases. This is a Perl script sweeping across its genome, just counting GC in windows, and there are two tRNAs. So finding structural RNAs in this organism is completely trivial, and when you do this, you find a bunch of little RNAs, tens of structural RNAs that had not been discovered before, all of which were in known classes. So this approach was unable to discover novel classes of structural RNAs in bacteria or archaea. Then you say, okay, one of our big problems in interpreting genomes is that we don't know where a gene stops and a gene starts. When I'm looking at cis-regulatory stuff, I don't know whether the enhancer is for this gene or for that gene, or indeed whether it's transvecting over to some other chromosome. And if I'm looking for structural RNAs, I have a problem just finding them in the first place, because the statistical signals are pretty subtle. So this is some ongoing work from a student in my lab, Seolkyoung Jung, working in collaboration with Laura Landweber's lab at Princeton, on an organism that was pioneered by David Prescott at the University of Colorado Boulder. I was a graduate student there, and I've always held this organism in my head as one that's gonna be useful for something someday. Its adaptation is a bizarre one. It actually has two different kinds of nuclei, and I won't go into all the biology, but it has a macronucleus that's a somatic, transcribed nucleus, and in the macronucleus it has about 20,000 different kinds of chromosomes, several million chromosomes counting copies. The average chromosome size is two kilobases, and these are individual chromosomes: telomere, gene, telomere; telomere, gene, telomere; telomere, gene, telomere, extra gene.
For the most part, not completely, unfortunately, but for the most part, this organism has identified all of its genes for us and put telomeres at the ends of them. So now we can just sequence and say: that's a gene, that's a gene, that's a gene. And since protein genes are relatively obvious, not completely but relatively obvious, we can do a subtractive screen: just throw away all the proteins, who cares. Everything else is either an interesting gene that we didn't know about, or a structural RNA, or something like that, and that screen has gone and identified a small number, again, of new RNAs. So let me close with a couple of words about geeky things, forgive me. Underlying all of this is computer science, statistics and mathematics being used to interpret these genome sequences, and at the end of the day, what we're trying to do is start with a sequence and draw these cartoons where we say: okay, the sequence is aligned, and we're trying to attach labels. I could show you a protein sequence, and these could be domains, in this case of the Dicer protein, which was actually found by sequence analysis. Brenda Bass wrote a review where she said: based on what we know about RNA interference, it's gotta have an RNA helicase activity, it's gotta have this, it's gotta have that. And it turns out there's only one protein in C. elegans that has the right combination of domains of known function, and that protein turned out to be Dicer. So that was an important computational clue in the early days of RNA interference. But I could equally be drawing a DNA sequence, with enhancers, exons, introns; this is how we draw things. We're trying to attach labels to sequences. That is, you show me a piece of sequence, and what I'm really interested in is what label I attach to that piece of sequence. The label is hidden, and I'm trying to infer: should it be this label or that label?
And it turns out that there's a big field of mathematics, in digital signal processing and speech recognition, that does this, whether it's a speech signal, an encrypted signal, or the telemetry off your car engine. Actually, I could tell you a story about how those guys are now using software developed in bioinformatics to do prototyping of engine telemetry and astronomical telemetry, which is amazing; the field has sort of come full circle, because we've put so much work into adopting methods called hidden Markov models and stochastic context-free grammars, probability models for attaching labels to sequences that are appropriate either for linear sequence analysis, where you just do a sequence alignment, or, in the case of RNA, where you align base pairs in nested sets. We have models to do that. They were introduced in the 90s by Gary Churchill, Gary Stormo, Anders Krogh and David Haussler; the SCFG models for RNA were introduced by Yasu Sakakibara and myself at about the same time (Yasu was in David's lab then). And this has given us a real toolkit to build models, which gives me an opportunity to say that the most important tool in computational biology, without doubt, is BLAST, probably familiar to all of you. BLAST does a sequence alignment between a query that you give it and every sequence in the database, and it looks for things that are significantly related. When we look at the BLAST algorithm as probabilistic modelers, we say: okay, this is an approximation, a multi-level approximation. It's doing sequence alignment where what it's trying to do is attach a label saying these two residues are aligned, or these two residues are not aligned; this is an insertion, this is a deletion. So it has three states.
It says: I'm either gonna align residues X and Y, or I'm gonna throw X out as an insertion, or I'm gonna throw Y out as a deletion, that is, as an insertion in the other sequence. Those are three states that I can move between. That's a Markov model: where I go next depends on where I just was. So I have hidden states that I'm trying to infer, connected by arrows; that's a hidden Markov model. But the transitions in BLAST are mostly implicit zeros, and then there's a gap-open penalty every time I open an insertion either way, a gap-extend penalty every time I extend an insertion by another residue in either sequence, and a score for aligning the two residues, which comes from the BLOSUM matrix from the Henikoffs. The bottom line is, we now understand that those scores really should be probabilities, at least if you're thinking in this probabilistic-inference context, and we can represent BLAST's internal model as a probability model. And now we can do things like: instead of just giving me the optimal alignment, sum over all possible alignments, and then tell me the probability, how confident you are, that this residue aligns to this residue, integrated over all possible alignments, and other things. The field has been trying to do that for a while. Now I wanna make a somewhat sociological point. There's lots of research and a big literature on developing better methods for sequence analysis. Those methods go into journals like BMC Bioinformatics and what have you, and none of you read them. Then there's another field which takes very important algorithms, reduces them down to their bare bones, and speeds them up on particular hardware, and there's a lot you can do on modern hardware. For instance, this paper is from Michael Farrar, who was, until his untimely death a couple of months ago, unfortunately, our chief software engineer; I recruited him from Boston.
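Before going on, the three-state, probability-model reading of alignment just described can be made concrete. In the sketch below the affine gap penalties are read as log transition probabilities of a tiny Markov chain over match (M), insert (I), and delete (D) states; the numbers are illustrative, not BLAST's actual parameters.

```python
# Affine gap scoring as a tiny hidden Markov chain over three states:
# M (align X to Y), I (insert), D (delete). Gap-open and gap-extend
# penalties correspond to log transition probabilities. Illustrative
# transition probabilities; each state's outgoing probabilities sum to 1.
import math

trans = {
    ("M", "M"): 0.90, ("M", "I"): 0.05, ("M", "D"): 0.05,  # gap-open moves
    ("I", "I"): 0.40, ("I", "M"): 0.60,                    # gap-extend
    ("D", "D"): 0.40, ("D", "M"): 0.60,
}

def path_log_prob(states):
    """Log2 probability of a state path like 'MMIIM' under trans."""
    return sum(math.log2(trans[(a, b)]) for a, b in zip(states, states[1:]))

# The probabilistic reading of a gap-open penalty: the cost in bits of
# entering an insert state rather than staying in the match state.
gap_open_bits = math.log2(trans[("M", "I")] / trans[("M", "M")])
```

Once the model is written this way, "sum over all possible alignments" is just summing path probabilities instead of taking the single best path, which is exactly the move from an optimal alignment to posterior alignment confidence.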
Notice: unaffiliated. This is also a sociological comment. He was not at a university; he did that particular paper in his spare time, and then was recruited by both Bill Pearson and me to actually become a biologist. So you can use SIMD, what's called single instruction, multiple data, which is being driven by the graphics industry. All modern chips are capable of vector-parallel processing; they're being driven to this by all the games everybody plays. You can take advantage of those instruction sets to make bioinformatics software fast, but the question is, who's gonna do it? The difference between writing a piece of software that works for your BMC Bioinformatics paper and a piece of software that runs fast and can be used by the rest of the community is an enormous difference, and it's to BLAST's credit that underlying it is not only terrific theory from Stephen Altschul and Samuel Karlin and others, and great algorithmics from Gene Myers and Warren Gish and others, but terrific software engineering from the NCBI team. It's very rare to get that kind of investment in a piece of software, and that's really one of the things that makes BLAST fast. Frankly, in our lab, we are frustrated that BLAST was written 20 years ago and it's now difficult to adapt it to the stuff we think we know about probability modeling. There's an effort in my lab now to take the hidden Markov model methods and speed them up to BLAST speed, and we've been investing a lot of time in getting the engineering up to snuff. One of our goals is for you to be able to do an HMM-based search, like a BLAST search, on a web server, where you get the response, no matter what the size of the NR database, in 100 milliseconds or less.
That's faster than a Google search, so that you can do interactive searching rather than waiting for a batch job, and actually start exploring what sequence space looks like in all these wonderful organisms; and we're within an order of magnitude of being able to do that. Again, reinforcing the point, I don't want to belabor this, but we have been working to engineer tools that can do this kind of probability analysis. The point I want to make on this slide is the difference between a tool that runs well enough for a publication and one that's engineered. I could probably write HMMER in 1,000 lines of C code; the actual code, which is not as good as we want, is 44,000 lines, and similarly for our other tools, so we maintain a big code base trying to make this stuff useful, and that requires engineering. But it also means that I have a two-faced view of what we're looking at, and it's sort of not great for one's mental health at times: we are very much immersed in two levels of code, one the level of all these wonderful genome sequences, the other the level of our C code trying to interpret all that, and both of those are evolutionary artifacts that are difficult to understand. And with that I'll stop. In counterpoint to Eric's world, my little lab is a husband-and-wife team; we've been working together for longer than we've been together as a couple. It's a very small laboratory at Janelia Farm, pretty much dedicated to building these kinds of tools for the community to use. I'll stop there, and I'd be happy to take questions. We have time for a quick question if someone's gonna race to a microphone. Is that someone racing to a microphone, or racing out the door? Racing out the door. Okay, so in that case Sean will be available at the break. We're gonna take a break now, then. I'm sure he'll talk to you if you have any questions.