Thank you very much for inviting me to speak at this meeting. I'm going to talk about some new web tools for cancer mutation analysis. They're called CRAVAT and MuPIT Interactive.

First, some motivation. This group may not need this motivation, and we already saw a slide about the wave of data that is coming at us. But here is a snapshot of the somatic mutations that were in Firehose at the Broad as of the middle of last month. There are a couple of million of them. I have not included the pancan-8, which was much larger than all of these. Suffice it to say that there are a very large number of somatic mutations that we need to analyze, and there are a lot more coming.

The goal of the work I'm going to present to you is to provide what I'm going to call an end-to-end mutation analysis workflow. By this I mean a web application into which you can submit a very large number of mutations discovered in cancer genome sequencing. Millions of mutations can be submitted to this service, and the service will do a lot for you. It will map them from genomic coordinates onto transcripts. It will identify mutations that have been previously seen in cancer and variants that have been previously seen in population databases. It will identify the type of change: is it missense, nonsense, silent, at a splice site, and so forth. And then finally, it will do analysis. It will predict which mutations are most likely to be drivers and which are most likely to be functional. It will also allow you to visualize mutations on three-dimensional protein structures and, finally, to find significantly mutated genes and pathways.

Well, this seems like a bit of a tall order, and the first goal of this application has been to focus only on missense mutations. Now, I realize these are not the only interesting mutations in cancer genomes. However, there are a lot of them.
And if you look at this display of the distribution of different kinds of mutations seen in exomes, you can see that missense mutations really are a large majority of the mutations that we're seeing. There are also a very large number of tumor exome sequencing projects that are now in progress or have been completed. So we thought this was a good place to start our analysis.

The first tool I'm going to talk about is CRAVAT. This is a tool for prioritizing missense mutations. And then I'm going to talk about MuPIT, which is a visualization tool that allows you to map mutations from genomic coordinates onto protein structures and learn about mutations through interactive visualization. I'll introduce both tools, and I'm going to try to work in future plans. And really importantly, I'd like to get your input as a community, because these tools are for you. The goal is to make them easy to use for someone who doesn't have a very extensive computational background and who doesn't have a bioinformatics support team in their lab. The goal is to make these very user friendly. You can tell me how well we're doing so far.

Here is the interface of the CRAVAT server. The URL is at the top. There are three stages: input, analysis, and results. I'm going to quickly walk through what's involved in each stage.

First, in input, you input your mutations in genomic coordinates. You can also input transcript coordinates, but we really recommend that you use genomic coordinates, because we can do much better annotation if you give us genomic coordinates. In the next step, we map mutations onto what I'm calling the best available transcript for a position. I don't have enough time to give the details of this; I'm glad to explain later. Briefly, we have a greedy algorithm that selects this best transcript. And it's the best transcript for a particular position.
And it's based on how many of the coding bases the transcript covers and also on the extent to which it represents a consensus of RefSeq and Ensembl transcripts. In the next step, we identify mutations and variants that are already known. We provide you with allele frequencies from 1000 Genomes and ESP6500. We show the number of times the mutation has been seen in COSMIC. And we also show you the primary tumor tissue types in COSMIC in which the mutation has occurred.

Next is analysis. We have two main analysis tools that we're offering at this time. The first one is CHASM. This is a supervised machine learning method that attempts to discriminate between driver mutations and passenger mutations. It uses an algorithm called a random forest, which is an ensemble of decision trees. The trees vote as to whether a particular mutation is a driver or a passenger. You get a score, which is the fraction of trees that voted for the passenger class. Where do these decision rules come from? Well, there are a lot of them. They are bioinformatics based. And we have pre-computed 86 of these features for every position, or almost every position, in the exome. Thus we can do this analysis very fast, which is our goal, since we expect to handle submissions of millions of mutations.

You might wonder how we've trained this classifier. This paper was published a few years ago in Cancer Research. But briefly, we have identified driver missense mutations with the aid of Bert Vogelstein, Ken Kinzler, and Victor Velculescu, who looked through the COSMIC database and curated missense mutations that they believe are drivers. These are in both tumor suppressors and oncogenes. And then for our second class, since we don't really know what passengers look like and very few people publish papers about the discovery of passenger mutations, we do an in silico simulation.
And we're very careful to try to match the dinucleotide spectrum of the particular tumor type that you're interested in when we generate these random passengers. Importantly, our random passengers do not look like SNPs.

OK, so in fact, CHASM is not just one classifier. It's many classifiers. There's really one for each tumor tissue type, and you can select a tumor tissue type in CRAVAT. We also have another, generic classifier for tissue types that are not yet supported. And we are trying to grow this library of pre-constructed classifiers.

VEST is a new tool. It is very similar to CHASM. It uses the same algorithm. It uses the same features. But its goal is to identify not driver mutations, but functional mutations. To do this, we use a different training set. We use about 50,000 missense mutations that are in the Human Gene Mutation Database. These are missense mutations which have a functional impact on the protein and which have generated an observable clinical phenotype. And in this case, our negative class is high-frequency SNPs, which we get from ESP6500.

So why VEST? Why not just use CHASM? Well, this is our view of the somatic mutation universe. I have a large circle, all somatic mutations; a subset, those that are functional; and a subset of that, those which are actual driver missense mutations, in the sense that they give a selective advantage to a tumor cell. I would like to suggest that these are two different classification problems. On the left, I have CHASM attempting to separate drivers and passengers. And on the right, I have VEST attempting to separate functional and benign mutations. There are other excellent tools that attempt to separate functional and benign, including MutationAssessor, SIFT, and PolyPhen. There are many others. However, these are two separate classification problems. I think that they both provide benefits for cancer mutation analysis, and I think these benefits are complementary.
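
The random forest score described for CHASM, and reused by VEST, is simply the share of trees that vote for one class. The sketch below is only an illustration of that voting step: the four one-rule "trees", their feature names, and their thresholds are all invented for the example and are not the 86 features or the trained trees from the talk.

```python
from typing import Callable, Dict, List

Features = Dict[str, float]
Tree = Callable[[Features], str]  # each tree returns "driver" or "passenger"


def passenger_vote_fraction(trees: List[Tree], features: Features) -> float:
    """Return the fraction of trees in the ensemble that vote 'passenger'."""
    if not trees:
        raise ValueError("ensemble must contain at least one tree")
    votes = [tree(features) for tree in trees]
    return votes.count("passenger") / len(votes)


# Hypothetical stand-ins for trained decision trees (invented rules).
toy_trees: List[Tree] = [
    lambda f: "driver" if f["conservation"] > 0.8 else "passenger",
    lambda f: "driver" if f["in_functional_domain"] > 0.5 else "passenger",
    lambda f: "driver" if f["cosmic_count"] >= 3 else "passenger",
    lambda f: "passenger" if f["solvent_exposed"] > 0.5 else "driver",
]

# Hypothetical feature vector for a single missense mutation.
mutation: Features = {
    "conservation": 0.95,
    "in_functional_domain": 1.0,
    "cosmic_count": 1.0,
    "solvent_exposed": 0.0,
}

score = passenger_vote_fraction(toy_trees, mutation)
print(score)  # one of the four toy trees votes "passenger"
```

With this definition a lower score means fewer trees voted for the passenger class, so the mutation looks more driver-like.
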
And if I had more time to talk, I would talk for about 20 minutes about how these two approaches are complementary to each other. I'm glad to discuss this after this talk.

All right, there's a little bit more to our analysis. We provide gene annotations from PubMed and GeneCards. We also provide gene-level scores, which are combined p-values of VEST and CHASM scores. Right now we're using Stouffer's method to come up with these gene scores.

Ah, finally, you get to submit. This is quite simple. You click a submit button. When your job is done, you get an email with a link to a results file, which is a zipped archive. What do your results look like? Well, we give you an Excel spreadsheet with numerous tabs. We also provide tab-delimited text files that are machine-readable. And we offer something else, because while we hope to integrate CRAVAT and MuPIT Interactive, right now they are at two separate websites. So to make it easy for you to use MuPIT Interactive, we provide you with a formatted file that you can upload into MuPIT to map your mutations onto protein tertiary structure.

Let's talk about MuPIT Interactive. This is a brand new tool. The idea is that you input genomic coordinates of mutations that you're interested in. You can upload a file, or you can put them in a text box. Right now we support up to 2,500 at a time, and we're trying to make this larger. Still, this is a pretty good number. When you submit, you are immediately returned a table, which is a list of protein structures from the PDB onto which your mutations can be mapped. You'll see some description of the genes, and you will also see available annotations that we've pre-mapped onto these structures for your viewing enjoyment. You may want to select the structure you're most interested in based on what's in these annotations. So in this table you can see there are binding sites, and there are what are called regions, which are functional regions of interest.
There are some results from mutagenesis experiments, and these come from the UniProt Knowledge Base. We hope to add more features as we go forward.

So what do you get? You click on one of these structures and you get a page like this. You get a blow-up of all the annotations from UniProt. You get a little table with your mutations. Recall that this shows multiple mutations. And you get an interactive view of your protein structure with your mutations mapped onto it. Those are those little green balls.

So why might you want to do this? Well, we're certainly interested in clumping or clustering of mutations, and by simply looking at mutations in primary sequence space, we miss a lot of mutations that are actually close to each other in three-dimensional space. So this allows you to see your mutations grouped together in three-dimensional space, and it also lets you see a three-dimensional view of functional annotations on the protein and how your mutations relate to those. We are using the Jmol applet, but we have created a layer of tools on top of it. We have this new toolbar, our easy-to-use interface, that makes Jmol a little more user-friendly, because Jmol can be difficult to work with.

Okay, so I'm now showing you a result where we input breast cancer mutations from Firehose, and of course I've cherry-picked a very nice example where we found five mutations that cluster together on a structure of RUNX1. You're actually seeing a dimer here. RUNX1 is not biologically a dimer, but this particular PDB structure has RUNX1 as a dimer. That's why you're seeing this in duplicate. So you see the green balls. You can do more. You see all the features in the table. You can click on this region button, for example. You can blow up your image by pulling at it. And you can see here that there is an interesting region that in 3D space pretty much overlaps with the location of these clustered mutations.
This is the kind of thing we find very interesting and something you would not see by just looking at primary sequence. So what is in this interesting region? You can click on that letter A in our toolbar, and you see this is a region that interacts with DNA. So this becomes even more interesting. We can see these mutations are very likely to be functionally important, and we can see why. There's no black box here. All right, it turns out that RUNX1 has been implicated in breast cancer only recently, really just in the past two years. It is now believed to be a tumor suppressor and is significantly mutated, as we saw in the last talk.

Okay, it looks like I'm running out of time. I have a slide about future functionality. We do have GANT, which is a new statistic to find statistically significantly mutated genes and pathways, that we will be adding soon. We're also looking at more exhaustive mapping of mutations onto transcripts and, something I'm very excited about, coming in 2013, integration with the UCSC Cancer Genomics Browser and the Human Gene Mutation Database.

Thank you very much for listening. Please come to poster number 75 to talk with me. I'd also like to tell you that our colleagues in Barcelona have another analysis pipeline, different from but complementary to ours, called IntOGen SM, and they are at poster number 80. And finally, acknowledgements. It's a bit crowded. A lot of people contributed to this. Thank you so much for listening.

I think we're going to move to the break. We have time for maybe one quick question. I think everyone wants to rush and check your tool. Everybody wants to ask how you're going to benchmark this and compare it to the other tools out there, but that's to come.

I think that's your point. Yes, if you want to talk, please come to our poster. I've got more slides, and I'd be glad to talk to you in depth about this. Thanks.

Let's take a break until a quarter till. There's zero incentive to have a long break. There's no coffee as far as I know.
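
The gene-level scores mentioned in the talk are combined p-values obtained with Stouffer's method. A minimal sketch of the unweighted form of that method follows, assuming each mutation in a gene already has a one-sided p-value. The talk does not say how CRAVAT turns VEST or CHASM scores into p-values or whether any weighting is applied, so the function name and the inputs here are illustrative only.

```python
from math import sqrt
from statistics import NormalDist
from typing import Sequence

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def stouffer_combined_p(p_values: Sequence[float]) -> float:
    """Combine one-sided p-values with the unweighted Stouffer Z method."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not 0.0 < p < 1.0 for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")
    # Convert each p-value to a Z score, so a small p gives a large positive Z.
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    # Under the null hypothesis the scaled sum is again standard normal.
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1.0 - _STD_NORMAL.cdf(combined_z)


# Hypothetical per-mutation p-values for one gene.
gene_p = stouffer_combined_p([0.05, 0.05])
print(gene_p)
```

Two mutations that are each borderline on their own combine into a gene-level p-value smaller than either input, which is the behavior one wants from a gene score built out of several consistent mutation scores.
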