state annotation using spectral learning. OK, thanks very much. I'm Kevin Chen, and I'm going to tell you about some work I did with my former postdoc, Jimin Song. The second part of this is also joint work with two computer scientists at UC San Diego, Kamalika Chaudhuri and Chicheng Zhang. So I have a very simple goal today, which is to tell you about our software program, Spectacle, and its follow-up, called Spectacle Tree. The goal is to get more of you interested in trying out our software, if you're more interested in the computational aspects of this, or, if you're more interested in the molecular biology aspects, in looking at our annotations of the ENCODE data, all of which are available from my web page.

OK, so the problem we're trying to solve is essentially the same one that is solved by programs such as ChromHMM and Segway, which you may be familiar with. For each cell type, you're given a bunch of tracks, which are shown here on the vertical axis. These are, for instance, histone modification marks. The genome is shown across the horizontal axis. What you're asked to do in this problem is to return a coloring of the genome, or a segmentation of the genome, into what are called chromatin states, as you see in the first row: red, yellow, gray, and so on. Each of those chromatin states is supposed to correspond to some kind of functional biological domain, such as an enhancer or a promoter. So that's the problem.

At a technical or mathematical level, the tool that's been used over and over again in the literature is something called the hidden Markov model, and we're using that as well. So what I'll tell you about quickly is our software, which is called Spectacle. That's for analyzing one cell type, or one tissue type if you're looking at, for instance, the Roadmap Epigenomics data. That's work we published last year in Genome Biology.
And then there is some follow-up work, where we were able to extend this to the joint analysis of multiple related cell types. So if you have multiple tissues that are all related by some developmental time course, for instance. We published that in NIPS, which is actually a machine learning conference, but we're trying to put this into the biological context as well, and we have that in review.

So the main idea at a technical level, and this is going to be my only sentence about computer science in this talk, is this one: most methods in computational biology use something called expectation maximization and work in the maximum likelihood framework. That's typically slow, and it typically only finds a local optimum in the solution space. So we use something that is very novel and just emerging in the theoretical machine learning literature, called spectral learning. If anybody is interested in the technical details, I'd be very happy to talk to you about it after this.

So I'll just tell you about our results today. Our software, in case you're interested in actually running it, is about 100 times faster at learning the model than ChromHMM. And I'll tell you the implications of that. It's not just that your analysis goes faster; we can actually do more types of interesting biological analysis by having a faster method.

So we can look at the accuracy of the solutions produced. This is a depiction of the learned chromatin states. Each line here, each of the 20 states, is supposed to represent a chromatin state, such as an enhancer or a promoter or something like that. On the left, you see our solution. On the right, for comparison, you see the ChromHMM solution. You see that they're pretty similar, and I think that's good, because it shows we're not completely off the deep end here. But what I'd like to draw your attention to is the one shown in the box, which is chromatin state 20.
And that's where we've learned something interesting. You can see here that it has high K4 monomethylation, low K4 trimethylation, and high K27 acetylation, so we recognize this as a good enhancer state. For comparison, in the ChromHMM solution we see what actually looks like a null state, where you have low values in all of the chromatin marks. And we see this repeatedly in many, many different cell types, essentially all the ones we've looked at, where you see many null states; in this case, in ChromHMM, states 20, 15, and 1. This is a known phenomenon in machine learning. It has to do with overfitting to the background, and it's something that's solved by our method.

Here's a follow-up piece of data which shows that this actually gives you biologically relevant results. We're switching here now to disease SNP enrichment, which is shown by a deeper red color. So the redder you are in this picture, the better you are. Each column now is the disease SNP enrichment for some particular disease that is relevant to this cell type, which happens to be an immune cell type. The ordering of the states is the same. Spectacle is on the left; ChromHMM is on the right. And you see here that chromatin state 20, which is the enhancer state we found and which was not found by ChromHMM, is actually the one with the highest enrichment over all states. I'd like to stress that: not just compared to state 20 in ChromHMM, but compared to all states in ChromHMM, the one with the highest disease SNP enrichment is this new enhancer state.

OK, so that's one example. Let me switch now to telling you about our follow-up, which is called Spectacle Tree. The main thing here is that we're now able, for the first time, to very efficiently analyze multiple cell types together. Let me just explain this model quickly. The important thing here is, oh, I'm sorry. Is there a pointer?
OK, so the important thing here is the tree of three things. Is there a pointer? Yes, it's OK. It's these three things here, and they're supposed to be three different cell or tissue types. The tree is supposed to represent something like the developmental lineage. So, for instance, something very simple in Roadmap could be fetal brain and two adult brain regions. I'm showing you here just the developmental lineage of three tissue or cell types, but this can be done for as many cell types or as many species as you have. And then over here on the horizontal axis is the regular hidden Markov model, so this is depicting all the positions on the genome. So this is the same model as before. It's just that now you're allowed to have a large number of tissue types, which you analyze jointly.

So our machine learning paper is about how to solve this model in a spectral way, and we have done that very efficiently. I'll just tell you the results again. They take the same form as for the single cell type case. We're faster. We find a bunch more biologically significant states, as opposed to background, null chromatin states. We have higher prediction accuracy, in terms of precision-recall accuracy, on known biological states.

And I see I'm running out of time, so let me just make the point clearly: if you are going to use something like ChromHMM or Segway, I encourage you to take a look at our software and our annotations. They're both available on my web page. And please email me if you have any questions. Thank you very much.
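
For readers who want to see the segmentation problem from the talk in concrete form, here is a minimal sketch of a two-state hidden Markov model over binarized marks, decoded with the standard Viterbi algorithm. It is only an illustration: the state names, mark names, and all probabilities are hypothetical values picked by hand, and Viterbi is a generic decoding step, not a description of how Spectacle or ChromHMM work internally. In particular, it says nothing about how the parameters are learned, which is where spectral learning and expectation maximization differ.

```python
import math

# Hypothetical toy parameters, chosen by hand for illustration (not learned from data).
STATES = ["null", "enhancer"]
MARKS = ["H3K4me1", "H3K27ac"]   # one binary present/absent value per mark per bin
START = [0.5, 0.5]               # initial state probabilities
TRANS = [[0.9, 0.1],             # TRANS[i][j] = P(next state j | current state i)
         [0.1, 0.9]]
EMIT = [[0.05, 0.05],            # EMIT[s][m] = P(mark m present | state s)
        [0.80, 0.70]]


def log_emission(state, obs):
    """Log-probability of one bin's binary marks, assuming independent marks."""
    total = 0.0
    for p, present in zip(EMIT[state], obs):
        total += math.log(p if present else 1.0 - p)
    return total


def viterbi(observations):
    """Return the most likely state label for each genomic bin."""
    n = len(STATES)
    score = [math.log(START[s]) + log_emission(s, observations[0])
             for s in range(n)]
    back = []  # back[t][j] = best predecessor of state j at bin t + 1
    for obs in observations[1:]:
        prev = score
        pointers, score = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + math.log(TRANS[i][j]))
            pointers.append(best)
            score.append(prev[best] + math.log(TRANS[best][j])
                         + log_emission(j, obs))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]


if __name__ == "__main__":
    # Seven genomic bins; each tuple is (H3K4me1, H3K27ac) presence.
    bins = [(0, 0), (0, 0), (1, 1), (1, 1), (1, 0), (0, 0), (0, 0)]
    print(viterbi(bins))
```

With these toy numbers, the three middle bins come out labeled as the enhancer state and the flanking bins as the null state, which is the kind of coloring of the genome described in the talk.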