And thank you for the invitation to speak here today. Let's see, the only thing I need is... this is a pointer, right? So let me figure this out. And the other thing I need, yes, there we go, is a sense of how time passes. Is that a clock you have there? It's behind my computer. Ah, great. Okay.

All right. I'm going to talk today about a project we have at DARPA on having machines try to understand cancer pathways. I should tell you I'm not a biologist, not a clinician, not an omics researcher of any kind; the last course I took in the life sciences in college was a long time ago. But I do think that the universal ideas of mechanism, causality, modularity, and hierarchy that underlie computer science also underlie systems biology. I've never found two fields with as close an affinity as computer science and biology, and I'm thoroughly enjoying this project. On behalf of all of the participants, some of whom are computer scientists and some of whom are biologists, I'd like to thank you for the opportunity to speak.

I also have to put this slide up; you'll forgive me. Last October, government-backed Treasury bills lost seven standard deviations of their value in about two minutes, and two minutes later they recovered it. This was a flash crash, and flash crashes are caused by automated trading systems that we don't really understand. In a lot of important domains we have what we might call component-level understanding, but not system-level understanding. We understand, for example, that iron will cause algae to bloom, and that when the algae die they sink to the ocean floor and take carbon dioxide with them. So with this component-level understanding of how algae work, we're proposing to dump a lot of iron filings into the ocean to control the climate. What will happen? We don't know. We lack a system-level understanding.

For these reasons, to try to develop technology that provides system-level understanding, we created the Big Mechanism program at DARPA, the goal of which is technology to understand complicated systems. And we're surrounded by complicated systems that we don't understand well: not only the climate, ecosystems, socio-economic systems, energy systems, and food systems, but cancer. So, fortuitously, we decided to work on signaling pathways in cancer and we bumped into Frank McCormick. We were introduced by a mutual friend, and he said, ten months ago now, "Why not work on RAS?" And we said, "What's RAS?" And we got started. RAS is now the focus of the Big Mechanism program, and in the next few minutes I'll tell you about some of our progress.

Pretty much all of the systems being developed in Big Mechanism have this architecture. The systems read journal articles and build up models of what's being described in them. They suggest revisions to the models; those revisions are processed and revised models are produced. There's some model management happening, and then downstream we do some reasoning. The reasoning may be about what causes what, or about where to apply pressure to a system to make it change its behavior. So everyone's trying to build this kind of system. But in ten months we've gone from naivete to realism, and I'm going to tell you about that, not because I think we're on the wrong track, but because any time we try to do something like this, the challenges are much more interesting than the successes. A toy sketch of this read-revise-reason loop follows.
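To make the architecture concrete, here is a minimal, runnable sketch of the loop just described, assuming everything in it: the function names, the pattern-based "reader," and the edge-list representation of the model are illustrative placeholders, not the program's actual software.

```python
# A toy sketch of the Big Mechanism loop: read -> propose revisions ->
# update model -> reason. Everything here (the extraction pattern, the
# edge-list model) is an illustrative assumption, not the program's code.
import re

def read(paper_text):
    """Stand-in reader: extract every 'A activates B' statement."""
    return re.findall(r"(\w+) activates (\w+)", paper_text)

def propose_revisions(extractions, model):
    """Suggest only the relations the prior model does not already contain."""
    return [edge for edge in extractions if edge not in model]

def downstream_of(entity, model):
    """Stand-in reasoning: what does the model say this entity acts on?"""
    return [b for (a, b) in model if a == entity]

model = [("EGF", "EGFR")]  # a tiny prior model: one causal edge
paper = "We report that EGFR activates RAS, and RAS activates RAF."
model += propose_revisions(read(paper), model)  # revised, provisional model
print(downstream_of("RAS", model))              # ['RAF']
```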
You're cancer biologists; you certainly understand that.

Let me start with reading. You don't realize how difficult reading is because you do it all the time, really without thinking about it. We all do. But for computers, reading is really a problem. Consider the sentence: "RAS acts as a molecular switch that is activated upon GTP loading and deactivated upon hydrolysis of GTP to GDP." You read it without giving it a second thought. But a machine has to figure out what the entities are: RAS, GTP, GDP. It has to associate those entities with the standard ontologies. It has to recognize activation, loading, and hydrolysis as events. And importantly, it has to understand that some events and entities are actually classes of things, abstractions. They're not particular, precise things; they may not resolve to a Gene Ontology identifier. They're abstract classes. Then, if you get that far, a system has to resolve what are called co-references. Look at the second mention of GTP here. Is this the same thing as this thing? Well, you know it is. But how is a machine supposed to know that? And what if it were called something different? Then, once you know what the entities are and you can track them through the sentence, you have to put them in their appropriate roles in events. You have to recognize that the subject of activation is this molecular switch, which itself refers to RAS. And you have to recognize that the loading here is not the loading of GTP; it's the loading of GTP into RAS. Again, you do this without thinking about it. It's hard for computers. And if you can do it for one sentence, then you have to read a whole paragraph, or a whole paper, and be able to tell a causal story about what happened.

I thought, naively, that everything biologists talk about is represented in a biological database or ontology somewhere. Gene Ontology and Pathway Commons collectively contain about two million proteins. Our team at USC looked at 755 sentences from papers, found 541 proteins, looked them up in Gene Ontology and Pathway Commons, and found only half of them. We thought, well, okay, maybe the problem is spelling variations, so they built a fancy pattern-matching algorithm, and still found only about three quarters of them. Many of the things that appear in papers simply don't appear in the databases. And when you do find something in a database, you typically find too many things. I looked up RAS yesterday in Gene Ontology: there are 91 entries, and almost 17,000 entries for things associated with RAS. How is a machine supposed to know who the players in a story are? It's not a trivial problem by any means. A minimal sketch of this lookup problem appears below.

Another problem is that we use different words to talk about the same thing. Here we see the volume and weight of a tumor, right here. In orange is the second reference to the volume and weight of that tumor; it comes right at the end of the paragraph, and it's the word "size." They mean the same thing. How is a machine supposed to know that? And if you go through this entire paragraph, you realize that we refer to this one thing four different ways: one, two, three, four different ways within a single paragraph. To continue the example: occasionally the same thing will be referred to in exactly the same way, as PI3K is here, but that's the exception, not the rule. This is called the co-reference problem: knowing that one thing can be identified by different names.
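Here is a minimal sketch of the lookup half of this problem: exact lookup fails on surface variants, light normalization rescues some of them, and many mentions simply are not in the databases at all. The mini-lexicon and its identifiers are hypothetical, for illustration only.

```python
# A minimal sketch of entity lookup against an ontology. The lexicon and
# its identifiers are invented; real systems map into Gene Ontology,
# Pathway Commons, and the like.
import re

LEXICON = {
    "kras": "HGNC:6407",   # hypothetical mention -> identifier mappings
    "hras": "HGNC:5173",
    "pi3k": "GO:0005942",
}

def normalize(mention: str) -> str:
    """Fold case and strip hyphens/whitespace so 'K-Ras' and 'KRAS' collide."""
    return re.sub(r"[\s\-]+", "", mention.lower())

def look_up(mention: str):
    """Normalized exact lookup; returns None for the many mentions
    that do not appear in the databases at all."""
    return LEXICON.get(normalize(mention))

print(look_up("K-Ras"))  # 'HGNC:6407' -- normalization rescues this variant
print(look_up("Ras"))    # None -- an abstract class, not a database entry
```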
I was really astonished by the scale of the co-reference problem. Larry Hunter at the University of Colorado Medical Center hand-annotated 97 papers. What he found was over 20,000 of what we'll call identity chains in those papers. An identity chain is a series of successive references to the same entity, as I just showed you: those were four successive references to the same entity. In 84 papers from that corpus, the mean number of identity chains per paper was 245. This is worse than Tolstoy. Imagine a novel with 245 characters, each referred to on average 12 times. How are you supposed to reconstruct the story? It's difficult with Tolstoy; imagine a machine trying to do it.

In fact, named entities and co-references are only two of the problems involved in reading. What you've seen happening in the reading community is the development of many, many technologies for machine reading, and they're always put together into a great big pipeline. We have a document reader, a sentence splitter, a GENIA tagger for named entity recognition; down here we refer to Kebby; here we do some more tagging; and what have we got here? Function words. Lots and lots of components that find function words. Here we find overlapping annotations, and we do some parsing. This one is from the National Centre for Text Mining at the University of Manchester in the UK. So we build these huge systems that do many different kinds of processing, because reading is hard.

But the vast majority of reading systems do not read the way that you and I do. When you read "Angry Whopper. All Fountain Drinks 98¢. Buses Welcome," you actually have something in mind. Humans read with something in mind. You can call it context if you like; you can call it a model; they're roughly the same thing. Which of these two pictures do you think the sign is about, this one or this one? You have something in mind. When you see this, it evokes some context or some model: the fast-food model. And it's not as if you just evoke that model and the knowledge pours into your head. You actually have to think about what's being said. Are buses really welcome? No. People on buses are welcome. The buses aren't going to eat a damn thing. So when you process language you do an enormous amount of inferential work, and you do it with a model in mind. That's the kind of reading we want in the Big Mechanism program.

So here's how we're doing it. We provide the readers with a huge model in the BioPAX representation language and with a stack of texts. They read the texts and answer two questions, just two. First: what does the text claim about the prior model? (I seem to have lost my microphone.) In particular, does it corroborate, contradict, or extend the model in some way? And second: where in the text is the evidence for that claim? Those are the two questions we want answered about this model, given these texts. A sketch of the record a reader might emit for these two questions follows below.

Now, let me get myself organized. For example, this paper from Nucleic Acids Research claims something we didn't know before, that is, something that wasn't in the prior model. It claims an extension to the prior model, namely that the protein HuR translocates from the nucleus to the cytoplasm. That's something in the paper that wasn't in the model.
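Here is a minimal sketch of those two questions as a data structure, shaped like the "index card" described next; the field names are guesses for illustration, not the actual schema used in the evaluation.

```python
# A toy "index card": what a text claims about the prior model, and
# where the evidence is. Field names are assumptions, not the real schema.
from dataclasses import dataclass
from enum import Enum

class Relation(Enum):
    CORROBORATES = "corroborates"
    CONTRADICTS = "contradicts"
    EXTENDS = "extends"

@dataclass
class IndexCard:
    claim: str          # what the text claims about the prior model
    relation: Relation  # corroborate, contradict, or extend
    evidence: str       # where in the text the evidence for the claim is
    source: str         # the paper the claim was read from

card = IndexCard(
    claim="HuR translocates from the nucleus to the cytoplasm",
    relation=Relation.EXTENDS,
    evidence="HuR is normally a nuclear protein, but it can translocate "
             "to the cytoplasm on stress.",
    source="the Nucleic Acids Research paper discussed above",
)
print(card.relation.value)  # 'extends' -- an extension to the prior model
```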
This kind of record is what we call an index card. The card on the slide was constructed by hand by scientists at the MITRE Corporation as a gold standard for an evaluation this summer, in which machines will work with a model that has about 20,000 entities in it and will read about a thousand papers. The machines are going to read a thousand papers and construct index cards like this by reading against the model.

Now the question, of course, is whether that's going to work, so let me show you some prior evidence. This is what we were doing in December. In December we asked whether the machines could recognize the entities in biological texts. This plot shows precision and recall. For those of you who aren't familiar with the terms, recall is the same as sensitivity, and precision is the fraction of what the system reports that is actually correct. We asked two questions. First, can you identify that something is a biological entity? That's these blue triangles. And second, can you identify which biological entity it is, by pointing into one of these ontologies? We asked Harvard-trained biologists the same questions, and the best of the machines is doing almost as well as a Harvard-trained biologist. That was in December.

We also asked whether the machines could identify events in the text. Here's an event: "the expected increase in ERK activity in the NRAS mutant cells." We want to see whether the machine can recognize that this is actually an event and identify its participants. Again, here's the biologist, and here's the best of the systems, doing almost as well. Where machines didn't do anywhere near as well as biologists is on the new task of saying what the text says about a prior model. Machines have never been asked to do that before, and as you can see, performance isn't very good. Interestingly, the biologists couldn't agree either: there's about 50% overlap between those two biologists on what they think the text says about the model. But that's a different story.

In the meantime, over the last few months, the reading systems have gotten a lot better. Go back to that paper I was telling you about, on the translocation of the protein HuR. This is the output of one of the more sophisticated, semantically deep parsers, developed by James Allen and his team at Rochester. By the way, there's a URL here, a TinyURL for the Allen TRIPS parser. You should try it out; it's really fun, and it does a good job parsing sentences. Anyway, here's the parse tree for the first part of that sentence: "HuR is normally a nuclear protein." And it appears to be correct. That is to say, it recognizes that the sentence is about something being (that's the "is" here), that something is normally the case, that it's about a protein, and that the protein is the subject of the "be." But here's where it starts to get troublesome. It says that being nuclear is a property of the protein. That's not actually what the sentence says. What the sentence says is that the location of the protein is normally the nucleus. So in one sense it's got it wrong; in another sense it's got it right. There are such things as nuclear proteins; the parser just doesn't know that nuclear proteins live in the nucleus. Also, keep track of this: it understands that this is HuR, the protein, the thing we're talking about. Now we go on to the second part of the sentence: "but it can translocate to the cytoplasm on stress." What's "it"?
The parser fails to recognize that "it" is that protein; it doesn't resolve "it" back to HuR. And something else interesting happens. Let's see what it's got right. First of all, it does recognize that the sentence is about translocation, which a lot of parsers cannot do. It also recognizes that there's a modality to this sentence: it doesn't say the protein does translocate, it says it can translocate, and that can make all the difference when interpreting experimental results. So it gets those things right. But look down here; this is sort of interesting: "translocate to a location, which is the cytoplasm." Wait a minute. We have a modifier for cytoplasm. The phrase is parsed as "the cytoplasm on stress," as if to say "the cytoplasm on your slide" or "my brain on drugs." That's not right: the parse isn't saying that stress triggers the translocation; it's treating "on stress" as a modifier of the cytoplasm. So it gets some things right and some things wrong. The reason I'm telling you this is that I see this going on pretty much indefinitely: machines will do an okay job of reading, but to do a really good job they have to know something about the domain, and that can be quite taxing.

All right. That brings me to my penultimate topic, which is how to represent biological models and evidence so that machines can reason about these models and update them as evidence becomes available. There are certainly differences among the representation languages we use for biological models: BEL, BioPAX, PySB, Kappa, Pathway Logic, a variety of things. But all of them basically encode if-then rules. Here's a statement in BEL: here's the if part, here's the then part. It says that the RAS-ASPP2 complex increases the transcriptional activity of p53. Here's another if-then rule, in Kappa. It basically says that if you have a molecule that looks like this and a molecule that looks like this, then they can bind to one another. That's what this rule says. So they're all if-then rules of one kind or another. I'll skip over that.

So the question is: if models are just collections of if-then rules, how does reading update them? I'm going to sketch this very quickly. Here is, if you like, an extremely reduced model of how H-RAS is activated by these things up here. Now imagine that this is your prior model, this is what the machine knows, and it reads: "H-RAS association with GTP was undetectable when PI3K was substituted with a mutant variant." Clearly PI3K is having some effect; it needs to play a role in this model. The question is where. Should you update this rule here? Should you update this one? That one? You've got a model; what is the text telling you about the model? The algorithm for doing this is actually pretty complicated, and it's Bayesian, but I can sketch the intuition pretty easily. The intuition is that you want to modify as few rules as possible, and you want to modify them so that they look more like other rules that have the same sorts of antecedents as the rules you're considering modifying. That's a bit of a mouthful, so let me give you an example. Rule 1 has EGF and EGFR as antecedents, and very few rules in the database with those antecedents also have PI3K as an antecedent. Rule 529-1, on the other hand, has GAB1 and GAB2 as antecedents, and many other rules with those antecedents also have PI3K as an antecedent. So that is the more likely rule to modify. A toy sketch of this scoring intuition follows.
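Here is a minimal sketch of that scoring intuition, not the program's actual Bayesian update; the rule base, the rule contents, and the scoring function are all invented for illustration.

```python
# A toy version of "modify the rule whose antecedents most often co-occur
# elsewhere with the new antecedent." Rules are sets of antecedents.
RULE_BASE = [
    {"EGF", "EGFR"},           # like rule 1: EGF and EGFR as antecedents
    {"GAB1", "GAB2", "PI3K"},  # other rules resembling rule 529-1
    {"GAB1", "GAB2", "PI3K"},
    {"GAB1", "GAB2"},          # like rule 529-1, a candidate to modify
]

def cooccurrence_score(antecedents, new_antecedent, rule_base):
    """Among rules containing these antecedents, the fraction that also
    mention the new antecedent. Higher means a more plausible edit."""
    peers = [r for r in rule_base if antecedents <= r]
    if not peers:
        return 0.0
    return sum(new_antecedent in r for r in peers) / len(peers)

for antecedents in ({"EGF", "EGFR"}, {"GAB1", "GAB2"}):
    print(sorted(antecedents),
          cooccurrence_score(antecedents, "PI3K", RULE_BASE))
# ['EGF', 'EGFR'] scores 0.0; ['GAB1', 'GAB2'] scores about 0.67, so the
# GAB1/GAB2 rule is the more plausible one to extend with PI3K.
```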
And in fact, when you modify this rule and that one, you do a bunch of calculation of posterior probabilities, and it turns out this gives you more or less the right answer. Now, will this work forever? Is this simple way of doing updating going to scale as we build really complicated models of cancer? And when we propose such an update, don't you want to go back to the lab to see whether it's right? Yes, exactly, you do, and so all modifications to models will be provisional. And there will be lots of modifications to models, because we're going to be reading thousands of papers.

That leads to a really interesting new area of science: how to manage thousands of models, each of which bears some family resemblance to the others but is in some ways different. To my knowledge, nobody has tackled this problem before. If you want millions of models of nature out there, maintained by machines on behalf of human researchers, how do you organize them all? How do you search them? Who owns them? This is a pretty interesting challenge, and it's one of a number of high-level challenges we face as the Big Mechanism program goes forward.

Because I think the real contribution of Big Mechanism is to change how knowledge is organized and communicated. Today we work under a thoroughly medieval model; you do and I do. I'm going to call it the pull model. It's a medieval model in which we go into our monastic cells (nowadays they have computers in them), pull knowledge into our heads, and try to synthesize that knowledge in our heads. There's a bit of a problem with that. You don't read any faster now than you did as an undergraduate, or than the monks did 400 years ago, and yet the amount of knowledge is increasing exponentially, which means you're getting less and less of what's relevant. This is an unsustainable model. There are 20,000 proteins within two hops of RAS; there's no way you can understand everything that's going on even if you read all day, every day. You need help, and I need help. The medieval model is also enormously wasteful: because it's so hard to read at the scale we need to, the vast majority of things that are published aren't read, and yet we pay for them.

What we need instead is what I'll call a push model. In the push model, everything that's published is immediately consumed by a computer, and its implications, that is, what it tells us about what we know, are immediately calculated. When somebody publishes a paper, it goes into a big mechanism maintained by machines, and in that big mechanism we calculate the ramifications. That's what human scholars study: not the papers, but the mechanisms built by machines. If this idea works, it will be the first demonstration of a completely new kind of scholarship, where instead of pulling things into our heads, we push them out to the machines. I think that could change scholarship profoundly. Thank you very much. Do I have time for a couple of questions? Thank you. Are there any questions? Yes, in the back.

How often did you find overt contradictions in papers?

At the moment we only know this from human annotators, and we know that there are enough to be worrisome. I can't give you precise numbers; by the end of the summer I'll be able to tell you for a thousand papers. Now, I also have to point out that those will be contradictions recognized by machine.
And what that means is that some contradictions won't be detected, and other things that are detected won't actually be contradictions, because the machine doesn't do a perfect job. But I think we're going to be surprised.

How do you deal with the rates of false positive and false negative results, or conversely true positives and true negatives? There were some reproducibility studies showing that biological mechanism studies from academic labs, when reproduced in industry, had about a 10-20% validity rate. And presumably negative results, the results that aren't found, have an even lower rate. How do you account for that?

That's a good question. The science isn't perfect, and sometimes people screw up. It's not overt; sometimes people overstate the case, giving you the impression that a result is general when it's actually very tightly circumscribed. There are all sorts of ways that knowledge can be wrong in one way or another. How do you deal with it? You compare what's asserted to what you know, and the more you know, the better you can judge the veracity of what's asserted. That's one of the reasons we think Big Mechanism is a valuable program: the bigger these models are, the more knowledge there is against which to compare any given new result.

Thank you for the very illuminating talk. This is the second time I've heard you, and every time I hear you I think how wrong we are to do this by hand, using human heads only. I have one comment and one question. I do believe you're right that we're reaching a level of knowledge where no single human, no matter how intelligent, is going to be able to tease things out. Even in this symposium we've generated a ton of data, and we're finding some new things because they're very obvious; they're basically hitting us in the face. But I'm pretty sure we're missing a bunch of things, because the models in our heads don't contemplate new information that is more subtle. So this will be very welcome. My question is this: I'm glad you're working with Frank, because then we don't have to clean up the mess. When do you think this system will be able to ingest more than just RAS? And how do you intend to release it so we can use it?

Good question. It gets back to what I said earlier about how much you have to know about a domain in order to read well. There really is some stuff in the system that's specific to RAS, but not very much, so it's not a RAS-specific system. And if it were, this program wouldn't really be a success, because it's supposed to be about building big mechanisms generally. When can we give it out? I think by the end of the summer. Actually, a lot of the software is already available: the Manchester folks, I believe, make their tools available, as do the University of Arizona folks, and James Allen's parser is available. Putting them together into a system that's easy to use is a bit more challenging. I'm about to start a project on exactly that, making it easy for biologists to construct their own pipelines. But I think it's on the order of six months.

Okay, well, let us know. We'll let everybody know. Yes, we would like that, because that's the best way to test it. Thank you.

In the literature there were two earlier experiments: Deep Blue at IBM, which learned to play chess, and the Watson project, which learned to play Jeopardy. In both of them there was a huge learning effort.
Years of very comprehensive learning went into making those systems capable and responsive. So in your project, what is the learning component, and what's the outlook there?

Well, I spoke mostly about reading. I think there's going to be a lot of learning in assembly, because we have to learn, for example, what plausible rules look like. I gave a very preliminary example of that: rules that mention these sorts of things also mention those sorts of things. I think there's going to be much more learning there. And right now, on the machine reading side, it's learning, learning, learning all the time. Here's an example: learning paraphrases. If you go to Gene Ontology, for any given entry it will give you three or four paraphrases. We're learning hundreds of paraphrases for anything in Gene Ontology, hundreds and thousands of different ways of referring to the same thing, to try to handle the co-reference problem. So learning is absolutely essential to this project, as, understandably, is big data. Thank you.

Hello. This is very exciting. We curated a database called GeneSigDB, and we've curated about 7,000 papers as part of that. One of the things we've run up against when we've tried to do much, much simpler text mining is that we're restricted to PubMed Central. So I'm very excited to see large projects like this: will there be a push toward open access publishing?

Absolutely. Well, I mean, DARPA can't steal people's papers, so we have the same problem: we're working only with open access papers right now. It is, however, a real limitation. What do we do about all of the papers that are not open access right now? We have no plan; we just don't know what to do about it.

Is this going to produce a new way of writing papers?

Well, I certainly hope so. Because if we understand language technology well enough, and if we understand what a machine needs to know about what a paper says, then we can start imagining easy-to-use markup languages that go far beyond keywords to describe the content of a paper. I skipped over a slide that talks a bit about that, but if you want, we can talk more about it at a break.

Yes, it's very exciting. There are a lot of restrictions with regard to Gene Ontology. Thank you all very much. Thank you.

So we'll move on to our first session of the morning, and we're going to hear an update on papillary renal cell carcinoma from Dr. Marston Linehan.