So Andrew is the Distinguished Professor of Computer Science at the University of Massachusetts Amherst. He is also the Director of the Information Extraction and Synthesis Lab and the Center for Data Science, and was able to say that without the slide, so that's good. We're especially excited to have Andrew: he has over 250 publications and over 50,000 citations, all in core AI, machine learning, natural language processing, and reinforcement learning, and his work on learning great representations from data and structured knowledge to support higher levels of reasoning is truly at the forefront of AI research. So with that, please welcome Andrew. My screen didn't flash yet. Thank you, Lisa. I really enjoyed Rob's talk so much, and I actually hope that I have some research that may help more in the future. I want to thank IBM for the invitation and for their support while I'm here. I also want to thank the Center for Data Science, which I direct, which is not only enabling us to increase dramatically, by 14 tenure-track positions, the number of machine learning faculty we're hiring, but has also enabled a lot of this research, because we used a large grant to gather a large computational cluster with over 800 GPUs in it, which has dramatically changed the way we do research. I was talking with Joshua a while ago, and it might be the largest academic GPU installation anywhere in North America. And for the industrial folks in the audience, I want to make sure you know that in about a month we will be holding a data-science-specific career fair, at which a lot of our industrial partners have found good matches; it's mostly graduate students, and they present posters as well.
Okay, but we're here to talk about knowledge representation and reasoning, which is one of the core aspects of artificial intelligence, and in particular its application to science. And we really could use a lot of help, because the progress of science is accelerating rapidly, which means the volume of publications is increasing as well. There was a recent survey of scientists, 85 percent of whom said, "I'm not able to keep up with all the research in my field," and I think the other 15 percent are lying. They need not just a search engine to help them find things: they're in the midst of designing experiments and deciding what directions to take next, and they really need navigational tools that will help them reason about science in order to accelerate their scientific progress. And there are many companies which in their youth started by focusing on keyword search, and which, now that we're spending more time in front of smaller screens, are also thinking about dialogue systems, question answering, and knowledge bases: a more structured view of the world, in order to provide more succinct answers to questions and ways to browse among entities through various types of relations. The types of relations here are shown in the bold text. The catalog of relation types in this knowledge base was designed by humans, who asked: what are the different semantics of relation types that we want to create? They thought very hard about it and created a large list, but in a sense it's not really large enough, because just as in web search, given the great diversity of things people ask about, the majority of the mass is in the long tail. So for example, if you want to know who Jeff Bezos criticized, the Google Knowledge Graph, or Freebase, or many others can't tell you, because they didn't create that relation in their knowledge base. So what we're really
interested in is a knowledge base with an open schema, sort of like keyword search, where you can ask about absolutely anything, but at the same time one that has a structure of entities and relations, and one that will support reasoning. That's really what this talk is going to be about: going from text, to knowledge bases, to reasoning, and especially doing this in the context of science. This is an area in which I've had an interest for quite some time. About a decade before Google Scholar existed, some colleagues and I at Carnegie Mellon built a system that had many similar features; you can tell how old this is by looking at the header in the web browser there. When I arrived at UMass we built a more modern system that knew not just about people and papers but about institutions, conferences, journals, grants, advisors, and the rest. Now we're working on multiple thrusts, one of which is to build a knowledge base of all scientists in the world, their expertise, and their career-path history, by gathering information from many places, as well as to build a knowledge base of scientific entities and relations: materials, equipment, organisms, processes, tasks, methods, many of the things we heard about from Rob. And we'd like to do this for the obvious reasons. Just as a brief aside, I'm also very interested in revolutionizing scientific peer review; I think it's high time that this happened. The scientific review workflows we use right now were designed in the era when we printed research on paper and sent it through the post, and we have much better methods now. We've built a system called openreview.net, which actually just served the submission deadline for the ICLR conference last week, accepting about fifteen hundred submissions in just a few hours. It's being used by about 20 different venues now, and we're in discussions with ICML, CVPR, and others. If afterwards you want to talk about that, I would love
to tell you more about it. All right, but what I mostly want to talk about here is building a knowledge base of scientific entities. I had been working with a Canadian startup company that shared many of the same goals and that was acquired by the Chan Zuckerberg Initiative; this is not Facebook, but the philanthropic arm through which Zuckerberg has pledged to give away the vast majority of his wealth. They have an extremely large knowledge base of scientific entities: if we just take one column here, some 26 million papers, over 400,000 drugs, 36,000 journals, many of these connected in a web, and we've been working together both to enlarge that knowledge base and to do reasoning over it, with techniques that I'll describe in a minute. We've also been collaborating with folks right here at MIT in materials science: gathering hundreds of thousands of materials-science research papers, automatically finding the section where they describe the recipe (it really does look like a recipe) for cooking up a brand-new batch of, say, the car-battery materials of the future, extracting that recipe as a structured graph, and then doing analysis on a large collection of recipes to understand the relations between the recipes and the properties of the resulting materials, in order to suggest brand-new recipes that have never been tried before and that might help us address global-warming issues a little quicker. We have also been working with the U.S. Patent Office to design new, more accurate methods of resolving the authors of both patents and research papers. This is work by Nick Monath, who is here with us today and has a poster here today, and his work is now being deployed as part of the U.S.
Patent Office workflow. Okay, but those are just a few examples; mostly I want to talk about the technology for how this happens. A traditional knowledge base consists of entities and relations and, as I was describing before, some hand-designed schema of entity types and relation types. That is exactly what's in Freebase and various other traditional knowledge bases, as well as in medical knowledge bases like SNOMED and many others. We're going to be talking instead about replacing the hand-designed schema with something we call universal schema, which I'll describe in a moment, which is based on vector embeddings and deep learning in order to gain some generalization capabilities. We will also use textual content, looking quite a bit at the degree to which raw text itself can serve as a knowledge base; but we also want to build a graph at the same time, in order to be able to navigate through it, working across this smooth landscape between raw text and very structured content. The two technical things I want to focus on most today are some work on chains of reasoning with recurrent neural networks, in a way that really bridges symbolic AI and vector-based or neural-network AI, where we'll talk about using reinforcement learning to search for proofs in this space; and efficient methods of representing taxonomies and common sense using a new technology we've been working on recently that we call box embeddings. Okay, let's start with an introduction. To build a knowledge base we usually start with a raw set of text. From there we need to find some entities; we need to resolve multiple occurrences of those across both unstructured and structured sources; we do relation extraction on top of those; and we put the results into a knowledge base, which, given a query, can hopefully answer some questions. What I want to focus on most here are these types of data: the different types of entities,
the different types of relations, and where they come from. So let's first step back a little. We're given some raw text, and with apologies, this text does not actually have to do with science, because many of us aren't bench scientists and I want to work with data and make a presentation we can all understand intuitively; I'm going to rely on your imagination to see how this would apply to science. From the raw text we extract the names of some entities and we resolve them, knowing when "Bill Gates" and "Gates" refer to the same person. From that we can build a graph of these entities and the sentences in which they co-occur. But to really understand what's going on here, we'd like to know the types of the entities appearing at the nodes of these graphs and the types of the relations appearing on the edges between them. So let's focus on entity types for a moment. Here are some types that we may have had in our head, that we thought would be important for our application. But we're also going to end up importing data from multiple sources to help augment our graph, and that data will come from some other source, where some other person designed some other schema that is almost surely not going to match our schema of entity types. They'll just be different, and unfortunately it's usually not so easy to map them deterministically one onto the other; instead of overlapping by containment or by exact implicature, these concepts often overlap only partially, like this: a "film subject," which is one of the columns here, is often a person, but sometimes it's a dog or some other creature, so the concepts overlap partially. So rather than trying to map all of our semantics onto a single schema to rule them all, in universal schema we're going to keep around all of our input schemas and not try to cram all of our semantics
into one box; we embrace the full diversity of it all, and then learn something about how these different schemas have some mutual implicature among them. Furthermore, I'm only going to show a few examples here, but you can imagine we might actually be working with 20 or 100 different databases, each with its own schema, each with thousands of schema items. In addition to that, we're going to take a lot of our schema from the raw text itself: grammatical constructions like appositives and others express entity types, and we'll let the raw text, which is conveying meaning, serve as its own schema as well, so additional tens of thousands or hundreds of thousands of schema items might come from there. We're going to keep these all around, and this is what I mean by universal schema. So it's our job to take some input context and classify the appearance of an entity as to which entity type it is, and rather than having a set of parameters (shown as a purple vector here) in the original raw feature space, we're going to learn them in some compact latent space. Think of it like a word embedding in, say, 100- or 200-dimensional space, where the individual dimensions don't really have any meaning, but we learn these vector embeddings so that types with similar meaning appear near each other in the vector space. The entities we want to classify get embedded in the same space, and I can make a prediction about whether some entity is an instance of some type by taking a dot product between their vectors. At training time I'll observe, from my raw data, some of these entities and which types they belong to, and it's my job to do matrix completion, to fill in some of the missing cells, so that even though I didn't originally observe that Bill was an executive, I can predict that he is an executive from the
co-occurrence of other types that I did observe for him. This kind of matrix completion is exactly what Netflix uses to recommend movies to you, given the partially completed matrix of what you and others have said about which movies you like. The same thing can be done for relation types: we put the different relation types, from both multiple structured sources and from text, in the columns, and entity pairs in the rows, and again we can do matrix completion to predict new relation types. To give you some visceral sense of how this works: on some real data we ran through the system, we observed that Kevin Boyle is a historian at Ohio State, and from that we predicted that he's a professor at Ohio State; and yet, in a demonstration of the model's ability to represent asymmetries, when we observed that Freeman is a professor at Harvard, we didn't necessarily infer that he's a historian. All right, so here we end up with a knowledge base that's a graph of entities and relations, in which, rather than having symbols sitting on the nodes and edges of the graph to communicate the semantics, we have vectors instead: vectors representing a broad diversity of different kinds of semantics. I find this really interesting: think about mixing vector representations and graph representations, and furthermore, think about how philosophy and computer science and mathematics have spent decades studying what it means to do logical reasoning on symbols. Here we're about to ask ourselves what it means to do logical reasoning on vectors instead of symbols, because reasoning is exactly what we'd like to do in this kind of knowledge graph. Here's what I mean. Let's say that someone asks about the nature of the relationship between Melinda and Seattle. We didn't observe any raw evidence of what that relationship is, but there is some other evidence in the graph.
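Before moving on to reasoning, the matrix-completion step described above can be made concrete with a minimal sketch of universal-schema-style logistic matrix factorization in plain numpy. The entities, types, observed facts, and training loop below are all toy assumptions for illustration, not the system from the talk.

```python
# Toy sketch of universal-schema matrix completion: embed entities and
# types in a shared latent space, train on observed (entity, type) cells,
# then score unobserved cells. All names and hyperparameters are invented.
import numpy as np

rng = np.random.default_rng(0)
entities = ["bill", "melinda", "freeman", "boyle"]
types = ["person", "executive", "philanthropist", "historian", "professor"]
# Observed (entity, type) facts, e.g. extracted from text or imported schemas.
observed = {("bill", "person"), ("bill", "philanthropist"),
            ("melinda", "person"), ("melinda", "philanthropist"),
            ("melinda", "executive"),
            ("boyle", "historian"), ("boyle", "professor"),
            ("freeman", "professor")}

d = 8  # latent dimension
E = {e: rng.normal(0, 0.1, d) for e in entities}
T = {t: rng.normal(0, 0.1, d) for t in types}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# SGD on a logistic loss: alternate observed positives with random cells
# (which are negatives unless they happen to be observed).
for step in range(4000):
    if step % 2 == 0:
        e, t = list(observed)[rng.integers(len(observed))]
        y = 1.0
    else:
        e = entities[rng.integers(len(entities))]
        t = types[rng.integers(len(types))]
        y = 1.0 if (e, t) in observed else 0.0
    p = sigmoid(E[e] @ T[t])
    g = p - y                         # gradient of logistic loss wrt the score
    E[e], T[t] = E[e] - 0.05 * g * T[t], T[t] - 0.05 * g * E[e]

def score(e, t):
    """Predicted probability that entity e has type t (dot product + sigmoid)."""
    return sigmoid(E[e] @ T[t])

# The unobserved cell ("bill", "executive") can now be scored even though
# it was never seen, via co-occurrence patterns shared with melinda.
print(score("bill", "executive"), score("boyle", "philanthropist"))
```

The same factorization applies unchanged to the relation-type matrix mentioned above, with entity pairs as rows instead of single entities.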
We can form chains of reasoning that literally consist of a path through the graph: Melinda is married to Bill, who is chairman of Microsoft, which is headquartered in Seattle, and perhaps we can use that as evidence, with some probability, that Melinda lives in Seattle. Traditionally I would write this down as a symbolic rule: if A is the spouse of B, and B is the chairman of C, and so on, then A lives in D. But what if, in the data we're asking about, instead of "chairman" it was "CEO"? We need another rule for that. What if it was "COO" instead? Another rule. What if it was "child of" instead of "spouse of"? Now we need a combinatorial number of rules to cover all these cases, and hence the difficulty of symbolic AI. But here we have vectors in this graph instead, which are exactly designed to help us generalize. So we've been working with methods that consume arbitrary-length chains through a graph like this, at each stage feeding the next piece of context into a recurrent neural network, which outputs a new vector representing the semantics of the relationship between the two endpoint nodes so far. When I add another link, the network composes the semantics further to represent the nature of the relationship between the endpoints. And indeed, when we train on data and pass a path like this into our network, the vector it produces is very close to the vector for the symbol "lives in": the neural network has learned to do a kind of logical reasoning that goes beyond symbols and learns to generalize. We've run this on a moderate amount of data from Freebase and from text, and to give you a sense of the kind of paths it can learn, here we're predicting the relation between a book and the original language in
which the book was written. Here's a path the method discovers: it goes from a book, to some other book that's earlier in the same series (think Harry Potter), to the author of that book, to the nationality of that author, to somebody else who shares the same nationality, to the language they speak. That seems like a reasonable chain to support some level of inference about the language the book was originally written in, and there are many other examples I won't take the time to show you. All right, some of you in the audience have seen some of this context before. Let me first say that, as is typical in research, in comparison with an older symbolic method that came out of Carnegie Mellon in 2012, we are gradually making accuracy improvements over time, and I think we're on our way toward a system that would be usable for deployment. But some newer work that I want to describe today addresses the situation that occurs in really large graphs, in which it takes some careful effort to find the chain that gives you a proof of the answer to your question. There are some paths that are relevant to the question we have, but there are many other paths that just don't carry any meaning at all, that are simply not relevant to the question we're trying to ask. In our original work we searched the graph by doing random walks, but that's really not sufficient for large scientific graphs, so we've been working on using reinforcement learning to guide the search for paths that prove the answer to the question being asked. What does this look like? We start at some entity; at each entity there's a large choice of different outgoing edges we could follow; we make some choice, and we continue following edges until the model decides to stop.
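The path-composition idea above, an RNN consuming one relation at a time and emitting a vector for the composed relation, can be sketched as below. This is a forward pass only, with random (untrained) weights and invented relation names; training by backpropagation is omitted.

```python
# Forward-pass sketch of composing a relation path with a vanilla RNN.
# Assumption: toy numpy illustration, not the talk's trained model.
import numpy as np

rng = np.random.default_rng(1)
d = 16
relations = ["spouse_of", "chairman_of", "headquartered_in", "lives_in"]
R = {r: rng.normal(0, 0.3, d) for r in relations}   # relation embeddings
W_h = rng.normal(0, 0.3, (d, d))                    # recurrent weights
W_x = rng.normal(0, 0.3, (d, d))                    # input weights

def encode_path(path):
    """Consume relation embeddings along a path, producing one vector
    that represents the composed relation between the path's endpoints."""
    h = np.zeros(d)
    for rel in path:
        h = np.tanh(W_h @ h + W_x @ R[rel])
    return h

def score(path, target_rel):
    # Higher dot product = the path is stronger evidence for the target
    # relation; with trained weights this path should score "lives_in" highly.
    return encode_path(path) @ R[target_rel]

path = ["spouse_of", "chairman_of", "headquartered_in"]
print(score(path, "lives_in"))
```

Because the encoder is recurrent, the same parameters handle paths of any length, which is what lets one model cover "chairman", "CEO", "COO", and their combinations without writing a rule for each.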
When the model thinks it has arrived at an answer, we check it: if the answer is correct, we give the reasoning agent a reward; if it's incorrect, we give it no reward; and we give it multiple chances to explore. Credit for right and wrong answers flows backwards through the graph, and we can do some learning. We want to learn this end-to-end by gradient descent: the training objective is to maximize expected reward, but the choices to be made are discrete, so we use the REINFORCE trick to get a smooth gradient out of a set of discrete choices. This, together with various methods like multiple rollouts per question, allows the system to converge and to reach very nice answers. Some other work, which I'm not describing on a slide here but for which I think there's a poster today, uses quite similar methods, in a very large corpus, to decide even which paragraph to read next in order to look for the answer to some question in an open QA system. To give you a sense of the kind of path this reinforcement-learning agent finds most efficient: when asked, say, what is the country of origin of the film Step Up Revolution, it had no immediate evidence for the answer, but it jumps from the film to its production company, to where the production company is located, et cetera. So here are some examples in general knowledge; we're now very interested in making this work in biomedicine. We've been collaborating with Andrew Su at Scripps on applying this to biomedical data, looking at much more complicated paths of longer length. We're excited about this work and would be very happy to find other collaborators.
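Here is a minimal tabular sketch of the REINFORCE-style path search just described: a softmax policy over outgoing edges, rollouts that get reward 1 for reaching the answer node, and a score-function gradient update with a running baseline. The tiny graph, node names, and hyperparameters are invented for illustration, not the talk's actual system.

```python
# Minimal REINFORCE sketch for learning to walk a knowledge graph
# toward answer nodes. Assumption: toy tabular example with made-up nodes.
import math, random
random.seed(0)

graph = {  # node -> list of (edge_label, next_node)
    "film":   [("production_company", "studio"), ("genre", "dance")],
    "studio": [("located_in", "usa"), ("founded_by", "someone")],
    "dance":  [("related_to", "music")],
    "usa": [], "someone": [], "music": [],
}
answer = "usa"        # target of the toy "country of origin" query
theta = {}            # policy logits, keyed by (node, action index)

def policy(node):
    """Softmax over the outgoing edges of a node."""
    acts = graph[node]
    logits = [theta.get((node, i), 0.0) for i in range(len(acts))]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def rollout():
    node, trail = "film", []
    for _ in range(2):               # walk at most two hops
        if not graph[node]:
            break
        probs = policy(node)
        i = random.choices(range(len(probs)), probs)[0]
        trail.append((node, i, probs))
        node = graph[node][i][1]
    return node, trail

baseline, lr = 0.0, 0.5
for _ in range(300):
    end, trail = rollout()
    reward = 1.0 if end == answer else 0.0
    baseline += 0.05 * (reward - baseline)   # running reward baseline
    for node, i, probs in trail:             # REINFORCE: advantage * grad log pi
        for j in range(len(probs)):
            g = (1.0 if j == i else 0.0) - probs[j]
            theta[(node, j)] = theta.get((node, j), 0.0) + lr * (reward - baseline) * g

# After training, the policy should prefer the edges leading to the answer.
print(policy("film"), policy("studio"))
```

The baseline plays the role of the variance-reduction tricks mentioned above; multiple rollouts per question would simply repeat `rollout()` and average the updates.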
I was going to say something about the case when there are many, many types, but sorry, let me show something else first. What we've talked about so far fits into a worldview something like this: at the bottom, in orange, I'm trying to show some raw text, with entities in these ovals, and the crayon mark is some textual evidence of a relationship between them. We use those, after linking them to entities, to infer that there is some relation between these entities in the knowledge base. That pretty much describes what we've been talking about so far, in terms of raw evidence and the inferences we make at the entity level. But I actually think that what we should be doing should look much more like this: sometimes it's not sufficient to reason about relations at the raw entity level, because sometimes the relations are only true in particular contexts. This gene may upregulate this protein, but only when the gene is expressed in liver tissue and not some other tissue of the body, or only under certain metabolic conditions. So we need to reason about what I think of as sub-entities, or entities in a particular context, somewhat intuitively analogous to this: there's Barack Obama the entity, but then there's Barack Obama as a teenager in Hawaii, and as a senator from Illinois, and as president, and as past president, and different relations were true of Obama during these different stages. Of course we can also make some inferences at the entity level, but there are also relations among types, at even higher levels of abstraction, that can be learned; I think of these as common-sense relations among types. So we'd really like to be reasoning throughout this graph, and to reason about hierarchies of concepts that exist in some hypernym relation throughout. This has us thinking quite a bit about these types of entities and
how it's often the case that they exist in hierarchies, and how we can use these taxonomies to help improve inference. We can work with various embedding methods that are aware of these taxonomies, and use distance measures between vectors that can be asymmetric, to represent the asymmetric is-a relationship. Things like bilinear mappings, which go through some matrix in the middle instead of a simple dot product, yield some asymmetry; complex-valued vectors also offer some asymmetry, and we can represent hierarchical relationships with them as well. Then, when we see some textual mention, we can train a model that creates a vector for that mention and ask: does this vector have the is-a relationship with "founder"? This lets us answer questions about types, and furthermore lets us leverage the hierarchy, because we will also have trained on pairs of types that fall into the hierarchy. So we've been looking at various sources of data with fine-grained types that exist in a hierarchy. Probably the most widely used dataset for fine-grained types has been the FIGER dataset, with about 120 types. We wanted something much more substantial, to reason in a much more fine-grained fashion, so we created a new dataset called TypeNet, with almost 2,000 types in it. What we find on this dataset, and this gives you some intuition for its level of granularity and what its shape looks like, is that leveraging the hierarchy yields more than a 20 percent decrease in relative error, going from 68 percent up to 78 percent accuracy, by using the fact that these things fall into a hierarchy. All right. Lastly, we've been thinking further about other, richer ways to represent taxonomies, ontologies, and this kind of implicature, ways based not purely on vector points in space but on something that
really, in a way, more inherently represents the breadth of a concept. A few years ago we did some work on what we called Gaussian embeddings: each concept, rather than being a vector, is a Gaussian region in space, and different concepts can have different variances. After training on raw Wikipedia text we were able to learn that "person" is a broad concept with high variance, that "musical composer" is something much more specific that sits inside it, and that "famous" is a broad concept, and we even naturally learned that Johann Sebastian Bach sits exactly at their intersection, which was nice to see. So an embedding method that can capture variance has some nice advantages, but Gaussians in particular have a disadvantage: they are not closed under intersection, because the intersection of two Gaussians is not a Gaussian. So recently we've been looking at other methods that do have this closure property. One of them is work by others on order embeddings, which define each concept as a vector constrained to be non-negative, where that vector defines a region of space spreading away from the origin; the intersection of two such regions is itself a region of the same kind, anchored at the point where they meet and spreading away from the origin. So order embeddings are closed under intersection, but mathematically the model is just not able to capture negative correlations. So last year we published an ACL paper on something we call box embeddings, in which each concept is an n-dimensional box, defined by a center and by its width in each dimension. That effectively just doubles the number of parameters, not that much; it's closed under intersection, because the intersection of two boxes is always another box; and it can represent negative correlations. You can think about these a lot like Venn diagrams, and the
volume of a box is proportional to its marginal probability; the volume of an intersection can be used to calculate conditional probabilities; and we can condition on multiple other boxes, looking at the intersection of all of them, to answer quite complicated questions of the form: conditioned on this evidence, what do we believe about the probability that certain other things are true? In fact we're even investigating these as a whole-cloth alternative to graphical models. There are some things these will definitely do worse than graphical models, but some things they will do better: they can efficiently do inference over, say, hundreds of thousands of variables and answer probabilistic questions about them in ways that don't rely on sparsity in the dependencies among them. On some toy data that we created by hand, just for the sake of understanding what was going on in the model, we find that, in comparison with order embeddings, our model learns sensible boxes with sensible overlaps, uses the space well, and in fact fits the data much better. Here's the true conditional probability table that was part of our training data: order embeddings were not able to match it well at all, but our boxes matched it quite well, and we can see that the model is also able to represent negative correlations. Here we're asking: what's the probability of being a plant conditioned on nothing? Just the prior. But conditioned on being green, your probability of being a plant goes up. What's the probability of being a plant given that you're a snake? It's zero; that's a negative correlation, and it's exactly what order embeddings could not represent. And with multiple conditioning queries we can represent things like: what's your probability of being a deer given that you're not white? And so on.
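As a toy illustration of the box calculus just described, here are hand-set (not learned) two-dimensional boxes inside the unit square, with marginal probability given by box volume and conditional probability by intersection volume. The concepts and coordinates are invented to mirror the plant/green/snake example.

```python
# Toy box-embedding probability calculus. Assumption: boxes are set by
# hand for illustration; in the real model they are learned by gradient descent.
import numpy as np

# Each concept is an axis-aligned box: (min corner, max corner) in 2-D,
# inside the unit square, so volume = probability mass.
boxes = {
    "plant": (np.array([0.0, 0.0]), np.array([0.4, 1.0])),
    "green": (np.array([0.2, 0.0]), np.array([0.5, 1.0])),
    "snake": (np.array([0.5, 0.0]), np.array([0.9, 1.0])),
}

def volume(lo, hi):
    # Clip negative side lengths to zero: disjoint boxes have empty intersection.
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

def intersect(a, b):
    # The intersection of two boxes is always another box (closure property).
    return np.maximum(a[0], b[0]), np.minimum(a[1], b[1])

def p(concept):
    """Marginal probability: the box's volume within the unit square."""
    return volume(*boxes[concept])

def p_cond(concept, given):
    """P(concept | given) = vol(intersection) / vol(given)."""
    inter = intersect(boxes[concept], boxes[given])
    return volume(*inter) / volume(*boxes[given])

print(p("plant"))                # prior ~ 0.4
print(p_cond("plant", "green"))  # goes up: the boxes overlap heavily
print(p_cond("plant", "snake"))  # 0.0: disjoint boxes, a negative correlation
```

Conditioning on several pieces of evidence at once is the same computation with the intersection of all the evidence boxes in the denominator.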
All right, but what really matters, and what's more interesting to me, is something like this, in which we operate on arbitrary pieces of text and do what I really think of as a kind of common-sense reasoning. There's a Flickr dataset in which we're not going to concern ourselves with the images, just the textual captions that people put on those images, and it's a large collection. For each caption we use a parser to break it down into smaller pieces, so that we get both fine-grained and coarse-grained pieces of text describing the image, and then we can calculate probabilities. What's the marginal probability of finding an image captioned with "dog"? Count how many images were labeled with "dog", even when the original caption was something much larger than that small phrase. We can also ask about joint probabilities: what's the probability of seeing both "dog" and "grass" together? We can compute that by counting the images in which "dog" appeared at all and in which "grass" appeared at all. Then in our model we take an LSTM, run it over arbitrary-length sequences of text, and have the LSTM output a box; we can look at overlaps between these boxes to answer conditional-probability questions, what I think of as common-sense questions. Here are some simple examples: if you observe a caption like "a person drags a suitcase," what's the probability that this implies luggage is being dragged? That's probability one. If you're told that somebody is holding an instrument, what's the probability that you're in a basement? That's low; it's not zero, but it's low, and I would say that represents a kind of common sense. We are now applying this to scientific text.
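The counting step that produces these caption probabilities can be sketched as below; the captions are invented stand-ins for the Flickr data, the matching is naive substring matching, and the LSTM-to-box model trained on these targets is omitted.

```python
# Counting-based marginal, joint, and conditional probabilities over
# caption fragments. Assumption: toy captions, naive substring matching.
captions = [
    "a dog runs on the grass",
    "a dog sleeps on the couch",
    "a person drags a suitcase",
    "a person holds an instrument on stage",
    "a dog chases a ball on the grass",
    "a cat sits on the grass",
]

def p(term):
    """Marginal: fraction of captions mentioning the term."""
    return sum(term in c for c in captions) / len(captions)

def p_joint(a, b):
    """Joint: fraction of captions mentioning both terms."""
    return sum(a in c and b in c for c in captions) / len(captions)

def p_cond(a, b):
    """Conditional P(a | b) estimated from the counts."""
    return p_joint(a, b) / p(b)

print(p("dog"))                 # 3 of 6 captions
print(p_joint("dog", "grass"))  # 2 of 6 captions
print(p_cond("dog", "grass"))   # 2/3
```

These counted probabilities serve as the training targets; the learned boxes then generalize the same conditional queries to phrases never counted together.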
One of our goals is to capture scientific common sense. This approach yielded a new state of the art on Flickr, and there are some new results about it, which you can see in the poster by Lorraine Li, one of my students, who is here today. All right, so what we really want to be doing in the future is reasoning about large graphs like this, with many entities, at the sub-entity level and at the type level as well; representing their mutual implicature with boxes; and reasoning through chains in this graph at various levels of abstraction throughout. That's the research path we're on. We've talked about representation and reasoning over text data using, instead of symbols, universal schema and vector embeddings; about representing taxonomies with box volumes; and about chains of reasoning. With that, I'd be happy to answer your questions. Thank you. "How does this work? Oh, there, sorry, thank you. With scientific papers, what happens when there are contradictory relations that are extracted and added to the graph?" Okay, yes, wonderful. In the original work we take all the text as training data, and, as so often happens in machine learning, we in a way assume that the training data is true, while the training is robust to noise in that data. When asked what might be true, again as is often the case in machine learning, the system relies on most statements in the training data having been true, and it will answer with what the majority opinion was in the original raw training data. In the chains-of-reasoning work, so far what we do is look for the many possible paths that could provide evidence of a positive answer, zero in on, or pay attention to, that one best path, and then ask whether the strength of that evidence is above a threshold for us to say yes. I think this was a reasonable way to start, but what we're considering doing next, which I think is
necessary, is also to look for negative evidence: look for the path that would make us most strongly say no, and then compare the positive and negative evidence. Someone might ask, "Do Bill and Melinda hate each other?" I don't know; somewhere the system might find some text that talked about some disagreement they had, and if we zero in on that one piece of text, the system might say, well, they hate each other. But we should also have looked for evidence to the contrary, and seen all the evidence that they run a nonprofit together, and they're married, and they have children, in order to say, no, of course they don't hate each other. "Is the probability of plant given carnivore actually zero?" I think a biologist can answer that better than I can; I guess there are Venus flytraps, et cetera. But if you really meant to ask me something else, feel free to pipe up again. "How do you deal with sparsity of the labeled data?" Let's see. In matrix completion for Netflix, an incredibly small fraction of the matrix is actually filled in, and yet training works; one of the powerful aspects of these vector-based matrix-completion methods is that they tend to work quite well even when the data is sparse, and that's also what we're finding here. Of course, having more labeled data helps, and we do have a thread of research right now on active learning in the universal-schema context: if we want to ask humans for some additional labels, what should we ask them in order to get the best benefit? But I would say that so far our evidence is that these methods work well with sparsity. All right, thank you very much.