Yeah. Welcome, everyone. Our first speaker is Andrew McCallum, so I don't think I need to give a long introduction, because everyone should know him already. He's one of the people who has worked for a long time on the intersection between machine learning and knowledge representation, and he does much work on information extraction in particular. If you were at the poster session on Monday, you saw many posters with his name. I think today's talk will give a nice overview of the research topics at UMass, where he's a professor. So yeah, thank you for coming.

Thank you, Veronica, for the invitation. I'm very happy to be here. I enjoyed this workshop so much last year, and I'm glad to see it continuing. Another event that's continuing that I want to mention as I get started: about 10 years ago, I started a workshop on Automated Knowledge Base Construction that happened in Grenoble. It continued as a workshop nearly every year after that, and just last year we turned it into a free-standing conference for the first time. Folks like Sebastian Riedel and Fernando Pereira and many others were there. It was a great success, and it will be happening again this coming June in Irvine, California. There'll be a paper deadline, let's see here, most likely in early February, so look for those announcements coming out soon.

All right, so I'm happy today to talk about deep learning for knowledge representation and reasoning. Sometimes a machine is asked to make a decision from some perceptual information, like an image in which it's asked to say whether a patch contains a certain kind of cancer. But other times, a machine is asked to do something that feels less like the type 1 kind of reasoning I just described and more like type 2, something that might require a few steps of reasoning.
Now, of course, we can take a task like this and turn it into type 1-style processing by just trying to find documents that are the nearest neighbors to the query. That's essentially information retrieval, and it goes one step towards helping people answer this question, but it would be nice to get a more detailed answer. There's been quite a bit of interest lately in question answering from text, where we pick a document, pick a segment from that document, and do some additional work on it to try to pull out a single entity. But sometimes, in order to answer a question, we really need to integrate information across multiple documents or across many different sources of information. And that's harder to do without some notion of what an entity is, what entity resolution is, and then some notion of what those relationships look like. So this is one reason that a lot of people are interested in knowledge bases: graphs with nodes for entities and edges for relations. Given a question like this, we can find where heart disease is represented here, and although there may not be answers that are just one hop away from that, we can do some reasoning through multiple hops in order to find answers to our question, and also see the textual evidence that caused those edges to exist. Because of these great capabilities, there's been a lot of interest from many commercial entities, including IBM, of course, in building such large knowledge bases. So now I want to talk a bit about where the purple items in this picture come from: what I think of as the schema of entity types and relations. Here, for example, this entity is called a disease. That might be a fine-grained enough notion of what this entity is in order to answer some questions, but in other contexts, we might wish we had a more fine-grained notion of what this is, like a disease that's a genetic disorder.
Here's another entity that has a particular type, but maybe in other contexts we would wish we had a more coarse-grained notion of what this is, to help us reason with more generality. The same things apply to relations as well. So one of our major goals is to build a knowledge base that, in a sense, we think of as having an open schema: one that lets us ask about things at various levels of granularity, or ask about things according to many different schema types that arise naturally in the world, with the same level of openness that we're used to having when we do keyword search for documents, where we can just type in anything and ask about anything, but at the same time preserving entity-relation structure and providing the ability to do some reasoning. That's a lot of what I want to talk about today. As a first step, I want to introduce you to the work that we've been doing over roughly the last decade in a system that we call universal schema. Our goal is to build a knowledge base, and we'll have many sources of evidence, but among them will be some text. What does it look like to build a knowledge base from text? Well, we've got to find the entity mentions. I want to do entity resolution to find the multiple places where Bill Gates is mentioned, and I can do this for many, many different entities that appear in the text and elsewhere. I, as the designer, may have a notion of what knowledge base schema I have in mind, so I've defined that, and these form some columns in this small matrix that I'm building up here. I may even have some prior knowledge that lets me fill in, at least for some of these entries, which entries are people, or what their various different types are. But it may not be complete: here, I don't know what Seattle is yet. I might try to augment my information by importing some previously structured knowledge into my knowledge base.
Some other knowledge base, built by somebody else, that has information that's relevant to me. Now, the person who built that knowledge base probably didn't have exactly my schema in mind, and it comes in some different schema. I could try to align their schema to mine exactly, and declare that my schema is going to store the truth and that will be the end of it. But there are many concepts that overlap only partially, and making that alignment can be surprisingly difficult. So the approach that we've been advocating in universal schema is that we're going to keep around all of the input schemas all at once, not try to pack all of our semantics into one set of boxes to fit them all, and then, nonetheless, learn mutual implicature amongst all of these. We're going to embrace the diversity and the ambiguity of these many different schemas. And another schema, actually, is the symbols that come from raw text: the textual expressions for things are yet another schema. It's the natural language schema, and it's very big. So these matrices can have tens of thousands or hundreds of thousands of columns of ways to express entity types and relation types. I may observe some things directly, in which case I have some things colored in here already. Here, let's focus on relation types. If I want to make a prediction, well, is Obama, or was Obama, sorry, this slide's a little bit out of date, a president of the United States? Well, that was directly observed, so I can answer that. But I'd like to be able to answer for all the other cells as well, whether they're true or not. So I want, in a sense, to do matrix completion here, to get this to generalize, and I'm going to do that by giving myself vectors for each of the rows and columns: a vector to represent each relation type and a vector to represent each entity pair.
And I train this by picking one of the positive observed cells and looking at the dot product of the vectors corresponding to that intersection, then picking another empty cell in the same row that I'll assume for the moment is negative, with another dot product from its corresponding vectors. My objective function for training is that the first dot product should be larger than the second. So in a second, this training instance will take these three vectors, and the objective will say, well, since Bill and Steve are friends, I want these to be close to each other so they have a high dot product; and since Bill is not the president of Steve, I want to push those further away. Training then arranges all of these vectors in a space that causes them to answer these kinds of questions correctly, learning things like: CEO and organization member are related to each other, head of state and president are related to each other, et cetera. Here are some examples that come from some real data. Here we observed that Volvo bought a stake in Scania, which is another car company, and from that we infer that Volvo owns a percentage of Scania. Here's another example that I like because it shows the model's ability to capture asymmetry. We observed that Kevin Boyle is a historian at Ohio State, and we predict he's a professor at Ohio State; but when we observe that Freeman is a professor at Harvard, we don't necessarily infer that he's a historian. All right, so this is my brief introduction to universal schema. What do we end up with here? It's a knowledge base, but where instead of having just symbols on the nodes and edges of our knowledge graph, we have vectors to help us generalize. I find that particularly fascinating: it's the combination of a structured view of the world and a smooth, embedded view of the world.
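To make the training objective concrete, here's a minimal sketch of a pairwise (BPR-style) ranking update over the matrix of entity-pair rows and relation columns. The names, dimension, step count, and learning rate are all illustrative, not the actual system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary: relation columns and entity-pair rows.
relations = ["friend-of", "president-of"]
pairs = [("Bill", "Steve"), ("Obama", "USA")]

dim = 8
R = {r: rng.normal(scale=0.1, size=dim) for r in relations}   # relation vectors
P = {p: rng.normal(scale=0.1, size=dim) for p in pairs}       # entity-pair vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_step(rel, pos_pair, neg_pair, lr=0.5):
    """One ranking step: the dot product at the observed (positive) cell
    should exceed the dot product at an empty (assumed-negative) cell
    in the same column."""
    r, p, n = R[rel], P[pos_pair], P[neg_pair]
    margin = r @ p - r @ n
    g = 1.0 - sigmoid(margin)          # gradient scale of -log sigmoid(margin)
    R[rel] += lr * g * (p - n)
    P[pos_pair] += lr * g * r
    P[neg_pair] -= lr * g * r

# Observed cell: ("Bill","Steve") friend-of is true; the ("Obama","USA")
# cell in that column is empty, so sample it as a negative.
for _ in range(200):
    bpr_step("friend-of", ("Bill", "Steve"), ("Obama", "USA"))

assert R["friend-of"] @ P[("Bill", "Steve")] > R["friend-of"] @ P[("Obama", "USA")]
```

A single step mirrors the spoken description: pull the friend-of vector toward the Bill and Steve pair vector, and push it away from the sampled negative.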
So now I'd like to talk briefly about how we do reasoning here. This may be a review for some of you who have heard me talk before, but I'm going to be talking about some newer work when I get into box embeddings in a bit. Let's say that somebody asks you about the nature of the relationship between Melinda and Seattle. Well, you didn't observe any direct edge between them; they never co-occurred in a sentence or in any other knowledge base. So at first I think, well, I have no evidence; there's no answer I can give for what I think that relationship might be. But there is other information around those two entities in the graph, and perhaps I could notice that Melinda and Seattle are indirectly connected by a path that says: Melinda has Bill as a spouse, Bill is the chairman of Microsoft, which is headquartered in Seattle. Using that chain of reasoning, I might be able to infer with some probability that I think their relationship is lives-in. In an old-style symbolic way of reasoning, I would write that knowledge down with a rule like this: if A is the spouse of B, and B is the chairman of C, and C is headquartered in D, then I infer with some probability that A lives in D. And that's all well and good. But what if instead of chairman I have CEO? I need another rule for that, and another for every similar relation. Or what if, instead of talking about the spouse for a hop in my chain, I'm talking about a child-of instead? Well, now I need a combinatorial number of different rules, and it's looking painful. So we have been working on leveraging the fact that we have embeddings on the edges of this knowledge base to do reasoning not on symbols, but on vectors instead. And after many decades of research and work on what it means to do logical reasoning on symbols, I find it especially fascinating to consider: what does it look like to do logical reasoning on vectors?
We've been using a recurrent neural network that consumes an arbitrary-length path, at each step taking in the vector of the next edge and outputting a vector that represents the semantic composition of what it believes is the nature of the relationship between the endpoints so far. And by the time it has completed the path shown here, our model indeed produces a vector that's very close to the lives-in relation. So the parameters of this recurrent neural network have basically learned, I mean, embedded in there are rules of logic about how to compose the semantics of the steps of this chain of inference. We've been doing this work since 2015, with a number of improvements. When run on a moderate amount of data, here are some example predicted paths that pop out. Here I'm trying to predict, given a book, the original language in which it was written, for a case in which that direct edge does not exist in the knowledge base. The path found by our system goes from the book to another book, the previous one in the series. Who was the author of that book? What's that person's nationality? What's another person who shares that nationality, and what language do they speak? That seems like a reasonable chain of inference to estimate what a book's language is. There are a number of other examples I'm not going to take the time to show. Over time, through various research improvements, accuracy on this task has been growing. And actually, there's some work that we just put on arXiv yesterday that also has to do with chains of reasoning.
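The path-composition idea above can be sketched with a toy recurrent network that consumes one relation vector per edge. The relation names, the dimension, and the plain Elman-style recurrence are illustrative assumptions; the published models are trained and more elaborate:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 6

# Hypothetical relation vectors (in the real system these would come
# from universal-schema training).
rel_vecs = {
    "spouse-of": rng.normal(size=dim),
    "chairman-of": rng.normal(size=dim),
    "headquartered-in": rng.normal(size=dim),
}

# Simple recurrent parameters; untrained here, just to show the shape
# of the computation.
W_h = rng.normal(scale=0.3, size=(dim, dim))  # recurrent weights
W_x = rng.normal(scale=0.3, size=(dim, dim))  # input weights

def compose_path(path):
    """Consume a chain of relation edges one at a time; the hidden state
    after the last edge represents the composed relation between the
    path's endpoints."""
    h = np.zeros(dim)
    for rel in path:
        h = np.tanh(W_h @ h + W_x @ rel_vecs[rel])
    return h

composed = compose_path(["spouse-of", "chairman-of", "headquartered-in"])
# With trained parameters, `composed` should land near the lives-in
# vector; with these random weights we only check the shape.
print(composed.shape)
```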
But in a way it's like a mixture between the chains of reasoning here and textual question answering, in that the pieces of evidence it consumes at each step are not just individual links in a knowledge base, but whole passages of the kind that would be consumed by a text-based question answering system. So in a way it's a looser way of doing these chains of reasoning, and I'm quite excited about that work also. This is work by one of my students, Rajarshi Das. All right, so in a really big knowledge base you also need to think carefully about how you find the path that you think leads to the right answer. This is a path that makes sense, but there are many other paths, and some of them just don't carry much semantics at all. In a big graph there are many, many paths to explore, so we'd like to be more clever about how we explore the different candidate chains. This is essentially like searching for a proof of the answer to the question we're trying to address, and we've been doing work in reinforcement learning to do that search intelligently. Here the setting is that you're given an entity and a relation, and to answer the question you're asked to fill in the second entity to complete the triple. We start at the source entity; there are a number of different outgoing relations we could choose among. The reinforcement learning agent chooses one of them as an action, and continues making choices until it thinks it has arrived at an answer. If it outputs the correct answer, the agent gets a reward of one. Hello, Selim, welcome. I'm so happy to see you here. Thank you, well, I'm relying on your pointed questions as usual; I'm sure you'll keep me on my toes. And if the agent arrives at the wrong answer, it gets a reward of zero. We train this on situations in which we know the true triples.
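Here's a toy sketch of such a path-walking agent, trained with the REINFORCE policy-gradient trick. The tiny graph, relation names, step limit, and learning rate are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny hypothetical graph: entity -> {relation: next entity}.
graph = {
    "StepUpRevolution": {"production_company": "Summit", "genre": "Dance"},
    "Summit": {"located_in": "USA"},
    "Dance": {},
}
answer = "USA"  # gold target for (StepUpRevolution, country_of_origin, ?)

# One policy parameter per (entity, relation) action; a softmax over the
# actions available at the current entity gives the policy.
theta = {(e, r): 0.0 for e, rels in graph.items() for r in rels}

def rollout():
    """Sample a 2-step walk; return (trajectory, reward)."""
    e, traj = "StepUpRevolution", []
    for _ in range(2):
        acts = list(graph[e])
        if not acts:
            break
        logits = np.array([theta[(e, a)] for a in acts])
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = acts[rng.choice(len(acts), p=probs)]
        traj.append((e, a, probs, acts))
        e = graph[e][a]
    return traj, float(e == answer)   # reward 1 only if we land on the answer

# REINFORCE: scale the log-probability gradient by the trajectory reward.
for _ in range(300):
    traj, reward = rollout()
    for e, a, probs, acts in traj:
        for i, act in enumerate(acts):
            grad = (1.0 if act == a else 0.0) - probs[i]
            theta[(e, act)] += 0.5 * reward * grad

# The rewarded first step should now be preferred over the dead end.
print(theta[("StepUpRevolution", "production_company")] >
      theta[("StepUpRevolution", "genre")])   # True
```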
And indeed we're able to get this to learn successfully. Because we're trying to learn by gradient descent, which requires a smooth space, but the actions are discrete, we need, of course, the REINFORCE trick, which is a way of using sampling to get gradients from discrete choices like this. It does indeed converge, and there are some special tricks we can use to get it to converge even better. Here's an example of an inferred path learned by this reinforcement learning agent. Let's see here, I guess we could look at the top one. Given the film Step Up Revolution, we're trying to determine its country of origin, and there was no direct connection in the knowledge base, but the agent finds that if it goes to the production company and then asks where it's located, that's a chain that may provide some good evidence. All right, so we've been talking about reasoning about entities and their relations and textual evidence, and I want to try to paint a larger picture here. Sorry, this is a bit abstract, so let me give you a guided tour. The orange circles at the bottom are meant to represent mentions of entities in text, and the edge here is maybe some text that we believe is expressing some relationship between them, and the blue dots represent resolved entities. The scenario I've described so far looks at some particular textual evidence, resolves the mentions to their entities, and then, based on the textual evidence, decides that we think there's some relation there. So this is the world view I've been talking about so far, but actually I think we should be thinking with a world view a bit more like this. What do I mean here? The light blue circles down here I think of as sub-entities.
As an aside, I haven't talked much about entity resolution here, but we've been doing a lot of work there as well, and the style of entity resolution we do is hierarchical, so it builds trees along the way. There may be 50,000 mentions of Barack Obama, and there's some node in that tree that represents Barack Obama as an entity, but there's a lot of substructure in the tree between the mentions at the leaves and the Obama node. I think about those interior nodes as what we call sub-entities. Maybe a lot of us, including me, don't know a lot about biomedicine, but for Obama we could imagine interior nodes that would represent Barack Obama as a teenager in Hawaii, as senator, as candidate for president, as president, and post-president, and certain relationships would be true of those sub-entities that are not true of other sub-entities. So it's not always best to represent an entity at the high-level entity level; we might sometimes like to express relationships at the sub-entity level. Similarly, there's further abstract structure above the entities, and I think of these things as types: Barack Obama as a person, and a man, and a politician, and a book author. So those are additional types, and we can also put relationships there, and I think of that as representing common sense, which is really quite nice. To answer a question, we may like to operate at different levels of abstraction here, and actually make chains of reasoning in a graph that looks like this, which sometimes will operate by traversing edges up at the abstract level, because we can answer a question perfectly fine there, but in other cases we really need the context and the specificity, maybe even traversing edges all the way down to some particular piece of text, capturing some particular context along the way on our chain of reasoning.
All right, so we've been thinking quite a bit about these very deep abstraction hierarchies, and that's reflected in the fact that these entity types, which we've previously been treating as sitting in some flat space, would actually be better represented if we put them into a hierarchy. So a couple of years ago we were looking at: if we have a hierarchy like this, can we leverage it to do an even better job, to learn better vectors that give more accurate answers? The answer is yes. For this we need some mathematical operation between vectors that's not symmetric, because I'm trying to capture a directed arrow here, an asymmetric relationship between, say, person and athlete: all athletes may be persons, but not the other way around. So I can't use the dot product, which is symmetric, but I can use a bilinear model, which is non-symmetric; that's one choice. Another choice is to use complex vectors, vectors that have complex numbers in all of their elements, which, just by the way the mathematics works out, also happens to be non-symmetric. When we do this, we can also take textual mentions, learn to map them into the vector space, and then be able to predict their types as well. So when we train with the matrix objective I described earlier, but also add to our objective the notion that you should obey the tree edges that we have as prior knowledge, then indeed we do a better job, as I'm about to describe in a moment.

Let's see: many of the existing type hierarchies out there in previous data sets, most notably the FIGER data set of fine-grained types, weren't nearly big enough for us; we were interested in something much bigger. So we created a new data set we call TypeNet, from the union of WordNet and Freebase, with some automated editing, as well as picking the types that really made the most sense and chopping off the upper parts of the WordNet hierarchy that seemed just a bit too abstract. To give you a sense of the resulting hierarchies, here are some samplings of some parts: parents of an Olympic host city, parents of cheese, or parents of a drafted athlete. You can see it's not a tree, it's a DAG, which is interesting, and it can be quite deep. What we found is that, in comparison with our original methods that did not use the hierarchy at all, represented by these upper numbers, 68 and 69, using the tree in various combined ways yields more than a 10% increase in accuracy, which we were very happy to see.

All right, so now for the main body of new work that I have in mind, I want to talk about the following. Here's this rich hierarchy, and we've already noted that it's not just tree-shaped, it's DAG-shaped. And although I've drawn hard edges here, I actually want to claim that I'm really campaigning for something a little bit softer, something that would be able to represent probabilities: things like, many politicians are book authors, but not all of them. It would seem unsafe to put an edge from politician up to book author, but I'd rather not just drop that known correlation on the floor and not represent it at all. So what kinds of representations can I use to represent DAGs, and also probabilities, and also gain some of the softness that I get from representing things in some embedded fashion? This leads me to work that I'm extremely excited about, that we've been doing over the last three years or so, on box embeddings. This is work by my students Xiang Lorraine Li, Luke Vilnis, and Dongxu Zhang, and also one of my new postdocs, Michael Boratko, actually with us here in the audience. We've also been doing some collaboration with IBM on this topic as well.
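Before the details on the next slides, the basic box calculus can be previewed concretely: each concept gets an axis-aligned hyperrectangle inside the unit box, volume plays the role of marginal probability, the intersection of two boxes gives their joint, and conditionals are ratios of volumes. A minimal sketch with hard (non-smoothed) volumes and hand-placed, purely illustrative boxes:

```python
import numpy as np

# A box is a pair of corner vectors (min, max) inside the unit hypercube.
def volume(box):
    lo, hi = box
    # Clip negative side lengths to zero so empty boxes get volume 0.
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

def intersect(a, b):
    """Boxes are closed under intersection: the overlap is another box
    (possibly empty)."""
    return (np.maximum(a[0], b[0]), np.minimum(a[1], b[1]))

def p_cond(a, b):
    """P(a | b) = volume of the intersection / volume of b."""
    return volume(intersect(a, b)) / volume(b)

# Hypothetical 2-d boxes: mammal contains rabbit; deer is disjoint from rabbit.
mammal = (np.array([0.1, 0.1]), np.array([0.9, 0.9]))
rabbit = (np.array([0.2, 0.2]), np.array([0.4, 0.4]))
deer   = (np.array([0.6, 0.2]), np.array([0.8, 0.4]))

print(p_cond(mammal, rabbit))  # 1.0: every rabbit is a mammal
print(p_cond(rabbit, mammal))  # small: few mammals are rabbits
print(p_cond(rabbit, deer))    # 0.0: disjointness is representable
```

In the actual work the boxes are learned by gradient descent rather than placed by hand.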
All right, so in most of the work on deep learning for NLP, and in deep learning generally, how do we represent concepts? We represent them with vectors, which we can think of as points in n-dimensional space. Vectors do a nice job in a lot of ways; let me go back: they can be arranged so that neighborhood relations amongst the vectors make sense, and the animals tend to be clustered together here, differently from the furniture. But there are some regrets here. Here I have rabbit and mammal, and the vectors are not representing the notion that mammal is a more general concept than rabbit; it would be nice to capture that in a very clear way. So vectors don't really capture a region, with the notion that some concepts cover a broad region and others are more narrow, and also the typical operation between them is the dot product, which is symmetric, not asymmetric. With that in mind, in 2014 my student Luke Vilnis and I had an ICLR paper on associating each concept not with a point but with a Gaussian in space. General concepts can have a broad variance, and more specific concepts can have a narrower variance, so we therefore have a region for each one of these, and there are asymmetric distance measures between these Gaussian distributions; basically we used a KL divergence, which is asymmetric. We can represent concepts that are disjoint from each other. But Gaussians had a number of problems that I don't have time to get into fully; among them is that they're not closed under intersection: the intersection of two Gaussians is not Gaussian-shaped. So that made us interested in some alternatives, and one that's been explored by others is the so-called cone representations. Here we associate each concept with a single point, but actually that point is going to represent a region that spreads away from the origin in all of the dimensions, so it represents a cone that
spreads away from the origin. Concepts with embeddings closer to the origin cover a broader region, and the further you get from the origin, the smaller the cone, representing more specific concepts. So you can represent a region, you can have asymmetric distance calculations, and it's closed under intersection, because everything that's both an herbivore and a mammal would be captured by a point at this intersection exactly. But I can't capture things that are disjoint, because far enough out everything overlaps eventually. There's a probabilistic version of this. Here I have a rabbit and a deer, but this intersection here represents the region of being both a rabbit and a deer at the same time, which, well, that just seems wrong. And it gets even more ridiculous, because if I ask, well, what's the probability of being a rabbit conditioned on being a deer, what does that look like in Venn-diagram fashion? You take the volume of the intersection and divide it by the volume of deer, and the volume of the intersection is actually a pretty large proportion in comparison, almost a third or more. So again, this just seems wrong, the fact that we can't represent disjoint things. With that in mind, we've been working on associating each concept with an n-dimensional box in space. Boxes capture a region; one box can contain another box; there are asymmetric distance measures between them; they can represent disjointness; and boxes are closed under intersection, since the intersection of two boxes is another box. Furthermore, we can train these boxes so that they all sit within the unit box, which will represent the universe and have probability one, and the volumes of the boxes inside we can train so that their volume is equal to the marginal probability of that concept, and we can train them
so that the volumes of the box intersections are equal to their joint probabilities, or their conditionals when normalized by one of the marginals. The fact that these represent very crisp probabilistic models has us especially excited, and let me step back to explain why. What do I think of as the two biggest advances in all of machine learning in the last three decades? I think it's been compact representations of joint probability distributions, that is, graphical models, on the one hand, and secondly representation learning, that is, deep learning. And this feels like a step that sits at the intersection of both: we're learning representations, very much like deep learning, but we have a very crisp, formal, compact representation of probability distributions at the same time. All right, so let me give you some intuition for what training looks like. We'll have a bunch of concepts, we'll initialize their box positions randomly, and the training data will consist of some marginal probabilities and some joint probabilities, or maybe some conditional probabilities. Then the model will do gradient descent, gradually moving the boxes around in order to satisfy those training objectives, and then settle into some placement. What is this little demonstration's data? It's actually movie data from MovieLens, where each box corresponds to a movie, its volume is the marginal probability that people like that movie, and intersections correspond to the joint probability that the same person would like both movies. So overlap between boxes indicates that the same kind of people would like both of these movies. The purple rectangle in the background is Forrest Gump; lots of people like Forrest Gump. The two reddish ones are The Lord of the Rings two and three, which have high overlap with each other. There are some Disney movies in the blue off to the left that also have high overlap
with each other, and the narrow bands at the bottom are some Hitchcock movies. So this makes sense. When we compare the accuracy of learning models of this market-basket problem, these overlaps, using various alternatives to boxes, we find that boxes provide some accuracy advantages: what might be considered a nice default, a bilinear model, gets about 83 here, and we're getting 89 with the box model. Okay, boxes are not perfect, though, and they do have some limitations. One of them is that, with just one box per concept, there are some valid probability distributions that those boxes cannot represent. One example is a distribution in which each concept has equal marginal probability, each of the pairs has some non-zero joint probability, but the triple all together has zero probability. It is possible to have such a distribution, but boxes, with their convex shapes, just can't represent it. One way to represent it instead is to have a mixture of box models: essentially we've taken our universe and divided it into two sub-universes, where each concept now has a box on each side. This can learn the constraints; you can see on the left-hand side the green box shrinking all the way to zero, and the weighted sum of the two components does yield exactly the desired distribution. Okay, so now let me talk about some different ways we could think about the dimensions of these boxes. So far we've talked about the dimensions as all corresponding to a single box in n-dimensional space; here are four dimensions that I've drawn separately, and this would correspond to some four-dimensional box, so here's a tesseract representing that. But of course we could also consider dividing the dimensions, such that two of the dimensions correspond to one box, and two of the dimensions correspond to some other box in a
different space. Note that because two boxes, in order to overlap, have to overlap in all of their dimensions, you can think of the dimensions within one box, and the way we calculate overlaps, as calculating a conjunction: if you're not overlapping in any one of those dimensions, then you don't overlap at all. But given that we combine the two different box models with a sum or a weighted average, those act more like a disjunction. So by setting up boxes like this, we're able to represent, in a very native way, a disjunction of conjunctions, which I think of as, logically speaking, a very powerful way to be working. We can also, of course, represent a whole collection of one-dimensional boxes, and we've been doing some work there as well. Now I want to describe some ways that we've been trying to apply boxes, and one of them is common sense. I understand IBM is participating in the DARPA Machine Common Sense program; we're very happy to be in that program also, so we've been thinking quite a bit about common sense. A general kind of common sense we're interested in is being able to take some arbitrary phrase, including phrases we've never seen before, and put it into an LSTM that is trained to output the parameters of a box. Then, given two different phrases, we get two different boxes and can look at their overlap in order to calculate: given that you observe a gray-haired man wearing a tie, what's the probability that you're observing a man in a suit? We would calculate that by the overlap of these boxes. So we took a large number of images from Flickr, threw away the images, and just looked at the captions; there are many different captions that correspond to the same underlying world truth. We took each of these captions, parsed them, and divided them up into pieces, both fine-grained and more broad, so that we can
basically count, for example, how many images that had two dogs also had grass in them. By counting this we can get joint probabilities and marginal probabilities, and then train the LSTM to predict them. So here are some example outputs from this model. Let's see here: given that you observe "holding an instrument," what's the probability that you're "in the basement"? Well, that's pretty low; not impossible, but it would be unusual for somebody to have an instrument in the basement. If you're watching a performance, what's the probability that you've got a group of people? That's pretty high. If you've got an adult in a dance studio, what's the probability that he or she is wearing clothing? Well, that's pretty high, but not one. And then here's one that's just absolutely entailed, and that makes sense as well. What we find, again, is that in comparison with some alternatives, the box methods do well.

Okay, I have five minutes left; I think I should be able to finish this up. As a next step, we've been very excited to begin to think about what it would mean to do... I mean, what I've described so far is a bit like shallow neural networks; it just corresponds to word vectors or word embeddings. What we'd really like is to be doing deep learning instead. The way this is motivated: in the example I gave before, we were given a sentence, and it was represented in terms of vectors as it went into the LSTM, because, you know, LSTMs work on vectors, not on boxes, and we just output a box at the end. But of course it would be nice to capture the notion that a man is a concept at a certain level of granularity, that there are other concepts, and that we know how they relate to each other, using the really nice box semantics that we have, and then to have some sort of model that can operate on boxes and would output a box. That seems like it would give us some better capabilities. We could start with boxes here, I suppose, and then try to turn these boxes into some
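The counting step described earlier (joint and marginal probabilities from image captions as training targets) might be sketched like this; the tiny "dataset" and phrase strings below are invented purely for illustration.

```python
# Sketch: phrases extracted from each image's captions; image-level
# co-occurrence counts give joint and marginal probabilities, which the
# box model is then trained to reproduce. Data here is made up.
from collections import Counter

image_phrases = [
    {"two dogs", "grass", "park"},
    {"two dogs", "grass"},
    {"two dogs", "couch"},
    {"grass", "picnic"},
]

n = len(image_phrases)
marginal = Counter()
joint = Counter()
for phrases in image_phrases:
    marginal.update(phrases)
    # count each unordered pair of co-occurring phrases once per image
    joint.update(frozenset({p, q}) for p in phrases for q in phrases if p < q)

def prob(phrase):
    """Marginal probability of a phrase over images."""
    return marginal[phrase] / n

def cond_prob(p, given):
    """P(p | given), estimated from co-occurrence counts."""
    return joint[frozenset({p, given})] / marginal[given]
```

On this toy data, P(grass | two dogs) comes out to 2/3: two of the three "two dogs" images also mention grass.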
sort of sufficient statistics, or some statistics of boxes, that could be vectorized and stuck into the LSTM. But there are so many different possible statistics that would try, but probably fail, to capture the geometric properties of these boxes that I'm not optimistic about that working. What we'd really like is some mechanism for doing box-to-box transformations. For that we probably want something that corresponds to multiplication and addition, just like we have for vectors, so what we're looking for is: what sort of operations would form a commutative monoid on boxes? Operations that are closed, associative, and have an identity element, things like that.

For multiplication, it seems like intersection is a reasonable choice, and there certainly is an identity element here: the full box gives you back the input. There is some question about what to do when two concepts are disjoint; what is their intersection? Well, one thing you can do is calculate something like the negative intersection: the space between them is like a negative box, and that negative box has a center, and you can just say that the intersection of the two will be a zero-width box at the center of that negative box. There are other alternatives, but that's what we've been thinking about lately.

We also need a way to do addition. We've been thinking about this as if the two boxes represent two different distributions. What does it mean to add two distributions? There's a weighted sum of those distributions, which in a way looks like this, but we want to represent it with a single box in the output, so we'll find the single box that best approximates the weighted average of the two input distributions. One way we could think about getting that single approximating box is by taking an average of side lengths and centers. That's okay, but it seems a little
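One concrete reading of the box "multiplication" just described is sketched below; the zero-width fallback for disjoint boxes is the choice mentioned in the talk, while the (mins, maxs) representation and names are mine.

```python
def box_product(a, b):
    """Intersection of two axis-aligned boxes; when they are disjoint in a
    dimension, collapse to a zero-width box at the center of the gap (the
    "negative box") between them."""
    mins, maxs = [], []
    for lo1, hi1, lo2, hi2 in zip(a[0], a[1], b[0], b[1]):
        lo, hi = max(lo1, lo2), min(hi1, hi2)
        if lo > hi:                    # disjoint: the gap runs from hi to lo
            lo = hi = (lo + hi) / 2.0  # its center becomes a zero-width box
        mins.append(lo)
        maxs.append(hi)
    return mins, maxs

# The identity element for this "multiplication" is the full universe box:
FULL = ([float("-inf")], [float("inf")])
```

Intersecting with `FULL` returns the input unchanged, and intersecting, say, [0, 1] with [3, 5] yields the degenerate box at 2, the midpoint of the gap.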
bit odd that this very small box has just as big an influence on the center of the output box as this large box does. So maybe we should do a weighted average instead, say a weighted average of the centers, in order to get something like this. And, let's see here, the identity element here is just the zero-width box, which gets you back the other input. Okay, so those are some things we can do here.

Our parameters will then look a lot like the parameters of a neural network: they're arranged in a matrix, but instead of a matrix of numbers, it's a matrix of boxes. And so it's nice to think about what these operations look like. One question is: can we, just as in matrix algebra, have some setting of the parameters that takes the input and makes the output identical to the input? And just as with regular matrices, we can define box parameters like this which exactly yield our inputs; that seems like a nice property. We can pull out a single box, with something like this that pulls out this box to match here. We can have a one-hot embedding that says: pull out one stripe of my parameter matrix, by turning on this box and zeroing out all the rest; and that works the way we would expect.

Another nice property to have (I really have to wrap up) is this notion, which you also have in deep neural networks, that one stripe of the parameter matrix corresponds to a prototype in input space, and if the input is near that prototype, then the hidden unit corresponding to that stripe lights up strongly. That's a property we would like. So here's an input, and here's a parameter layer that matches it exactly; in a way this layer is a prototype that matches this input exactly. So what would that look like? I guess first we would do the intersections here and get this, and then we
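A volume-weighted version of the "addition" just described might look like the sketch below. The talk only specifies weighting the centers and that the zero-width box should act as the identity; weighting the side lengths by volume as well is my own choice to make that identity property hold, so treat this as one plausible reading rather than the actual model.

```python
def vol(box):
    """Volume of a (mins, maxs) box."""
    v = 1.0
    for lo, hi in zip(*box):
        v *= max(hi - lo, 0.0)
    return v

def box_sum(a, b):
    """Single box approximating the weighted average of two box
    distributions. Weighting BOTH centers and side lengths by volume is an
    assumption here; it makes the zero-width box the identity element."""
    wa, wb = vol(a), vol(b)
    if wa + wb == 0.0:
        return a  # both degenerate; arbitrary choice
    def wavg(xs, ys):
        return [(wa * x + wb * y) / (wa + wb) for x, y in zip(xs, ys)]
    centers = wavg([(lo + hi) / 2 for lo, hi in zip(*a)],
                   [(lo + hi) / 2 for lo, hi in zip(*b)])
    sides = wavg([hi - lo for lo, hi in zip(*a)],
                 [hi - lo for lo, hi in zip(*b)])
    return ([c - s / 2 for c, s in zip(centers, sides)],
            [c + s / 2 for c, s in zip(centers, sides)])
```

Under this weighting a tiny box no longer drags the output center toward itself, and adding a zero-width box returns the other input unchanged.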
would do the weighted sum together here, which yields this. But this is not maximally active; it's not filling the space, so it's a little bit broken. One thing we could do is say: well, you matched the layer parameters here exactly, so let's treat the layer parameters as if they were the full universe, in each one of their spaces, and scale up each one of these things according to the layer parameters. But this has some problems, which I think I've run out of time to explain, with apologies. There's a very nice alternative that Michael came up with, in which we sum together the layer parameters and use this shape to scale up the result of the dot product. This gives exactly maximal activation for a match of the prototype, and also satisfies other nice properties that we want for the rest of the model to work; I'm sorry that I've not had time to explain it.

So these deep models for boxes, we hope, will allow models to make use of relations that are difficult to model with vectors alone. They'll be helpful in modeling multi-relational knowledge bases. We're hoping that, in addition to providing some probabilistic semantics deep in the middle of the network, they may also be more interpretable, although we have yet to see that; it's just one of our hopes. And with that I think I'll close, without time to talk about my last musings, and I'll take questions. Thank you so much.

Yes, I think so. Negations are, I think, tremendously difficult to do reasoning with, in symbolic logic as well as in other places, and I can't claim that we're poised to do much better here. You can clearly say what region of space corresponds to a negation, but the shape of that negation is not a box itself. That said, you can condition on negations and still often efficiently calculate the volume to get an answer to a probability, and what that means is that you're going to calculate the volume of something that's not exactly a box
shaped, but it's still so simple that it's pretty easy to calculate its volume. So I think that's a partial help, but there's more to be done. Thank you for the question.

I think so too; that's a wonderful insight. I've also talked with some people in machine translation who have said that one of the common errors made in translation is that the word produced in the new language is related to the input word but is somehow at the wrong level of granularity: the input said "athlete" and the output said "runner," or something like that, and it was just too specific. Because the vectors weren't directly capturing a notion of granularity, the model made a mess there, and maybe something like this could be helpful. Thank you.

I'm so glad you asked; this is a common kind of question. It is sort of the case that you can represent the boxes in terms of some vectors, and in fact we do internally represent them as the position of the center of the box and each of the side lengths. But the magic of what goes on here is the kind of geometric reasoning that's done in each one of the processing steps, which is not equivalent to just doing, say, products or other typical vector operations. That's where I claim the improvement is. Thank you.

This is another great question; thank you so much for asking. There has been far too little work on how to represent, in knowledge bases, facts that may change over time. Gerhard Weikum has done a little. I'm very happy to have a new PhD student who just finished his master's with Partha Talukdar at IIT Bombay, who has done a bit of work on this with him; actually, we're looking at trying to make some next steps. Let me just say, before I run out of time, one approach that we're now working on, and that I'm quite excited about, that builds on boxes: take one dimension of the box and say that's a time dimension. Then boxes can represent extents in time; they can represent an extent during which something was true, and the areas in
time in which it was not true. Now, that doesn't directly answer your question, I think, about how we say that... there's a lot more that I could say about your question, and apologies that I don't think I've fully answered it, but I think I've run out of time. Thank you for your questions.
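The closing idea of reserving one box dimension for time can be sketched in a few lines; the helper, the concept coordinates, and the years below are all invented for illustration, not taken from the talk.

```python
# Sketch: a fact box whose LAST dimension is time, so the box encodes both
# a concept region and the interval during which the fact held.

def holds(fact_box, concept_point, year):
    """True iff `concept_point` lies inside the concept dimensions of
    `fact_box` and `year` lies inside its time dimension (the last one)."""
    mins, maxs = fact_box
    coords = list(concept_point) + [year]
    return all(lo <= x <= hi for lo, x, hi in zip(mins, coords, maxs))

# e.g. a fact whose concept region is [0.2, 0.8] and which held 1990-2000:
fact = ([0.2, 1990.0], [0.8, 2000.0])
```

Querying outside the temporal extent then fails containment, giving a direct geometric reading of "true during this interval, not true outside it."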