Hello, I'm Brent Halpern, Scientific Director of the AI Horizons Network, and this is our weekly seminar series. Today we are lucky to have Amrita Saha from IBM Research in India with her TACL 2019 paper, Complex Program Induction for Querying Knowledge Bases in the Absence of Gold Programs. Amrita has been a research engineer at IBM Research since 2012, and she has a degree in CS from IIT Bombay, also from 2012. And without further ado, Amrita, take it away.

Sure. Thanks for the introduction. So as you mentioned, the title of the work is Complex Program Induction for Querying Knowledge Bases in the Absence of Gold Programs. It's a work done in collaboration with the Indian Institute of Technology Bombay.

Okay, so first let's understand what question answering is. It's a core problem in interactive AI, and almost every one of us has probably come across it in various flavors. It can be question answering over unstructured documents, over knowledge graphs, or both, and in different modalities like text, image or speech. What we are working on in this particular problem is question answering over knowledge bases, with text as the modality, where we look at questions that can have multiple levels of complexity: for example, what is Obama's birthplace? Or who voiced Meg in Family Guy? Or how many countries have more rivers or lakes than Brazil? So these questions have different levels or flavors of complexity in them.

First, let's understand the state-of-the-art ways of handling such question answering, especially in the neural space. If we have a question, say what is the capital of India, one of the state-of-the-art neural models is the memory network. What it does is first encode this query into some vector, call it Q. Then it finds a memory which is relevant to that query; the memory is a subgraph of the knowledge base that is relevant for answering the question, with tuples like India has capital Delhi, India has Prime Minister so-and-so. What the model learns is really an attention over these memory entries, and from that attention it finally computes the answer, say Delhi.

Now let's understand the limitations of such a network. If I give a more complex query like how many rivers originate in China but flow through India, there are multiple levels of issues. One is that we have to understand which subgraph of the knowledge base is relevant for answering the question, and sometimes that subgraph can become very large because of the kind of question. The second is that a simple attention over the memory entries will not suffice. The first problem can probably be solved through better entity, relation or type linking in the query, and we largely set that problem aside in the current work. What we look at is how we can perform operations over these memory entries, and not just a simple attention over the memory.

In other words, if I have this complex question, how many rivers originate in China but flow through India, can I break it down into a sequence of steps, where each step is a complex operation over subgraphs of the knowledge base? The first step is to find the set of rivers that originate in China. The next step is to find the rivers which flow through India. The third step is to take the intersection of these two sets. And the fourth step is to count the set size.
And this gives the final answer, say four. So what we did there was really modular-style question answering. Let's understand what a modular style is versus an end-to-end neural model. In traditional neural representation learning, you have an input, and there is a model which learns an end-to-end task through a complex sequence of nonlinear functions; there is no intermediate input or output that is interpretable, and the model finally spits out some answer. It's not always easy to understand whether the model is simply copying or memorizing. In contrast, in the modular paradigm of learning, given an input, the model is modeling this complex task through a sequence of simpler tasks: the input is taken through module one, module two, module three, module four to the final output. But as a by-product of this modular approach, you also get intermediate outputs, output one, output two, output three, which are actually interpretable by users as well. This kind of modular approach is called neural program induction.

So let's understand what neural program induction is. It was first introduced in 2016, and one of the classic problems it was applied to was the bubble sort algorithm, where it learned bubble sort by seeing the execution trace of that algorithm. For example, if I have the execution trace which says first you do a reset, then you do an lshift, then you move the pointer to the left, the model tries to learn the algorithm as a sequence of actions, where each action is an operation applied to a particular set of operands. Essentially, the model is learning a complex task by reducing it to a sequence of simpler steps.

Now, if we have to do the same thing for KB-based question answering, what would that involve? Say I have the question, how many rivers originate in China and flow through India? My first step would be to write the program. First I write an operation that generates the set of rivers which have origin China. Then I generate the set of rivers which flow through India. Then I take the intersection of these two generated sets, and finally I do a count over that set. That is the step of actually writing the program. Then there is another step of executing the program: you now look at the knowledge base and its facts, compute the value of A, the value of B and C, and finally you get the answer.

So what kinds of problems has neural program induction, or NPI, been applied to? They are either algorithmic, where you learn an algorithm for sorting or adding numbers, or math word problems with relatively simple basic operations, or navigating a physical environment like a grid world, or simpler question answering settings where you have to do question answering over a small database table. And it has also been applied to some kinds of multi-hop inferencing over large-scale knowledge bases.
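To make the river example above concrete, here is a minimal Python sketch of writing and executing such a program over a toy knowledge base. The operator names (gen_set, set_intersect, set_count), the triples and the resulting answer are all illustrative assumptions for this sketch, not the paper's exact operator vocabulary or data.

```python
# A minimal sketch of a KB-QA program of the kind described above.
# The toy knowledge base, operator names and entities are illustrative only.

# Toy KB: (subject, relation, object) triples.
KB = {
    ("Brahmaputra", "origin", "China"),
    ("Brahmaputra", "flows_through", "India"),
    ("Indus", "origin", "China"),
    ("Indus", "flows_through", "India"),
    ("Ganges", "origin", "India"),
    ("Ganges", "flows_through", "India"),
}

def gen_set(relation, obj):
    """Generate the set of subjects connected to `obj` via `relation`."""
    return {s for (s, r, o) in KB if r == relation and o == obj}

def set_intersect(a, b):
    return a & b

def set_count(a):
    return len(a)

# "How many rivers originate in China but flow through India?"
A = gen_set("origin", "China")           # step 1: rivers originating in China
B = gen_set("flows_through", "India")    # step 2: rivers flowing through India
C = set_intersect(A, B)                  # step 3: intersection of the two sets
answer = set_count(C)                    # step 4: count the set size
print(answer)                            # 2 on this toy KB
```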
But in all of these prior NPI applications, the model has been tested in a very constrained setting: either the gold program is available, or some kind of program sketch is available, like a pseudo-code saying which set of operations you want to call, though perhaps not the instantiation of the variables at every step; that is what is called a program sketch. So some kind of supervision about which program to call is usually available. And thirdly, even when it is a question answering setup, these works do not look at complex sequences of operations over subsets of knowledge bases; it is mostly multi-hop inferencing over a knowledge base. This is why it's not very realistic and not very practical, because in reality you will not get gold program supervision at large scale: you will have to deal with probably only the answers as your supervision, that is, distant supervision.

So what we aim at is a complex question answering system which can answer questions from a large-scale knowledge base having millions of facts, where the question answering process may require a multi-step program with, say, up to 12 lines of code. These programs deal with complex inferencing, like logical, quantitative or comparative inferencing over the knowledge base or over subgraphs of the knowledge base; I'll give more examples of what this logical and quantitative inferencing looks like. More importantly, here the gold answer is the only, sole, distant supervision, and there is no gold program or program sketch available during training.

First, let's understand what comes closest to this kind of work. There is a recent work called Neural Symbolic Machines, again on KB-based question answering. Say you have the question, who is the president of the European Union; it can be decomposed into a complex structure which requires multi-hop inferencing with some constraint. First it does an entity annotation to get the entities that are in the query. Then it decodes the program token by token: first it decodes a parenthesis, then an operator called hop, then an argument called R1, which is really one of the entities, and another argument called governing-officials, which is one of the relations. It keeps decoding this way, and once it decodes a closing parenthesis, the result gets added to the key-variable memory. So it decodes the entire program token by token, and if we just write it out, it first did a hop operation, then a filtering to accommodate the constraint, and then another hop for the multi-hop inference.

But what are the limitations of such an architecture? First, it decodes the program in a token-by-token fashion, so it can possibly generate syntactically wrong programs. Also, because of the same micro-level decoding, it cannot incorporate high-level programmatic rules or constraints in the process, and this can end up generating semantically wrong programs. The only way it can prune out these bad programs is post-generation: it first has to actually generate the program, wasting a lot of exploration in doing so, and only then does it prune a lot of these programs out.

So with this understanding, let's see what kinds of datasets we want to look at, and what kinds of complex questions.
One of the datasets is a recently introduced dataset called Complex Sequential Question Answering (CSQA). It has around one million questions which are answerable from Wikidata, a knowledge base having around 50 million facts. The most interesting part is that it has diverse classes of reasoning and complex questions over these knowledge base facts, and the then state-of-the-art models had very poor performance on it. To understand the levels of complexity: it has logical reasoning based questions, for example, which rivers flow through India but do not originate in China? It also has quantitative reasoning questions, like how many countries have at least a given number of rivers or lakes, or how many countries have more rivers or lakes than India, which involves some kind of comparison. So this is one of the datasets we showcase our model on.

The other dataset is the WebQuestionsSP dataset. It's a fairly popular KB-based question answering dataset of around 5,800 questions, and here the complexity is really in multi-hop inferencing, sometimes with additional constraints. For example, for the question, who was the president of the European Union in 2011, I'll really be dealing with multi-hop inferencing over multiple relations, and also with one or more temporal constraints.

Now, with this understanding, let's see what the model is really about. We call the model Complex Imperative Program Induction from Terminal Rewards. Just to break it down: its aim is to generate complex programs, the style of the programs is imperative rather than functional, and the only distant supervision the model gets is a terminal reward based on whether the answer was correct or not; there is no gold program supervision during training. In short, we call it CIPITR.

Before we go into the CIPITR pipeline, let's first understand what a program induction pipeline incorporates. First you have a natural language query, for example, how many rivers are there in India. It goes through a preprocessing component, which gives out the annotated form of the query: specifically, the entities in the query, the relations, the types, and any integers in the query. So here it has annotated the query and given out India and river as the query annotations. This goes into the programmer, which takes the knowledge base and the annotated query and gives out the generated program: first it generates the set of rivers, and then it counts that set for this question. In doing so it updates the memory to add new variables. This is then passed into the interpreter, which takes the knowledge base and the program, executes the program on the knowledge base, gets the answer, compares it with the gold answer, and feeds the reward back.

To state clearly what we are doing here: we are not doing anything regarding the preprocessor. The preprocessor is an oracle preprocessor and it gives out the gold annotation of the query. Our contribution is only in the programmer and the interpreter, and that is what we call CIPITR. A rough sketch of how these three components fit together is shown below.
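The following is a rough, hypothetical sketch of how such a preprocessor, programmer and interpreter could be wired together. The function names, the annotation format, and the program representation are assumptions made for illustration, not the actual CIPITR code.

```python
# Illustrative wiring of the three pipeline components described above.
# All names and data formats here are assumptions, not the paper's implementation.

def run_pipeline(query, kb, gold_answer, preprocessor, programmer, interpreter):
    # 1. Oracle preprocessor: annotate entities, relations, types, integers.
    annotation = preprocessor(query)      # e.g. {"entities": ["India"], "types": ["river"], ...}

    # 2. Programmer: write a multi-line program, adding new variables to its memory.
    program = programmer(annotation, kb)  # e.g. [("gen_set", "India", "river"), ("set_count", "A")]

    # 3. Interpreter: execute the program against the KB, compare with the gold answer,
    #    and return a reward (e.g. Jaccard similarity between answer sets).
    predicted = interpreter.execute(program, kb)
    reward = interpreter.reward(predicted, gold_answer)

    return program, predicted, reward
```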
So first let's understand the core components of CIPITR. A CIPITR program is defined by operators and variable types, so let's look at those first. Variable types can be KB artifacts like entities, relations or types; basic data types like integers or booleans; or more complex data types like sets of entities, or maps between one set of entities and another set of entities, and so on. In terms of operations, there can again be multiple kinds: set operations, where you generate a set or perform operations over sets; map operations, where you generate a map or perform operations over it; or selection operations, like in SQL, where you select certain subsets that have certain characteristics.

To give concrete examples: a gen_set operation generates a set corresponding to a specific entity, relation and type. If I want to generate the set of rivers which flow through India, I would call gen_set with the entity India, the appropriate relation, and the type river. If I want to list, for every country in the knowledge base, its set of rivers, I would create a gen_map, which maps every country to its set of rivers. With the selection operation, I can select entries of the map which satisfy some given condition. And there are also verify operations, and other operations which indicate whether the process has terminated.

What are the kinds of hyperparameters that CIPITR has? The number of operators, the number of variable types, how many variables of each type to store in memory, how many arguments each operator takes, the dimensions of the key embeddings, how many operators and argument variables to sample, and the maximum number of time steps in a program.

Next, let's understand the building blocks of the CIPITR pipeline. There are three kinds of matrices. One is the usual set of embedding matrices, in a slightly different flavor here: they embed the fixed vocabulary of operators and variable types. For operators, we have two kinds of embeddings, a key embedding and a value embedding. This kind of key-value embedding has been applied in various settings, like memory networks, wherever we need to do some kind of addressing or lookup: the key embeds the addressing or lookup scheme, and the value embeds the semantics of the actual vocabulary entry, in this case the semantics of the operator. In addition, we have the value embeddings corresponding to the variable types; we don't need key embeddings for the variable types because no lookup is required there. The next matrix is the operator prototype matrix, which stores the static variable-type information for the different arguments and the output of each operator. The third matrix is the memory matrix, which is query-specific: it is a scratch memory for storing newly generated program variables as and when they get generated. There, too, we need to look variables up, so we need both key and value embeddings: the key embeds the addressing scheme for lookup, and the value embeds the actual value of the memory entry. We also need something like an attention map over the program variables in the memory; all of this is required when we actually sample variables from the memory.
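Before walking through the pipeline, here is a small illustrative sketch of the map-style operators described above: a gen_map and a SQL-style select. The toy triples, operator names and signatures are assumptions made for the example, not the exact operator API of the paper.

```python
# Illustrative sketches of the gen_map and select operator families.
# The toy KB and operator signatures are assumptions for this example.

KB = {
    ("Brahmaputra", "flows_through", "India"),
    ("Ganges", "flows_through", "India"),
    ("Nile", "flows_through", "Egypt"),
}

def gen_map(relation):
    """Map every object of `relation` to the set of subjects connected to it,
    e.g. every country -> the set of rivers flowing through it."""
    result = {}
    for s, r, o in KB:
        if r == relation:
            result.setdefault(o, set()).add(s)
    return result

def select(mapping, condition):
    """Keep only map entries whose value set satisfies `condition` (SQL-style selection)."""
    return {k: v for k, v in mapping.items() if condition(v)}

country_to_rivers = gen_map("flows_through")                     # {"India": {...}, "Egypt": {...}}
at_least_two = select(country_to_rivers, lambda rivers: len(rivers) >= 2)
print(at_least_two)                                              # {"India": {"Brahmaputra", "Ganges"}}
```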
So now let's look at the CIPITR pipeline. Say we have the query, how many rivers originate in China and flow through India and Bangladesh. The first thing that happens is the query gets encoded into a vector, and that vector gets fed into something called an environment RNN. This environment is very similar to the environment a reinforcement learning agent faces: it encodes the external environment that the agent sees. The environment vector for time step one gets fed into another RNN, which encodes the program state, that is, the state of the agent at that point. That gets fed into a network called the operator sampler, which first samples the operator. The sampled operator goes into a network which looks up the operator prototype and works out what the argument types and the output variable type are. This gets fed into the argument variable sampler, which samples the argument variables and also computes the sampling scores. This generates the actual output variable of the corresponding type, along with its key and value embeddings, and the new variable is written into the memory. Next, the value of the generated variable is fed back into the environment RNN to update its state, which in turn updates the state of the program RNN, and the same sequence is followed for time step two, three, four, and so on until the end of the process.

Now let's look at an actual example. For this query, the operator sampler first generates the set of rivers which originate in China and populates the memory with this variable A. The next step generates another set, the set of rivers flowing through India, which becomes a new variable B, and all of these keep getting added to the memory. Similarly, the next step gives the set of rivers which flow through Bangladesh. Finally, the objective would be to take the intersection of the set variables created so far. In this way the model keeps generating new variables, populating the memory, and writing out different programs.

After having written the program, which we can call an action sequence a_0 ... a_T, how does training happen? The action sequence, along with the gold answer, goes into the interpreter, which produces the reward that is fed back. The objective is the standard REINFORCE objective, where you maximize the expected reward-weighted likelihood of these action sequences. In our case the reward was the Jaccard similarity between the predicted and the gold answer, where these were mostly sets of entities, sets of integers, or booleans.

Another trick during training is to do beam search, so that we can get multiple feedback signals for each of these expensive forward passes. That really means sampling multiple operators and argument instantiations at every time step. We might sample, say, n_p operators at the first time step and n_b argument instantiations for each of those operators, and finally subsample K actions out of those n_p × n_b actions; that is the beam sampler. At time step two, everything gets multiplied by K, but at the end you still subsample K actions. This keeps continuing, and at the end you get K possible programs.
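As a rough illustration of this training signal, here is a minimal sketch of a Jaccard terminal reward and a REINFORCE-style loss for one sampled program. The function names, the optional baseline term, and the use of PyTorch are assumptions made for the sketch, not the paper's implementation.

```python
import torch

def jaccard_reward(predicted, gold):
    """Terminal reward: Jaccard similarity between predicted and gold answer sets."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

def reinforce_loss(action_log_probs, reward, baseline=0.0):
    """REINFORCE objective for one sampled program (action sequence a_0..a_T):
    maximise E[R * log p(a_0..a_T)], i.e. minimise -(R - b) * sum_t log p(a_t).
    `action_log_probs` is a list of scalar tensors, one per decoded action;
    the baseline b is an assumed variance-reduction term, not from the talk."""
    return -(reward - baseline) * torch.stack(action_log_probs).sum()

# Usage sketch: for each of the K beam-sampled programs, execute it with the
# interpreter, compute the Jaccard reward against the gold answer, and sum the
# per-program REINFORCE losses before backpropagating.
```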
So with this, let's understand the challenges. One is that at every time step we have, say, 20 possible operators; and if we have three variables of each type and three arguments per operator, that gives around 540 possible combinations of operators and argument variables, that is, 540 possible actions. That leads to around 10 to the power of 19 possible programs if you take the number of time steps to be 7. So essentially it's an exponential program space, and compounded with that you also have extreme reward sparsity: only a few possibilities lead to positive reward, and specifically in the case of integer or boolean answers, there is no partial reward. Together, these two problems lead to high variance and poor local minima, which cause learning instability.

How does CIPITR overcome these challenges? One is by searching the space intelligently, only over the semantically correct programs. The second is by mitigating the reward sparsity by learning in a curriculum learning fashion. And the third is by biasing some of the beams in the beam search towards good or longer programs. For the first, finding the semantically correct programs, we incorporate schema- or task-based constraints as well as generic high-level programming constraints or styles. For the reward sparsity, we incorporate additional auxiliary rewards which also indicate whether the answer type is correct: it is possible that the answer is wrong, but the model should first learn to generate an answer of the correct type, a meaningful answer, and only then learn to generate the correct answer. This is essentially in the flavor of curriculum learning, and we have different ways of biasing the model to do so. In the beam management, we also try to prune beams which end up in bad answers, do some length normalization so that very short beams are penalized, penalize beams based on how many no-op kinds of operations they contain, and also do some stochastic exploration and regularization.

Let's see some examples of what I just said. Say I have to sample operators intelligently: I have a question and a set of operators. I can look at the program schema and find out which of these operators will have valid operands. For example, if I have only a single set variable in my memory, I cannot call the set union operation, because it requires two distinct set variables. Or if I have only one entity, relation and type in the query, I cannot generate more than one gen_set variable with them. This also links to other high-level programmatic rules: the model cannot repeat an action, that is, cannot repeat lines of code, because that would lead to very degenerate programs. All of these effectively add attention maps over the operators, trying to reduce the search space. Further, you can have other programmatic rules where you decompose the program generation into phases: first a retrieval phase, where you gather information from the preprocessed input variables like entities, relations and types and generate variables like sets or maps, and then a second phase, where you operate on these generated variables. For this, the program state goes into a network called the phase change detector, which detects the phase of the program generation, and depending on that another attention map over the operators is computed (a small sketch of how such masks can be built and combined is given below).
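As a small illustrative sketch of this idea, the snippet below builds a hard feasibility mask over the operator vocabulary from the memory state and combines several such masks with the operator distribution by elementwise product. The specific rule, operator names and memory layout are assumptions made for the example, not the paper's exact constraint set.

```python
import numpy as np

def operator_mask(operators, memory):
    """Build a 0/1 feasibility mask over the operator vocabulary from a simple
    schema-style constraint; the single rule below is illustrative only."""
    mask = np.ones(len(operators))
    num_set_vars = sum(1 for v in memory if v["type"] == "set")
    for i, op in enumerate(operators):
        # Union/intersection need at least two distinct set variables in memory.
        if op in ("set_union", "set_intersect") and num_set_vars < 2:
            mask[i] = 0.0
    return mask

def constrained_operator_probs(logits, masks):
    """Combine the operator logits with soft attention maps / hard masks by
    elementwise product and renormalise to get the sampling distribution."""
    probs = np.exp(logits - logits.max())
    for m in masks:
        probs = probs * m
    return probs / max(probs.sum(), 1e-12)

# Usage sketch:
operators = ["gen_set", "set_union", "set_intersect", "set_count"]
memory = [{"type": "set"}]                       # only one set variable so far
mask = operator_mask(operators, memory)          # union/intersect are masked out
probs = constrained_operator_probs(np.zeros(4), [mask])
```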
The third source of constraints is attention over the query words: certain words in the query, like "and" or "or", probably indicate certain kinds of operators, so that also gets added as an attention map over the operators. Lastly, we also bias the operator choice in the last time steps in order to generate an answer of a meaningful type. To do that, from the query we have a rule-based answer type detector; for this question, we understand that the answer is going to be of integer type. Then we do a reverse mapping to understand which operators can lead to an integer-type answer. And there we apply a trick: given the program state, we pass it through another network which determines whether the program is about to terminate, and if it is, this answer-type bias is incorporated into the sampling, which becomes another attention map. Together, the final attention map over the operators is computed as a product of these attention maps, and that, say, gives gen_set as the operator at this time step.

Now let's look at how to sample valid operands. Here also we can do attention over query words, which gives one kind of attention map over the argument instantiations. The next important part is to actually look at the task. The task is grounded in the knowledge base, so whatever we do should be consistent with the knowledge base: we look only at those argument variable instantiations that are consistent with the KB, which gives another attention map. Thirdly, we again look at the high-level programmatic rules which say you cannot repeat an action or repeat a line of code. So again we multiply these attention maps, get a combined attention over the argument instantiations, and that can be used to sample the arguments. All of these are trying to reduce the space to a small subset of only semantically correct or meaningful programs.

With this understanding, let's look at the results. The first set of results is on the WebQuestionsSP dataset. The setting is that you have the gold query annotation, where the entities, relations and types are given as input. We cannot directly compare with the papers that have worked on WebQuestionsSP, because they usually work in the noisy setting where they also have to discover the query annotation. What we compare with is a rule-based question answering model that uses the human-annotated semantic parse form of the query. Every query in WebQuestionsSP comes with this parse form, which says what the central topic entity is, what the inference chain is, that is, the chain of relations to be applied to it, what the exact constraints are, and, if there are temporal constraints, how they should be used. We give this parsed form to the rule-based question answering model, which applies a heuristic to get to the answer. Note, however, that CIPITR has to actually learn these rules, whereas they are hard-coded or handcrafted in the rule-based model. What we see across the different kinds of complex question answering in WebQuestionsSP is that in almost all cases CIPITR performs somewhat better than the rule-based model. In only one of the instances does it perform quite a bit worse, but overall it performs slightly better than the rule-based model, and all of this despite having to actually learn the rules.
So that's an encouraging result. Now let's see what happens on the other complex dataset. By the way, all of these results are percentage F1 scores. Next is the CSQA dataset; here the setting is the same, we have the gold query annotations where the entities, relations and types are given. We compare with the following baselines, which were the state of the art at that point: the neural memory networks and the Neural Symbolic Machines. Before going into the results themselves, a note on the training and test size of this dataset: it is quite humongous, with the test set alone around 150,000 instances overall across the seven types of questions. What we see, comparing the results of these three models, is that CIPITR clearly performs significantly better than both baselines. Yes, in one of the instances the memory network performs better, but let's look at these numbers more closely.

What happens is that the superiority of CIPITR is more pronounced in the more complex query classes, and it performs better than the Neural Symbolic Machines (NSM) across the board; in one instance the memory network performs better than CIPITR. Overall, the generalizability of CIPITR is better. Some more qualitative analysis of what happens with CIPITR on the relatively complex programs: it actually performs two to 20 times better than memory networks and around four to six times better than NSM. The best performance is achieved in one of the hardest query classes, comparative reasoning, where it performs almost 89 times better than NSM. In the other hardest class, however, the comparative count questions, the memory network actually outperforms CIPITR, and it almost looks like a discrepancy. One reason behind that may be that in count questions, where the answers are usually small integer values, there is a lot of bias in the answer distribution, and these memory network models, which are heavily prone to memorization and rote learning, can easily get biased by such a steep distribution. It's possible that just because of that memorization it looks like it's doing well, while it may not really be applying the logic needed to get to the correct answer. That actually becomes clear from its terrible performance on the class of comparative questions: while it performed really well on comparative count, it performed really badly on comparative. So that's one explanation for this discrepant behavior. And lastly, we do see better generalization from CIPITR: it performs around two to five times better than both baselines when trained simultaneously over all the question categories.

Okay, so let's look at some ablation tests. We talked about various different aspects of CIPITR, like how we search only over semantically correct programs, so let's see how each of those aspects contributes to its performance. For this, we took one of the hardest classes, the comparative questions, on which CIPITR has some decent performance. The original performance, with all the features of CIPITR taken into account, was 15.123%. What we did was take away the features one by one from this original setting. For example, we took away beam pruning, and that resulted in around a 5% drop. Then, separately, in a different model,
we took away the auxiliary reward, so that we no longer have the curriculum learning flavor, and that also dropped performance by around 5%. But the model is more affected by some of the other features: biasing the last operator to get answers of a meaningful type, restricting to valid variable instantiations, and the phase change, where we split program generation into two phases, first generating the variables and then consuming the variables in the algorithm. And also the action-repetition constraint, by which we ensure we don't end up with degenerate programs.

Here are some example outputs of CIPITR in comparison with NSM. There may be a lot of data in the table, but the overall summary is that, at least for some of the complex question classes like logical, quantitative and comparative, NSM clearly fails to understand the semantics at times. For example, in the logical question it did not understand that it should do an intersection and not a union. The more striking differences are for the higher-order complex classes, quantitative and comparative, where the programs are even longer. In those cases it not only makes a mistake, it makes a very dumb mistake: it comes up with either a wrong answer type altogether or a very meaningless program. The kinds of errors we see from CIPITR are more intelligent: it might still get to the right answer type, or produce a sensible-looking program trace, and that kind of thing we don't see in the case of NSM.

So that is one of the takeaways, and to summarize: CIPITR shows proficiency over NSM in terms of, first, generating syntactically correct programs by design, because that is how the action space of CIPITR is defined; instead of decoding token by token, CIPITR decodes every line of the program in one go, so it cannot make syntactic mistakes. It also generates semantically correct programs by incorporating these programming rules or constraints, as well as task-specific or problem-specific constraints. And thirdly, by doing all of this it is able to explore the search space, even though it is an exponential search space, more efficiently, using actually half the beam size of NSM.

As for future directions: of course, we have just scratched the surface of this complex question answering problem, and there are various questions and open problems around it. Firstly, CIPITR did not have to deal with noise in the query annotation, since the gold annotation was given as input; so how do we handle that kind of noise in the program input? This we did as follow-up work, and it got published at IJCAI. There are other challenges as well. For example, CIPITR does not really treat this question answering as a strategic search problem; are there ways to incorporate strategy into the program induction? Or how do we generalize to diverse goals when we have to train on multiple types of questions together? What we realized is that even when we did train on all questions together, there was quite a striking performance drop compared to training CIPITR in dedicated settings on the different kinds of questions separately. So there is scope for work on this generalization aspect as well.
We can also look at this as a hierarchical reinforcement learning problem, or through other techniques as well. And lastly, how do we build similar CIPITR-like architectures for complex question answering in other settings, like unstructured documents or the image modality, for example visual question answering or video question answering, or other such tasks? So this concludes the talk, and I can take questions at this point.

Okay, Amrita, thank you very, very much. A reminder to folks: if you're IBMers, this is an open talk, so make sure your questions have no confidential information in them. If you would like to ask a question, please unmute by clicking on the little red microphone at the bottom of your window. Do we have any questions? One that I would have, Amrita, is that program synthesis is not a new topic. The areas I've seen in the past are almost always logic generation, going through a search space of logic- or rule-based programs or set-based programs and using that to generate a high-level program. You may have answered this in the future directions here, but have you done any comparison yet with the, I'll call them classic, techniques, which makes me sound even older than I am?

Yeah, that's right, and it's a very valid question. What we tried to do was compare at least with the rule-based techniques. But as you say, in the space of program synthesis there has been a lot of work, and it may not be constrained to just the kind of rule-based models that we used. So we did not do that comparison, but it really opens up new ideas that we can focus on.

I think it might also open up interesting other applications. Besides generating programs, I could imagine generating test conditions or verification conditions, or even just double-checking a program that's being run for the first time, looking at the answer and saying, gee, this doesn't seem to make sense, you know, like somebody looking over your shoulder. I don't know. Any other questions? Okay. Amrita, thank you very, very much. I appreciated the presentation and found it very interesting. Thanks to everybody for attending. There will be no seminar next week; the next seminar is July 10th, and the announcements should be going out if they haven't already. So again, thanks, everybody. Thank you.