So our first speaker is Salim Roukos from IBM Research AI. Salim is an ACL Fellow and an IBM Fellow. He leads the Mastering Language group at IBM. His group has pioneered many classical NLP methods, from statistical parsing to machine translation, and even came up with the BLEU metric, which won the Test of Time Award at the NAACL conference last year. And yeah, I mean, I can go on and on about Salim, but I will just let him speak about the overview of IBM's efforts in question answering and semantic parsing.

OK, thank you, Avi. I'd like to first thank the organizers for organizing the workshop and doing the best thing possible, which was to invite me. Thank you. I'd like this to be interactive. We are a small group, so please ask questions if you would like to.

So we have an initiative at IBM to push natural language processing research. We call it Mastering Language, and it covers many topics. This is a subset of the topics; you can't list them all, but these are important from a commercial point of view in the shorter term. Identifying user intents for chatbots. Labeling legal documents with obligations and controls, that is, identifying the important things in a legal document. Extracting entities in utterances to improve dialogue: if you say "I want to get a hamburger with lettuce and tomato," getting the components correctly. Extracting mentions, coreference, and relations for text mining from long documents; that's typically for media monitoring and call center analytics, under the rubric of information extraction. Answering questions, from simple FAQs and factoids to richer multi-hop questions, for both enterprise corpora and knowledge bases. Integrating knowledge with semantic parsing and reasoning; this is longer-term research for us, trying to see if we can use knowledge and reasoning to improve NLP systems, and we are trying to do that in the context of question answering. And obviously there are many other areas I'm not listing here. Equally important, these techniques have to work and scale across languages.

The other thing I want to mention is that the NLP conferences are growing geometrically, or exponentially, which is the same thing, but exponentially sounds better, probably from a marketing point of view, so I have to use that word, even though it always bothers me. And the other comment is that the field is very dynamic now. It's very active; lots of people are working on it. Even people who never did NLP are doing NLP, because deep learning made things a lot more straightforward. So things are happening at an incredible pace, to the point that sometimes I hear people say, oh no, not another result, you can't keep up. It's very hard. But I think it's an exciting time. And the other thing I want to make a comment on: you have to keep running. That's always been the case; it's just a bit harder these days.

All right, I'm going to talk about three topics. How much time do I have? 45, right? Starting at? I started at 9:05. OK, just for me to manage my time. So, three topics. One is semantic parsing; I'm going to talk about abstract meaning representation. I know the other speakers are not here, and this afternoon we're going to hear more about semantic representations, so we'll have a discussion when he presents. Then question answering and reading comprehension. And finally, I want to close with a brief overview of a leaderboard we're setting up for reading comprehension in the domain of IT support, which is an enterprise domain.
We need to, from an IBM perspective, solve problems for enterprises, and they have unique challenges, I would say.

All right, abstract meaning representation. Who's familiar with abstract meaning representation? Can you raise hands? OK, very good. So the idea is that it captures some of the meaning of a sentence. All these sentences here, "The boy wants the girl to believe him," "The boy wants to be believed by the girl," and so on, all have the same AMR, the same abstract meaning representation. So that's the precision to which it captures things. And it can be presented as a graph, and I'm going to show you the graph in the next picture, for a different sentence: "The boy wanted to visit New York City." What I want to highlight here is that the boy is the ARG0, or subject, of both "want" and "visit". So you have the same entity with two arcs coming into it, and that captures implicit arguments, or sometimes coreference, as you will see in some other example sentences. So this is the first point I made: a node can have multiple parents. Also, there can be extra nodes in the graph. There are named entities of certain types that we inject into the graph; they're not explicitly present in the sentence, they're implicit. For example, the fact that there is a type "city", which has a name, as opposed to a nominal, and then the specific string "New York City". So there will be extra nodes in the graph that don't have explicit word markers. And finally, there will be extra words in the sentence, like "the" and "to" in this example, and there are more of these. Typically, the function words are never in the graph; they're really in the arcs, implicitly.

OK. So we have been working on this, and a couple of years ago we developed a parser called the stack-LSTM AMR parser; I'll describe the stack-LSTM in a moment. It's a transition-based parser: it predicts a sequence of actions, so it's essentially a sequence-to-sequence model from the words to the actions. The input is plain text, and the output is the AMR graph, which you can compute from the sequence of actions. I want to take a moment to talk about the stack-LSTM. This may be small, but if you look at the picture: push A means that the head of the stack is A, so we push something on the stack. Then push B, and the head is B. Then push C, and the head is C. Then pop, and now the head is B again; it goes back. And now you push on a new path. So this allows you to share histories based on your stack operations, and that turns out to be a useful thing from a modeling point of view. I'm not going to get into the details; there are at least two posters on the subject, one here, and I saw the other one, I cannot see it from here, the angle is too sharp, where we go into much more detail about the parser. But fundamentally, there is the stack on your left, and there is an LSTM that creates a vector to represent the top of the stack. There is the buffer, which is what remains of the input sentence that you have pushed in. And then there is the sequence of actions that you have taken so far. These three LSTMs create a vector representation that's used to compute the probability of the next action. There is a list of actions here, but if you are interested, you can talk to the poster presenters.

All right. Now the key limitation, I would say, or the key assumption, in this parser is that there is a notion of alignment between words in the sentence and nodes in the graph.
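To make the notation concrete before going on, here is roughly what the AMRs for the two examples above look like in standard PENMAN notation; this is a sketch following the usual AMR conventions, not output from our parser. Note that the variable b for the boy has two incoming roles in each graph (the multiple-parents point), and that the city and name nodes for New York City have no word-by-word counterparts in the sentence:

    # ::snt The boy wants the girl to believe him.
    (w / want-01
       :ARG0 (b / boy)
       :ARG1 (b2 / believe-01
                 :ARG0 (g / girl)
                 :ARG1 b))

    # ::snt The boy wanted to visit New York City.
    (w2 / want-01
        :ARG0 (b / boy)
        :ARG1 (v / visit-01
                 :ARG0 b
                 :ARG1 (c / city
                          :name (n / name :op1 "New" :op2 "York" :op3 "City"))))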
So, on this slide: "when" is amr-unknown. You cannot read it, I know. That is how you capture the question; the question word becomes amr-unknown. Then "link and square" is a name, and "complete" is the main predicate at the top. So there are words in the sentence that align to some of the nodes in the graph. And there's another example below; I'm not going to spend time on that.

The way it works is, given a graph and a sentence, you need an algorithm to give you node-word alignments. Once you have the node-word alignments, you can create the sequence of actions that will produce that graph using those alignments. And now the problem becomes: given the gold actions, "gold" in quotes, because there could be errors, right? The alignments may not be correct, but we call it gold. So the gold actions paired with the sentence are the training data for the sequence-to-sequence model that is the parser. Obviously, the quality of the word alignment will have a big impact on the performance of the parser, because the data will be... you missed everything. No, I'm teasing you, I'm sorry. I'm talking about AMR, as you might expect. I figured you might like that.

OK, so given the sequence of gold actions and the sentence, we can build our model. There are many things we worked on, and the poster will talk about some, but I want to highlight that improving the quality of the node alignments had a big effect on performance, and using reinforcement learning also had a big effect on performance. We also used attention, a soft alignment, to help decide how to predict the next action. And we switched to BERT, which obviously is a stronger representation of the words. What you see here is the performance of our baseline system, the original parser. Adding improved alignments gave 1.3 points, adding attention gave another 1.5 points, adding BERT gave 3.3 points, and adding reinforcement learning brought us to 75.5, which at the time was state of the art, because we were comparing to the first paper; but then at that same conference somebody else had an improved system. So we're not quite state of the art, but that's OK. That's the nature of the beast. I have been looking more carefully at the detailed results here, and going forward I expect we will focus more on the SRL performance, because it's pretty low. There are components you can measure: instead of measuring only the overall performance, you can measure subcomponents, and I would say the coreference and the SRL are relatively weak.

So we have a demo of this system, and I'm going to show you its output on some example sentences, just so we see what happens. "The boy wants the girl to believe him." We get something like: want at the top, the boy wants the believing, the girl believes the boy. The ARG0 of believe is the girl, right? So that's good. OK, reasonable. "The boy wants to be believed by the girl." Now, the parser is unpredictable, but it does a good job here, actually; it gets it: boy wants believe, girl believes boy. That's good. Then the third sentence was, "The boy has a desire to be believed by the girl." Unfortunately, here it makes "desire" the predicate, as opposed to "want". So clearly the AMR does not collapse these sentences aggressively enough; we need to add a layer to merge these things when we have strong evidence for that. Then, "The boy is desirous of the girl believing him." Oh my god. It's really bad. It misses that "him" and "the boy" are coreferent. So there is a boy who wants to establish that the girl believes some "he", but it's not the same "he". Not so good.
This one, actually, not so good. "The boy's desire for the girl to believe him": pretty good, except for the desire issue, the desire versus want. By the way, I think we have a long way to go to make these parsers robust. You can give them slightly different sentences and they make a mistake. So there is still work to be done in this space.

Now we want to apply this to enterprise domains, and we started on IT support. As you might expect, IT support is a very different kind of language from the AMR treebank. Even though, I have to say, unfortunately, the AMR treebank uses not-so-good English data, as some of us know, which is unfortunate. It's a historical mistake; I'm not going to name the organization that funded all this research. But nevertheless, it's general-purpose news, whereas here it's more technical: "Event reader stops reading events." "State change value too high." They are different languages, really. So we had to do a bunch of things. There are a lot of imperatives in the tech domain; people say do this, do that, so the subject is implicit, an ARG0 "you". We also added about a dozen concepts, for things like "You should FTP the attached save file"; "FTP" has to be added as a predicate. Or "If unable to telnet, complete the TCP/IP configuration." So there are about a dozen of those; it's not so bad, actually, from a predicate point of view. We added 17 entity types, for things like software, product, file, hardware, computer, operating system. And we had to deal with the distinction between product names, acronyms, and abbreviations. You just need to know what these things mean in order to handle them properly.

So I'm showing here the IAA, the inter-annotator agreement, on the various data sets. There is the original AMR corpus on news and web from the LDC, and the IAA there is about 81%, pretty high. And the system that we have is at 75, so it's reasonably close. Maybe we'll get a few more points, but you're not going to beat the IAA, I don't think, by definition, because the test set is only that accurate. We also tried to do some AMR annotation at IBM. We did SQuAD questions and SQuAD answers; we didn't do a huge effort, we just wanted to see how it works, and our annotators were in the 70s. Then we did this on IBM support data, where we wanted to make a more serious effort, and still they were in the 70s. And you can see that the system performance is much worse on the IT data, because the domain, the language, is different. So that is a robustness issue. It's not only lower, 68 and 65 versus 75, because of the quality of the treebank; there is also a domain issue. So this issue of domain adaptation is a focus item for us.

We did a couple of experiments on domain adaptation, where we showed the performance of the baseline system, and if you add 5,500 IT support AMR sentences, you really improve the performance, from 55 to 64, or 65 with the reinforcement learning. We're now rerunning with RoBERTa, just to see what the performance is; I think we'll end up at probably 66 or 67, which is closer to the 70s. And we're now investing in a second pass on improving the quality of the AMR treebank, the graph bank, to see if the annotators can be more knowledgeable about both the domain and the AMR. I think they have become much better, actually, so I expect we'll have much better numbers next time we measure.

OK, that's topic one. Topic two: question answering. So you have a question and you have a corpus, and typically you stage it like this. First you do document search. Then you do sentence or passage retrieval from the top documents, because you can't afford to do these calculations on all possible paragraphs or sentences. And once you have a set of passages or paragraphs, you do reading comprehension: you find the answer span that answers the question.
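To make this staging concrete, here is a minimal sketch of the three stages. Everything in it is illustrative: the word-overlap scorers are trivial stand-ins for a real IR engine and a neural reader, and the cutoffs of 50 and 10 are assumptions, not our system's settings.

    from typing import List, Tuple

    def overlap(question: str, text: str) -> float:
        # Trivial stand-in scorer: fraction of question words found in the text.
        q, t = set(question.lower().split()), set(text.lower().split())
        return len(q & t) / max(len(q), 1)

    def answer(question: str, documents: List[str]) -> Tuple[str, float]:
        # Stage 1: document search over the whole corpus (cheap, high recall).
        docs = sorted(documents, key=lambda d: overlap(question, d), reverse=True)[:50]
        # Stage 2: passage retrieval from the top documents, since running the
        # reader over every paragraph in the corpus would be too expensive.
        passages = [p for d in docs for p in d.split("\n\n")]
        passages = sorted(passages, key=lambda p: overlap(question, p), reverse=True)[:10]
        # Stage 3: reading comprehension, extracting a scored answer span from
        # each candidate passage (here: the passage's best-overlapping sentence).
        spans = [(max(p.split(". "), key=lambda s: overlap(question, s)),
                  overlap(question, p)) for p in passages]
        return max(spans, key=lambda s: s[1])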
We have built systems around this, and we have a framework to plug and play components; this will be presented at EMNLP. Basically, it allows various groups at IBM to plug in components and see if they can improve things. For example, there is a query reformulation module that feeds the search IR. Then sentence and candidate selection are another component. Then there is an entailment system to do additional re-ranking after the answer selection. And then potentially, as we have more systems, some additional answer-ranking model. That top branch is the system whose demo output I will show you. Also, in parallel, we are trying to see if we can use AMR parsing, convert to logic, and then use logical inference to answer some questions.

OK. So, to show the system in the context of an IT support DB2 use case. I'm sure it's small, but I'll read the question: "When are secondary log files for circular logging freed?" And the top answer from the reading comprehension is that blue delineated line, "when the database is deactivated"; that's the top answer, with a score of 0.37. And the next answer, with a score of 0.28, is actually better, because it adds the conjunction: it's when the database is deactivated, or when the space that they are using is required, by something, I cannot read it. "For the active log file." Thank you; because I'm at an angle, that's hard. All right. This is a very tough query, by the way. This is the kind of query that does very poorly on search engines and things of that type. So just to give you an example of where reading comprehension can add value.

OK. Everybody's familiar with SQuAD? Can you raise your hand? OK. All right. So there's a question, there is a paragraph, and there are three possible answers. You measure the overlap between the system output, which here is "Santa Clara, California", and the three possible answers, and you take the maximum F score of the overlap. Then they realized that these systems are very brittle. I don't know if you're familiar, but if you play with some of these systems, if you change things a little bit, things fall apart. So to improve the research agenda, they added questions that don't have answers, and those questions were constructed on purpose to be misleading, so they are very close to the paragraph. The problem with SQuAD is that both version 1.1 and version 2.0 have this bias that the Turkers see the paragraph and then compose the question. And that's fundamentally not a good thing, because there is high overlap between the lexical choices in the question and the paragraph. But nevertheless, this leaderboard was, I would say, seminal in my mind in terms of encouraging a research agenda and getting everybody going on question answering. So it's really a fantastic accomplishment.
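As an aside, the maximum-overlap scoring just mentioned is simple to state. Here is a minimal sketch of SQuAD-style token F1, taking the max over the reference answers; the official script also removes articles, which this sketch omits, and the reference set at the bottom is illustrative, not the actual SQuAD annotation:

    import string
    from collections import Counter

    def tokens(text: str) -> list:
        # Crude normalization: lowercase and strip punctuation.
        return text.lower().translate(
            str.maketrans("", "", string.punctuation)).split()

    def f1(prediction: str, gold: str) -> float:
        pred, ref = tokens(prediction), tokens(gold)
        overlap = sum((Counter(pred) & Counter(ref)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred), overlap / len(ref)
        return 2 * precision * recall / (precision + recall)

    def max_f1(prediction: str, references: list) -> float:
        # Score against each reference answer and keep the best match.
        return max(f1(prediction, r) for r in references)

    print(max_f1("Santa Clara, California",
                 ["Santa Clara", "Santa Clara, California", "Levi's Stadium"]))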
In the context of the unanswerable questions: the paragraph says Victoria has about 60 full-time teachers, and the question was, how many full-time janitors does Victoria have? And the top systems lose about 20 points. You know, if you build a SQuAD 1.1 system and run it on SQuAD 2.0, you will not do very well.

And the progress is shown here. On the left is SQuAD 1.1, with time on one axis and the exact match metric on the other. The red curve is the best performance at each point in time, so that's the state of the art, and the others are all the attempts. It was a very active leaderboard, with a lot of submissions, and the state-of-the-art curve moved fairly quickly. This is a span of a year and a half, roughly; even though it says two years here, because there are a couple of more recent submissions, most of the progress happened in a year and a half. And on SQuAD 2.0, progress was even faster, I would say, because very quickly people developed techniques for robustness. So these leaderboards have been very, very effective at speeding up progress.

So Google recently released the Natural Questions challenge. This one is very appealing, because these are questions from real users, so there is not that observation bias of seeing the answer. And it has 300,000 training questions, so it's a huge set. The specific way they set it up is that you are given one page, and then you have to decide what the answer is, if there is an answer. There is the concept of a short answer, which is typically three or four words, and a long answer, which is something like the paragraph containing it. But not all questions have both, because some answers come from tables, where there is no long answer. There are some binary questions, do birds fly, do humans fly, and there are questions that don't have answers at all.

So this is the leaderboard, and we have built a system for this. We used a number of techniques, and there is a poster on this somewhere. I would highlight the ensemble: what I'm showing here are some of the results for one version of the system. Things change; a month later, everything is different. But at least a month ago, the ensemble was getting us about two points on the short answers and 2.7 points on the long answers; that's what this parenthesis gives you. Data augmentation was also helpful at that point. Subsampling the negative examples was also helpful. And we tried different pre-trained models and fine-tuning, starting from SQuAD, from SQuAD 2.0; we have different techniques to do pre-training, so there are a variety of ways of getting initial models. They typically helped, but they hurt on the long answer. I have to say, we have not really done work on the long answer; we just have a convenient way of doing it, which may or may not be good. I mean, we know it's not a good thing, so we need to improve it. But at this point, our focus has been on the short answer. And then attention-on-attention, which I believe Lynn will talk about. Is Lynn here? Hey, nice to meet you. I have worked with Lynn for four months and hadn't met him before. So he will talk about it. And there is, as I mentioned earlier, a paper at EMNLP that describes this system, a demo system in that case.

OK. So the leaderboard on Natural Questions follows the same idea. The entries in the square boxes are the state of the art at each point, and they stay there until somebody improves on them. You can see lots of submissions that are a little bit below the state of the art; those are the blue dots, compared to the red dots. Our system, actually, at this point is still at the top of the leaderboard for short answers, and has been for the last six weeks, I guess two months now, roughly. It's not as active as I thought it would be, but we'll see if that changes when the next conference deadline comes up. I think then we'll potentially see some activity. OK.
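Before leaving Natural Questions: the negative-subsampling trick mentioned above is easy to sketch. Training sets like this typically contain far more unanswerable (question, candidate) instances than answerable ones, so one common fix is to keep all the positives and only a random fraction of the negatives. The keep ratio below is an assumption for illustration, not our actual setting:

    import random

    def subsample_negatives(examples, keep_ratio=0.1, seed=0):
        # Keep every answerable instance; keep each unanswerable one with
        # probability keep_ratio, so positives are not drowned out in training.
        rng = random.Random(seed)
        return [ex for ex in examples
                if ex["has_answer"] or rng.random() < keep_ratio]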
So you've built a system, but now you go to an enterprise, and they have different kinds of questions: tech support, legal. So domain adaptation is the issue. To learn a little bit about domain adaptation, we took MS MARCO, which is another reading comprehension task, and we extracted subdomains from it. The domains were biomedical, computing, film, finance, law, and music, and this is how much data we extracted for each domain. The biomedical domain is really large and unrealistic, 20,000 questions; you're not going to get that in an enterprise setting. The others are more reasonable, even though I believe for enterprise it has to be in the hundreds, as opposed to the thousands. It's hard to read, but I'll try to read an example question for you: "What is the civil forfeiture law?" And the answer happens to be the first sentence in that long paragraph: "Civil forfeiture laws represent one of the most serious assaults on private property rights in the nation today." So that's the answer in MS MARCO.

So the group who did this work tried two different reading comprehension systems trained on SQuAD. If you take those systems trained on SQuAD data and run them directly on MS MARCO, the performance is 52%. If you train from scratch for the biomedical domain, which had a lot of data, 20,000 examples, you get to 66%. So clearly, the learning on SQuAD by itself does not help. But if you do fine-tuning, so you benefit from the SQuAD system but fine-tune it on the domain, you get 72%, and that's the best of the variations on this theme. So clearly fine-tuning is an important element of many domain adaptation solutions.

We also looked at how much data you need to do the fine-tuning. On the biomedical domain, we tried 100 question-answer pairs, 200, up to 1,600, and then 100%, which is all the data: 20,000 for biomedical and 2.6k for computing. And you can see that with simple fine-tuning, nothing more sophisticated, and I think more sophisticated approaches are needed, you get about half of the way there with 1,600 questions on biomedical, compared to 20,000. So it's a significant reduction. And on computing, you get there by... well, it's a very unstable number, so I don't know if I trust any of these numbers, because they fluctuate, you know, around 74%. But roughly speaking, 400 to 800 questions seem to be a reasonable target here. The initial system is pretty good already, actually, so it's like one of those lucky things, I guess. So we are now trying to invest in developing methods that are robust, that work for a variety of domains, and that hopefully can use less data, because I think most customers are in the hundreds, as opposed to the thousands, in terms of question-answer pairs.
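The simple fine-tuning recipe above is easy to sketch with today's libraries. Here is a minimal sketch assuming a HuggingFace-style setup; the checkpoint name is a public SQuAD-tuned model, in_domain_dataset is a placeholder for a tokenized set of a few hundred in-domain question-answer pairs, and the hyperparameters are illustrative, not the values from these experiments:

    from transformers import (AutoModelForQuestionAnswering, Trainer,
                              TrainingArguments)

    # Start from a reader already trained on SQuAD, rather than from scratch.
    model = AutoModelForQuestionAnswering.from_pretrained(
        "bert-large-uncased-whole-word-masking-finetuned-squad")

    # Placeholder: a tokenized dataset of ~400-1,600 in-domain QA pairs (not shown).
    in_domain_dataset = ...

    args = TrainingArguments(
        output_dir="qa-domain-finetune",
        num_train_epochs=2,              # small data: few epochs,
        learning_rate=3e-5,              # conservative learning rate
        per_device_train_batch_size=8,
    )
    Trainer(model=model, args=args, train_dataset=in_domain_dataset).train()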
OK, I'm going very fast. Yeah, OK. So finally, the last topic I want to talk about: we are creating a leaderboard for IT support, precisely to push domain adaptation research. It's very hard to find high-quality question-answer data, and as I said, we expect hundreds of training samples in an enterprise domain, so we need to figure out good techniques for domain adaptation. So we decided to explore finding answers to technical support questions; that's a significant area, with a lot of use cases for companies.

The data set that we're collecting will allow us to do both IR and machine-reading research, even though most of the discussion today is on machine reading comprehension. The big difference is that in tech support, the questions can be complex. Sometimes I talk about the length of the question and the length of the answer: SQuAD and Natural Questions have relatively short questions, 15 words or less, and the answers can be short or long. This domain, you will see, has long questions. The text is longer, and that requires a bit more modeling than we typically do. Also, in this domain, answers can sometimes be in tables. We have about 840,000 technical documents, called Tech Notes. These are FAQ-style problems and solutions for a variety of IBM products. We provide both the HTML web page and a detagged text format, so if somebody wants to work from the HTML, they can, but we also make it easy to use the plain text.

So let's look at an example. Questions come with a title, in this case: "Netcool/Impact 7.1.0: state change value being used by the OMNIbus event reader is too high." Long words. And then there is the body of the question: the value being used is a date and time in the future, and as such is preventing the event reader from capturing the current event. So there is quite a good description here in the body; it's more meaningful. And the answer is the region that is in red. It says the simplest solution is to manually reset the event reader state change value via the GUI: stop the event reader, open it for edit, click the clear state button, exit the editor, and restart the event reader. I think the most important thing here is that there are a lot of words in the question relative to this answer.

Let's look at another example. Here, the title is very brief: "Unable to uninstall Data Studio." And then there is a body which is quite long: we use Data Studio 3.1.1 with DB2 on Windows; while trying to install a new version of Studio, we are able to install it, but unable to uninstall the existing 3.1.1, getting a JVM error. How can we delete it? And the answer is that little red snippet again: please try to uninstall all products, including Installation Manager; then reinstall IM and Data Studio 4.1.1.2. I picked shorter answers because I can read them; it gets long and boring otherwise. But I will show you some statistics here. The questions have a median of about 30 words, and there is a long tail: some questions are quite long, 85 words or more. The answers are about 45 words long at the median, so 50% longer. But again, the distribution spans widely; it's almost uniform.

The way we obtained these questions and answers is that we mined support websites at IBM. On those forums, as you know, somebody asks a question and many people answer, and one of the answers is voted as being right by other people. So we focused on the answers that were voted as right as a source of data to justify the answer. We also required that the post include a link to a Tech Note page, so that we can find the document, and then we asked our annotators to find the answer in the Tech Note. So we are more confident that these answers are correct, because they've been voted correct by other people. Because remember, these are very technical questions, and average humans don't know how to answer them. So that's the way we managed to get vetted question-answer pairs.
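To make the shape of one collected example concrete, here is a hypothetical record; the field names are my illustration only, not the released schema:

    example = {
        "title": "Unable to uninstall Data Studio",
        "body": "We use Data Studio 3.1.1 with DB2 on Windows ...",
        "technote_id": None,  # the Tech Note linked from the accepted forum answer
        "answer_span": None,  # the span our annotators marked inside that Tech Note
        "vetted": True,       # the forum answer was voted correct by other users
    }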
And these numbers might change, but we will probably end up giving, on the leaderboard, 400 questions for training and adaptation, 200 question-answer pairs for development, and we'll keep 500 questions blind for testing. This way we can have a meaningful evaluation. Yeah, go ahead.

[Audience question about the annotators.] These are people we have been using for a number of years for linguistic annotation; for example, they're familiar with information extraction annotation. So they are used to language annotation, not the technical content per se. [Audience] So they're not technical? No, no. It's very hard, actually. We used a couple of people who are systems administrators to answer the questions, and even for them it's hard, because there is a variety of topics here. It requires a lot of expertise to answer these questions, because different products need different kinds of knowledge. So they did not do a whole lot better than our scheme of using the outside world to give us the answers.

OK, and I think I covered this. As I was saying, we hope to have this out over the next couple of months. We expect a setup where we give a question and N Tech Notes, N documents; I think 50 will be the number, but that might change, we'll see. And there is a good chance, I believe it's 50%, that one of the Tech Notes contains the answer, and a 50% chance that none of them does. The goal is to find the answer: basically, the shortest span of text that contains the answer. For the reading comprehension, which is that setup, we will use the QA metrics. But we can also add some IR metrics, because we could ask for the document ID, since you have to pick which one of the pages is the right page. It's not a great IR competition, but it's a useful metric to have. And as I said, in another month or two, maybe, we'll have a leaderboard. I'm looking for people who are interested; maybe we can do a preview. As soon as we have the leaderboard running, people who are interested can try it, and then they can be on the leaderboard early. I'd like people like you to work with us, if you are interested. I think that's all I have. I'll hand this back to you. OK, thank you. So we are going to open it up to questions.

[Audience] I missed the first part, on the AMR parsing. What did you talk about there? Yeah, I mean, basically it was a review of our AMR parser, a flavor of how it works. And were you there when I talked about the IT support domain for AMR parsing? Yes, yes. So that's really the one area where I feel we need a little bit of help.

[Audience] I guess a more serious question would be: you're using these SQuAD kinds of systems, which are end to end, and you're also developing parsers for AMR, which I suppose you use for question answering as well. So what do you see as the disadvantage of each approach, and why do you need to do both? Right. So AMR parsing is a way to compress the information in the natural language into a reduced representation across many instances. That could be a better representation than the word sequence that we use today in our reading comprehension. How to use AMR graphs in a reading comprehension system is a subject of active research by some people in this room; I'm looking at them. So when they get positive results, they will talk about it. That's really one approach to the use of the AMR graph. Another approach that we are also going to explore is that AMR is a bridge to a more logic-based representation of the meaning.
And then you could use reasoning to help you answer some questions based on the AMR. So there are two purposes for the work on AMR. Really, I wouldn't say it has to be AMR per se; it's whatever we can get our hands on, and AMR is clearly the practical point for us at this point.

Since we are here talking about AMR, there is one thing that we noticed that I didn't present. There is the trick of using the inverse :ARG0-of roles to simplify things, and unfortunately, that's not normalized in the AMR bank, which means that if you produce the other version, the :ARG0 structure, you get penalized by the metric. If you normalize, you can improve your scores by one point. We don't do that, per se, but we would like to propose it to become part of the standard at some point. It's a small thing.

[Audience] You had that table for domain adaptation, where you showed that the performance improves with fine-tuning as you add more question-answer pairs. How did you select those question-answer pairs? Oh, that was just random sampling, nothing special. [Audience] Is there a way to employ some other strategies? Yeah, but that's cheating, right? Because the reality is that this is after the fact: you already have a big corpus, so you can pick the right question-answer pairs. But in the real world, you go to a new customer and they're going to give you their 300 questions. So you want the random sample, I think, from the point of view of understanding how much is needed, unless you can come up with a very clever way of deciding what the right question for a human to add is, given the previous questions, which is not practical at this point. I would say this work is preliminary; I am not convinced of the numbers yet. This is just to get a feel for where we are.

We still have time for some questions. [Audience] I guess I'm more generally curious about what it takes to get the number one position on the leaderboard for, say, Natural Questions, as you did. How do you decide when a particular idea you have is worthwhile or is not going to work? Do you do extensive hyperparameter searches over every configuration? Or is it just the case that you had one idea and it magically worked and got the first position? No, no. Realistically, you have a dev set, so you can typically predict, sort of, how much an idea is worth. I mean, it's not guaranteed, because sometimes we think we have something that works and it doesn't pan out. But it's definitely more of the latter: you have many ideas, you try them, and you run the hyperparameter search on the ones you think are promising, because you only have finite time. So really, you try many things on the dev set until you decide it's worth submitting a system. I think we submitted maybe two or three times. Three times. We started in the April time frame, so from April to August, five months; every month and a half, roughly, we were submitting a system. We had done enough work in six weeks that there was movement. Now, no guarantees whether we will keep moving or not, but I would say, generally speaking, things tend to move. It's not a huge effort, if you are interested.

[Audience] Thanks for a great talk. I'm going to ask a speculative question about using AMR. I haven't done any work with AMR, although I'm aware of it. You did mention that trying to use AMR for question answering is still ongoing work.
But I have a different use case, where we want to take specs written for software of some kind and try to transform them into formal models, that is, into logical formulas. One way I can imagine doing this is to take something like a dependency parse or a constituency parse and use that as the intermediate structure to transform from, because the data sets we have for this are really small. So I would like to take advantage of any intermediate structure that I can learn reliably well and transform from that to this form. Could you comment on whether, if you were to do something like this, you would trust AMR parsing to be accurate enough, versus starting from something like a statistical parse? Of course, we can try it empirically.

Yeah, I mean, unfortunately, the answer is always the empirical thing. But the question is where you invest. I would say the domain adaptation challenge is really fundamental, even though we have context-dependent vectors to help us improve these things. Because of that, I would be careful about assuming AMR will do a great job at parsing yet, even though I assume these documents are fairly simple in terms of sentence structure. So it may do very well: if the sentences are short and simple, then the AMR cannot go far wrong; it's going to find the answer. As for dependency parsing: AMR is really solving the things you need to do with a dependency parse at the end of the day, because you have to deal with coreference, you have to deal with who did what to whom. So the distinction between AMR and dependency parsing may not be as dramatic as one might think. I don't know; he is more of an expert on semantics, so maybe Bert can answer this question, during your talk or now if you want.