So it's time for our next speaker, Mohit Iyyer. Mohit is an assistant professor in the computer science department at the University of Massachusetts Amherst. His research focuses on designing deep neural networks for both traditional NLP tasks, like question answering and semantic analysis, and new problems that involve understanding creative language, for example modeling fictional narratives and characters. He received his PhD from the University of Maryland, College Park, and then spent the following year at AI2 as a researcher. And today he's going to talk about contextual question answering and generation.

Yeah, cool, thanks. So today I want to talk about both some projects that have already been completed by my group and also a couple of ongoing projects. The students associated with those projects are all here presenting posters, so if you want to learn more about them, you should attend the poster session.

Okay, so broadly I want to talk about three things. Most of this talk will center around generating questions, or generating question-answer pairs, from various forms of context. We'll talk a little bit after that about a specific dataset called QuAC, which involves conversational question answering, so answering questions within a dialogue-style context. And finally we will conclude by talking about some pre-trained language modeling objectives for structured tables instead of just unstructured text, and how those could potentially help with question answering tasks.

Okay, so we'll start with generation. The first thing I want to talk about is this project that my student Kalpesh, in the back, did, called Generating Question Answer Hierarchies. The basic motivation of this task is that if I'm a company or some large organization and I have access to huge amounts of text, I probably want to summarize or otherwise display the important contents of that text in a way that someone new to the company, say a new employee or a new customer, can make sense of it and get the important information out of it. So I'm not just going to give this employee five million documents and say "go learn everything"; I'm going to try to help them through this process.

There are various ways of making a text more readable. Obviously there are formatting things you can do: I divide my document into paragraphs, I have titles like in my papers, I can use various forms of markup to emphasize things. And there are NLP tasks, like summarization, that get at this concept. What we are actually proposing is more along the lines of FAQ generation, as in frequently asked questions: given some document or collection of documents, can we generate a bunch of question-answer pairs that summarize the important information in that document? Most of you are familiar with FAQs and have probably interacted with them, so I shouldn't have to convince you that they are a useful form of knowledge storage. But anyway, if you want convincing, there are citations you can look up.

Our task is called SQuASH: Specificity-Controlled Question Answer Hierarchies. Basically we take a document and we generate this sort of forest of question-answer pairs. There are different levels, as you can see here; actually, maybe my cursor will show up. In this top level of question-answer pairs, we mainly want to generate very general, generic questions, like "What is this document about?" or "Who are the main entities?", stuff like that.
And then for every one of these general questions, we want to unfold them into more specific questions. So for example, if my general question was "Who is the main character in this novel?", then I might ask things like "Where were they born?" or "Who are they related to?", things that are related to that general question. The task is pretty broadly defined, so I can have as many layers of specificity as I want, but of course when we actually did this project we had to simplify things, so we looked at just two levels.

I'll start with an example that our model actually generated, to give you an idea of how this works. This is a paragraph from the Wikipedia article on the band Massive Attack, and one of the paragraphs is talking about how they released some of their songs through this iPhone app called Phantom. One of the general questions that we produced was "What was the iPhone application Phantom?" Seems reasonable. And you can see a specific question, "Who created it?" So this specific question is linked to the general question, which is its parent, and so on. So for every sentence, every important entity in this document, we will generate a question. We might generate specific questions for these general questions, and we'll do some filtering afterwards to make the final tree of questions look somewhat interpretable. That's our goal.

So with our two-level question answer hierarchy we have just general questions and specific questions. One problem that we encounter when we're trying to formulate this as a generation problem is that we don't have data for this. There's no large-scale reading comprehension dataset that has been annotated with "this is a general question, this is a specific question." But luckily we have access to some reading comprehension datasets, and there are previous papers that tell us that certain classes of questions are general and others are more specific. So we basically use these rules to automatically annotate some portion of existing reading comprehension datasets as either general or specific, then we manually annotate some percentage ourselves, train a classifier, and annotate the rest of these datasets automatically.

This is the paper I was referencing, by Wendy Lehnert from 1978. It's actually a pretty fascinating paper: she categorizes every single possible type of question that anyone could ever ask into 13 different classes. We initially started doing it that way; we thought, well, let's devise some rules or train a classifier to detect each of these 13 types, but that's very difficult. So we eventually collapsed the 13 types into just two categories, general and specific. Afterwards we ran a crowdsourced annotation to see whether actual Mechanical Turk workers agree with our annotations: given this question, would you say it's general or specific? And we got pretty high agreement, so the dataset is more or less high quality.
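As a rough illustration of that two-way labeling step, a heuristic labeler might look something like the sketch below. The prefix rules here are made up for illustration and are not the exact rules from the paper; anything no rule covers would go to manual annotation or the trained classifier.

```python
import re
from typing import Optional

# Illustrative prefix rules (hypothetical; the paper's actual rules differ).
GENERAL_PATTERNS = [
    r"^what (is|was) .+ about\b",   # e.g. "what is this article about"
    r"^what happened\b",
    r"^(can you )?describe\b",
]
SPECIFIC_PATTERNS = [
    r"^(who|when|where|which|how many|how much)\b",
]

def rule_label(question: str) -> Optional[str]:
    """Return 'general', 'specific', or None if no rule fires."""
    q = question.lower().strip()
    if any(re.search(p, q) for p in GENERAL_PATTERNS):
        return "general"
    if any(re.search(p, q) for p in SPECIFIC_PATTERNS):
        return "specific"
    return None  # left for manual annotation / the trained classifier
```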
So then, doing a generation task like this is not as simple as just training a seq2seq model on documents and questions. We had this huge pipeline that we used to finally produce something that looked reasonable at the end, so I'm going to briefly go through each of these steps, which also explains how this whole system works. Here we're just dealing with single documents, not collections of documents.

Given a single document, we're going to have a span selection step. This step asks: out of everything in this document, what would make the best answers to any question that I could generate? For example, entities are usually pretty important; I might want to ask a question about an entity or about something important that happened to an entity. So for specific questions, we select just entities and numerics as answer spans to generate questions about. For general questions, we had this idea that their answers are probably longer than just a short span or a single entity, so maybe we can extend their answer spans to sentences or more than that. And I should say at this point that we're using SQuAD-style reading comprehension datasets, so the answer is always marked as a span of text within the document.

Once we have our candidate spans, the things we want to ask questions about, then we, in our paper, train a seq2seq model to generate these questions conditioned on the document and the answer span. But after we submitted this paper and it was published, we decided to improve it by fine-tuning an existing language model to produce these questions instead of training from scratch, and that seemed to work a lot better. So we would highly recommend doing that if you are doing any sort of generation work at the moment.

So we train or fine-tune our question generation system, and then we have to actually generate these questions, which always involves a process of sampling from the trained model. The thing with question generation is that many questions you sample from the model won't be relevant at all: maybe they don't have anything to do with the answer span, maybe they're ungrammatical, maybe it's just something that you would never ask. There are lots of things that could go wrong, which is why we needed this filtering step. One of the main ways in which we filter the generated questions is that we took an existing question answering model and tried to use it to answer the questions that we generated. If the existing QA model produced an answer span for that question that matched the candidate answer span we fed in, that's a good sign that this question is actually reasonable, right? But if it diverged, like if it said the question was unanswerable, or if it said that the answer span is somewhere totally different in the document, then that's probably not a good sign, and maybe we want to throw such a question out. And there were other things too, to get rid of ungrammatical questions and so on. This filtering process was quite painful.

Finally, that was just the process to generate a single question from a single span, but we do this for all of the candidate spans. Then, for those that pass through our filtering pipeline, we need to structure them into this hierarchy. In this project, we don't actually learn the hierarchical nature of these squashed documents; rather, we use a heuristic approach to take the independently generated questions and organize them into a meaningful hierarchy. So here, for example, we have this article about Yoda and Star Wars and so on. For this general question, we generated it from this whole answer span: Yoda battles Palpatine in this lightsaber duel, blah blah blah, that whole sentence. Given that sentence, our model generated "What happened in the battle?"
And then for each of these highlighted entities, like Senate Rotunda, we generated a specific question, "Where was the battle?" As a post-processing step, we just assign each specific question to the general question whose answer is closest. So it's not learned; that's definitely something we want to do for future work. And also, none of these questions depend on each other when we're generating them, so that's another weakness of our model.

Yes? Yeah, so I'll get to that. We generate pronouns, but it's not because the model is actually intelligent and knows that it has already referenced this entity before; it's because some of the datasets we're using are dialogue-based QA, so they have pronouns in the questions. We don't enforce it; oftentimes it does refer to the right entity, but that's just because this question is likely to refer to the same entity as the general question, since they're close in proximity. There's nothing else that forces that to be the case. We could have done some sort of post-processing using coref systems to filter these out as well; we didn't go that far.

Probably. So the underlying datasets are SQuAD, QuAC, and CoQA. I believe those probably contain some level of gender bias, depending on how they sampled Wikipedia articles. (Wikipedia, I think if they talk about one entity, usually the coref will be kind of...) Yeah, so for example, in the QuAC dataset, which I'll talk about in a bit, we only used Wikipedia articles about people, and most Wikipedia articles about people are about men, so that's how you get gender bias in there. Yeah, so it could also be copying from the context.

Yes? (When you generate a specific question, do you condition on the general question?) No. (So you classify them later?) Yeah, so we generate all of them independently, so a specific question doesn't know which general question it belongs to, but we would like to do this. It's just hard to get data for this kind of thing, right? We only really know the general and specific labels, and learning some additional dependence would all have to be latent, which becomes hard. Well, yeah, but through proximity they're more often than not still related; but yes, there's no constraint that makes them this way. Yeah, we could plausibly; it might be something to try in the future. (So one thing I realized: when we're asking questions, maybe we can say our name and where we're from, so that others can also know us.) Yeah.

Okay, so if you have more questions about this, please talk to Kalpesh at his poster, which is also about this project. We will also have a demo, so you can try it out on random articles. Yeah, so I guess I'll skip over these. One thing I did want to show is the common types of errors that our model makes. Some of these are because of issues that you have raised already, like the lack of dependence between general and specific questions and between adjacent questions. But still, there are some good things; it's not all bad. So in this example on William Jennings Bryan, the general questions are actually quite interesting and informative. "What was a treaty?" is pretty generic, but "Why was this bad?" is a nice question, right? People might want to know why this treaty was bad, and "What was the result of the resolution?" But on the right-hand side, things sort of fall apart. So here, the general question is "What are his parents like?"
If you look at the specific questions underneath, we get "Who was born in Springfield?", and we get a good answer for this. Then "Where was Weston born?", so I guess it didn't believe the previous question's answer. "Who were his parents?" "Where did he move to?" And then this question, "How old was Weston when he was born?", which clearly shows a lack of common sense. Yeah, so we would hope to improve. I think this question, at least, was generated from our trained-from-scratch model; I wonder what would happen with a fine-tuned model if it were evaluated on the same document. Hopefully that wouldn't happen, but yeah.

Okay, so we evaluated the system primarily through crowdsourced human evaluations, and in summary, people liked the quality of the questions. They thought they were relevant to the document, and they thought that more or less the hierarchies seemed reasonable compared to just a random hierarchy, so that's not a very strong statement. But qualitatively, we don't generate many insightful questions, and I think one reason for this is that the datasets we're training on, SQuAD and these dialogue QA datasets, themselves don't contain interesting questions, right? They're more like "Where was this person born?" or "What team did this other team play?" These are maybe useful for companies who want to isolate this kind of information, but we were more interested in questions that seem natural, that people would actually ask when they're trying to learn about some new topic. And of course, there are all the problems that you've raised with the discourse. There's also the common sense problem: the group killed their audition when they showed up in costume, and we ask "How did the group die?" Again, I don't know what would happen with the fine-tuned model, but yeah, this is obviously a problem with many generation systems.

Okay, and then one of our main motivations was that this could help people learn about things, right? One question that we actually wanted to answer when we started this project was: is this format of hierarchical question-answer pairs, or FAQ-type things, more useful for someone learning about a document than a traditional summary? And this is not a question that we were able to answer in the end. There is lots of prior work showing that hierarchies are generally good, but this explicit comparison between FAQs and summaries has not been done before. What we are planning to do in the future to actually test this is run a human study. One of the issues is that our system is not good enough at the moment, compared to a traditional summarization system, for us to run a fair comparison. So what we would like to do is manually create some of these squashed versions of documents and compare them to gold summaries in a summarization dataset. But we're still trying to figure out the best way to set up this user study: what is the end goal of the person participating? How do we measure how much they've learned? Do they take a quiz after reading the summary? These are all questions that, as NLP researchers, we're not really experienced in answering, but we want to do it to advocate for this format in the future. All right, and if you want to play around with our demo, you can go to this website and dump in random paragraphs and see what happens.
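To make the pipeline concrete, here is a minimal, heavily simplified sketch of the flow described above: select candidate answer spans, generate a question per span with a fine-tuned generator, keep only questions that an off-the-shelf QA model answers with roughly the same span, and attach each surviving specific question to the nearest general question. The `generate_question`, `qa_model`, and `overlap` callables are placeholders for whatever generator, QA model, and span-similarity measure are plugged in; this is a sketch of the idea, not the actual SQuASH implementation.

```python
from typing import Callable, Dict, List

def squash_document(
    sentences: List[str],
    candidate_spans: List[Dict],  # each: {"text": ..., "sent_idx": ..., "level": "general" | "specific"}
    generate_question: Callable[[str, str, str], str],  # (document, answer_span, level) -> question
    qa_model: Callable[[str, str], str],                 # (document, question) -> predicted answer span
    overlap: Callable[[str, str], float],                # span similarity, e.g. token-level F1
    min_overlap: float = 0.6,
) -> List[Dict]:
    document = " ".join(sentences)
    kept = []
    for span in candidate_spans:
        question = generate_question(document, span["text"], span["level"])
        # Filtering step: keep the question only if a QA model, given the
        # generated question, points back to (roughly) the span we conditioned on.
        predicted = qa_model(document, question)
        if overlap(predicted, span["text"]) >= min_overlap:
            kept.append({"question": question, **span})

    # Heuristic hierarchy: attach each specific question to the general
    # question whose answer span is closest in the document (no learning here).
    generals = [q for q in kept if q["level"] == "general"]
    tree = [{"node": g, "children": []} for g in generals]
    for q in kept:
        if q["level"] == "specific" and tree:
            parent = min(tree, key=lambda t: abs(t["node"]["sent_idx"] - q["sent_idx"]))
            parent["children"].append(q)
    return tree
```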
So before I stop talking about generation, I wanted to quickly discuss some current work in progress that extends what we attempted to do here to a specific type of question that the SQuASH system could not generate before, or at least not well: hypothetical questions. What we did was go through the Common Crawl, which is just a huge dump of the internet, and extract all questions from it. This was just with simple rules, like taking any sentence that ends in a question mark. And you get a ton of them: 570 million questions out of the Common Crawl. A lot of these are garbage, so we tried some filtering to reduce the amount of garbage. We threw out all short questions, since a lot of those are just people expressing some emotion or something, and when we did that, we got about 280 million. We also wanted some context for these questions, so we decided to keep any question that has at least four sentences on each side, so we can try to model the context that led to generating that question as well as possible answers that occur after the question.

Then we did some annotation on these questions. The Common Crawl is very noisy, right? Questions come from Reddit or random blogs, random articles, there's lots of non-English stuff; it's just a complete mess. So we did some annotation to see how much of the text in these contexts is actually coherent. It's actually not as high as we had hoped: about 23% of the contexts are unintelligible. As for how meaningful the questions are, more often than not they make some sense and they're relevant to the context. And yeah, I guess I'll skip over this.

But the more interesting part is that we did an analysis of what types of questions people ask in the Common Crawl, as opposed to, say, SQuAD, where you just assume that everyone wants to ask factoid questions or questions about numerics. In the Common Crawl it's very different. Many of the questions are rhetorical, which makes sense, right? These are people writing questions; they're maybe not asking them to be answered, but rather using them as a writing device. Factual questions, where people are asking for some sort of fact or information, are only about a fourth of this dataset. A lot of it is people asking for an opinion, like "How do you feel about this product?" or something like that.

Yeah. Yeah, yeah, so sorry, I should have said that. So we took, I forget how many questions, 200? Yeah, 260 questions, and I just had my lab annotate them for about an hour and a half, and that's how many we could annotate. We just selected them at random. This is pretty unreliable at the moment; we haven't yet computed agreement on this whole thing, and we need to get multiple annotators, but it's still kind of interesting. Yeah, we might have to do more, yeah.

But what we were actually interested in was the 18% of questions that were hypothetical in nature. This kind of makes sense; hypothetical questions are often used as a writing device. So let's look at the difference between simple factual questions and hypothetical questions. Here: "What is the state capital of Minnesota?" "How many babies were born in Massachusetts?" "Where do we go to register for graduation?" These are all things you might find in SQuAD, right? But what we find in the Common Crawl are more questions like "If Minnesota and neighboring states merged, where would the resulting state capital be?"
Or this other, longer one. So in general, these hypothetical questions consist of two parts. There's an imaginary scenario, and it ranges from very plausible situations, like "What would happen if so-and-so took office in such-and-such year?", to very fanciful scenarios, like "What would happen if a unicorn attacked tomorrow?" or something. So there's a wide range of these imaginary scenarios, and then there's the question that you ask about the hypothetical: given this imaginary scenario, I need to ask something that people would find somewhat interesting about that scenario. So there's a lot of common sense knowledge involved, I think, in trying to figure out whether this is a reasonable hypothetical to ask given this context, and if so, what I would wonder about this hypothetical, out of the set of things that are reasonable to wonder about.

So we did some very preliminary experiments where we used rules to identify hypothetical questions from the Common Crawl. These are basically things like: does it contain the word "if", does it have a certain length, does it have a prefix like "what would happen if", and so on. We can definitely do better, but we just wanted to see what would happen. So this context here is about the Super Bowl, and the gold question is "If you had a choice, which one of the three would you rather listen to on television?", out of three broadcasters, I guess. When we're generating these hypotheticals, we seed our model with a prefix that is associated with hypothetical questions, for example "what would" or "who would"; the word "would" indicates that you're likely going to ask about some sort of hypothetical situation. But if you actually read these, they don't make much sense within the context, even if we give the actual gold prefix. The gold question in the Common Crawl was "If you had a choice, blah blah blah," so we fed the language model "If you had a choice," and it said "If you had a choice between working four hours a day at whatever that is, or having to leave, which one would you choose?", which is completely irrelevant. So this is one of those cases where we were really unable to find many, if any, examples where GPT-2 or its fine-tuned variant generated even a remotely plausible question. That's kind of interesting, and also a little underwhelming; we were hoping for better.

Yes? Yeah, on our training split of that. Yes. Four. No, it's a randomly picked number. Yeah, that could also be interesting, maybe increasing that. Yeah. Okay, so yeah, this is still work in progress. We're hoping that by using some of these newer common sense datasets and models, such as COMET, which was recently proposed by some UW researchers, we'll be able to improve on this task. But it's still pretty fun, and we'll have to figure out what we'll actually use this for eventually.
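A minimal sketch of this kind of rule-based extraction and hypothetical-question filtering, assuming documents are already split into sentences, might look like the following; the thresholds and prefix list are illustrative stand-ins, not the exact rules used in the project.

```python
import re
from typing import Iterator, List, Tuple

# Illustrative hypothetical-question cues (assumed, not the project's exact rules).
HYPOTHETICAL_PREFIXES = ("what would happen if", "what would", "who would", "what if")

def extract_questions(sentences: List[str],
                      context_window: int = 4,
                      min_tokens: int = 6) -> Iterator[Tuple[List[str], str, List[str]]]:
    """Yield (left context, question, right context) triples from one document."""
    for i, sent in enumerate(sentences):
        if not sent.strip().endswith("?"):
            continue
        if len(sent.split()) < min_tokens:        # drop very short / exclamatory questions
            continue
        if i < context_window or i + context_window >= len(sentences):
            continue                               # require enough context on both sides
        yield (sentences[i - context_window:i], sent,
               sentences[i + 1:i + 1 + context_window])

def looks_hypothetical(question: str) -> bool:
    """Crude cue-based check for hypothetical questions."""
    q = question.lower().strip()
    return q.startswith(HYPOTHETICAL_PREFIXES) or bool(re.search(r"\bif\b", q))
```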
Okay, so I think I'm low on time, so maybe I'll go fast through this next part. Here I was going to talk about our QuAC dataset, which came out last year. I'll basically just skim through it and show some examples, but I mainly want to highlight the motivation. It's another type of contextual question answering problem, one that in our SQuASH project we cast as a question generation problem where we just ignored the dialogue context. But here our main motivation was to study information-seeking dialogue.

So suppose you want to learn about an area, but you know nothing about it, or you know very little, and you're talking to an expert who actually knows a lot about it. What questions would you ask? How would they respond, so as to encourage you to ask more questions? And what is the possible space of dialogue trees that could be produced from one of these dialogues? I have an example here that I'll go through. It's pretty dumb, but it was a real example of an information-seeking dialogue that I had when I went to Montana. I was hiking and I saw a sign about bear attacks, so I was afraid. So I, as the student here, was asking the hiking guide, "How big are grizzly bears?" And the guide said they're huge; the adults can be insanely large. So then of course I was wondering, "Do they attack humans?" And the guide said rarely, but that's not never. So I asked, "Since they're so big, how do I protect myself?" And the guide said this very frightening statement: bear spray is usually effective. Of course, once I hear this I'm going to ask more: "When is it not effective? What do you mean by 'usually'?" And sometimes they're just impervious to it and walk through it. "So then what do you do?" And the guide says, oh, well, you should play dead. "Can I climb a tree instead of playing dead?" But they can climb trees too, apparently. And then there's a dumb question that I asked at the end. But you can see how this dialogue really unfolded as a result of the answers I was receiving from the teacher, the expert. It's not like I knew everything before starting this conversation; the things that I learned influenced the questions that I asked next.

So we tried to imitate this process in the data collection framework, where, unlike SQuAD and some other datasets, we did not let the person asking the questions see the context that they were asking questions about. I'll skip over this, but we basically paired up two Mechanical Turk workers in a chat room. One of them had a Wikipedia article; the other had no access to this article, only the title. The one who had just the title, the student, had to ask questions, which were answered with spans of text from the document by the worker who had access to it. We got about 14,000 of these dialogues, with about 100K questions total. And it's pretty cool: there are a lot of interesting dialogue phenomena going on in this dataset that people have mostly ignored in favor of improving their leaderboard scores, which is fine also. But one thing that was interesting about this dataset when it came out is that there are a lot of different types of questions. Compared to datasets such as SQuAD, we have a lot more "why", "how", and "how did" kinds of questions; I think people ask these more frequently when they don't already know the answers.

Okay, so we evaluate QA models on this dataset using the same kind of SQuAD-style F1 span overlap evaluation. In 2018, our baseline model with BiDAF and ELMo, which at that time was the best thing out, got an F1 of 57. We are kind of proud now that this is one of the few datasets that BERT has not just destroyed; it has improved things significantly, but there's still something left. So this is a snapshot of the latest leaderboard. A few days ago there was a history-attentive BERT model which got an F1 of 73, so there's still some headroom here.
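For reference, the SQuAD/QuAC-style span F1 is just a bag-of-tokens overlap between the predicted answer span and a gold span. A minimal sketch is below; it leaves out details like text normalization and taking the max over multiple reference answers.

```python
from collections import Counter

def span_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer span and a gold answer span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```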
One thing that we are kind of worried about, which is why I asked you that question earlier, is that we've noticed that many of the more recent submissions to our leaderboard are from these giant industry labs. We're kind of worried that the architectures or models being proposed are not actually what's getting them this performance, but rather the fact that they are able to do much more hyperparameter search than, say, an academic research lab. Yes. Yeah, but it is sad for the ideas that may have worked, where we just weren't able to find the right hyperparameters. Yeah, I think the more likely thing is that at some point everyone will try tuning, it'll be tuned too much like SQuAD was, and then many of those models will not be used again, but we'll move on to the next task, and hopefully there will be something that links all of these leaderboards together.

Okay, so I will go quickly through this final part, which is about shifting the ideas that people have had about contextualized, large-scale pre-trained language models to structured data in the form of tables, where I think we can also benefit from these same ideas. To give an example of the kind of QA problems I'm interested in in this space, imagine an HTML table or some small database; here's one from Wikipedia on the women's water polo World Cup. We have some columns, we have some cells, and we might have a question like "Which nations competed in the water polo World Cup?" We treat this as a semantic parsing problem, which is different from reading comprehension: here we want to produce a logical form that we can execute over this database table to give us the answer. So here I might, in a SQL-like language, generate something like "select nation", and this can be executed to return the cells in the answer.

So we had this dataset, which is kind of similar to QuAC in that it basically does conversational QA over tables. I can have contextualized questions like "Of these nations, which one took home at least one gold medal?", which refers to the previous question's answer. Yes? Oh, you mean, wait, sorry, I didn't quite understand. Oh, yes, sure. Yes. Yeah, who knows what this table is actually showing, but yeah, it's probably the full one. Yeah, yeah, no, I definitely chopped it off to fit on the slide.

Okay, so here's a quick high-level overview of how we solve problems like this. When we collect the data, we again have Mechanical Turk workers write these questions and then highlight the cells in the table that correspond to the answers. So here, just like SQuAD and other datasets, we have the constraint that the answer must be in the table. With this setup, we do not have them provide the logical form; we don't have them write out a SQL statement, since you can't expect that from a Mechanical Turk worker. So we only know the final answer and the question, but not the intermediate logical form. We treat this as a semantic parsing problem with only weak supervision, and in this particular paper we use something called reward-guided structured output learning to solve it. I'm just going to go over the intuition of how this works. If you have a question like "Which nations won exactly one gold medal?", we're going to search through the space of all possible valid SQL commands. We're going to search intelligently, guided by a policy function, but it's basically a search.
At the end of the search, after we've found some number of SQL statements, we're going to pick the one whose result has the most overlap with the ground truth answer when we execute it over the table. So for example, "select nation where rank equals four" gives me the identical answer to the question "Which nations won exactly one gold medal?" Of course, this logical form does not correspond to this question; it's just a coincidence that the answers happen to be the same. But through this search process, we might find this path, and we're going to use the path that we found as supervision for training the model to favor this SQL statement next time around. So it's like an approximate ground truth that we find in one pass, and then we update our model in another pass.

But this is sort of beside the point I want to make, which is that the model we trained on this dataset is very bad: it gets something like 12% accuracy. So we did an error analysis to figure out what is going on here. Granted, this was a couple of years ago, so maybe with newer methods we would do better. But one of the biggest problems was that we lack enough world knowledge to match entities or other text in the question to cells or headers in the table. As an example from our dataset, this is again a subset of a table with three columns, call sign, city, and genre, and a question like "What radio station plays synth wave music?" To answer this question, you really need to be able to map "radio station" in the question to "call sign" in the table. But this is just something the model is going to have a tough time learning, and these are not even all of the columns in this table; there are probably 15 or 20 of them, so it becomes very challenging.

So this work with June, who's sitting over there and presenting a poster about it, asks: can we improve on this problem by using contextualized table representations? Traditionally, you might encode each of these cells separately, with, say, GloVe embeddings or fastText or BERT or whatever, and then learn from scratch some model on top that combines information across the cells in the table. But that model can actually be learned through some sort of pre-training process, and that's what we want to do here. I think people are now using the term self-supervision for things like this, so we're also going to use it. Given a table, we can design a bunch of self-supervised objectives that can help us learn good representations for cells or rows or columns. For example, given the representation of a cell, can I predict which of the column headers it belongs to? I can get this supervision for free if I have a ton of tables. And what June has done is, again using Wikipedia, extract 1.5 million tables, so now we have a pretty big dataset to train these kinds of representations on. That's just one example of an objective; you can also do things like, given two cells, should they be in the same column or not? We can have embeddings for rows in the table and then ask, given this row embedding, do these cells belong to this row? And so on. So there's a lot of room for exploration here. And yeah, I guess I'll conclude by saying you should talk to June for more details about that.
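A minimal sketch of the weak-supervision idea just described: enumerate candidate programs, execute each against the table, and treat any candidate whose result matches the annotated answer as an approximate ground truth for the next training update. The toy `Table` class and the brute-force enumeration below are stand-ins for the real table representation and the policy-guided search.

```python
from typing import Dict, List, Optional, Set

class Table:
    """Toy table: a list of rows, each a dict mapping column name -> value."""
    def __init__(self, rows: List[Dict[str, str]]):
        self.rows = rows
        self.columns = list(rows[0].keys()) if rows else []

    def execute(self, select_col: str,
                where_col: Optional[str] = None,
                equals: Optional[str] = None) -> Set[str]:
        out = set()
        for row in self.rows:
            if where_col is None or row[where_col] == equals:
                out.add(row[select_col])
        return out

def find_consistent_programs(table: Table, gold_answer: Set[str]) -> List[Dict]:
    """Brute-force stand-in for reward-guided search: keep every simple
    'SELECT col WHERE col2 = val' program whose execution matches the answer."""
    consistent = []
    for select_col in table.columns:
        for where_col in table.columns:
            for value in {row[where_col] for row in table.rows}:
                if table.execute(select_col, where_col, value) == gold_answer:
                    consistent.append({"select": select_col,
                                       "where": where_col, "equals": value})
    return consistent
```

Any program found this way, including spurious ones like the "rank equals four" example above, would be used as a pseudo-gold target when updating the parser, which is exactly why spurious matches are a problem for this style of training.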
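And here is a minimal sketch of one of the self-supervised table objectives mentioned above: predicting a cell's column header from its representation. The cell encoder itself is assumed to exist elsewhere; this only shows the classification head and loss, not the project's actual model.

```python
import torch
import torch.nn as nn

class CellToHeaderObjective(nn.Module):
    """Given a cell embedding, predict which column header it came from.
    Labels come for free from the table structure itself."""
    def __init__(self, cell_dim: int, num_headers: int):
        super().__init__()
        self.classifier = nn.Linear(cell_dim, num_headers)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, cell_embeddings: torch.Tensor, header_ids: torch.Tensor) -> torch.Tensor:
        # cell_embeddings: (batch, cell_dim); header_ids: (batch,) column indices
        logits = self.classifier(cell_embeddings)
        return self.loss_fn(logits, header_ids)
```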
That's a good question. Yeah, so that's a good question too. We are looking at it like this: if we have better table embeddings, it'll just help us in general, but there is the problem that our question text is not included. So that's another thing we're considering including in this table embedding project, where we also extract the sentence or text in Wikipedia that references the table. Yeah, so that's something we're also considering doing. It's just kind of hard, because the tables are often referenced very obliquely; it might just say, parenthetically, "table four," and it's not very clear why it's being referenced, or it may be referenced for some very specific thing. But yeah, this is definitely something we're considering doing in the future. I'm not familiar with this work. Oh yeah, so I don't think Percy ever did something at a huge scale, oh yeah, yeah. So WikiTableQuestions, yeah. So our dataset, the one these questions came from, was built from WikiTableQuestions; we basically asked people to decompose those questions, or have a conversation about these tables. Yeah, so it's using the same tables as that dataset.

All right, so that's it. Thanks. We have a quick minute for a question; anyone have any? (So I have a quick question. Some of your slides were showing examples of question generation from existing question answering systems that work for English. Do you have any ideas about fine-tuning them for other languages?) Yeah, so one thing that we noticed is that the Common Crawl has huge amounts of text in other languages as well, and a lot of the stuff we're doing now relies on the Common Crawl. Obviously English is the primary language, but oftentimes in the pre-processed dumps of the Common Crawl that people release, they remove all non-English text. So I think if you wanted to do stuff in different languages, well, oftentimes people do things on other languages' Wikipedia pages, but those are usually very small for many languages. One of my colleagues at UMass, Brendan O'Connor, did this study mapping the number of Wikipedia articles for a particular language to the number of speakers that language has, and there are so many languages with a huge number of speakers, lots of Indian languages for instance, that have very small Wikipedias. So I think we need to look at other sources of data for these things. Creating resources like SQuAD and so on for other languages is obviously very expensive, so that's why we've kind of shifted to just using the Common Crawl and things like that. But I don't know, it might not be enough. Yeah.

(The question will be quick; I don't know if the answer will be quick. So you've created a bunch of question answering datasets now, right? I want you to comment a little bit on how difficult it is to create datasets that do not have weird artifacts and things like that. Any lessons learned, and do you see how to improve the way we are building datasets?) That's a great question. So this is why I don't want to create such crowdsourced datasets in the future; I'm done with it. Even with QuAC, which was probably the dataset we spent by far the most effort on trying to remove or reduce noticeable artifacts, it was very challenging. There are so many ways for crowd workers to game any sort of payment system. For instance, they realized that by having longer dialogues, they got paid more; this was one of our incentives to make the dataset bigger.
But then we noticed people typing, in the question text itself, things like "please don't say unanswerable because it will kill our dialogue, so we need to continue." (Do I just have a minute?) So there are so many things like this. You try to do something good, you try to encourage longer dialogues, for instance, but it always backfires. And another point is that Mechanical Turk workers, for instance, won't read a long set of instructions, and many of the tasks that NLP researchers want to collect data for are actually very complicated: they have lots of edge cases, they have desired behavior in certain instances, and the workers are not going to read all of that. That's why, if you look at SNLI for example, the instructions that were shown to the crowd workers are very vague, and I think that is actually the source of many of the artifacts in that dataset. What is a neutral pair of sentences versus a contradiction? It's very blurry, and the examples weren't clear. One thing I think the field should do is run a study on the impact of all of these annotation-design decisions on the quality of the resulting dataset: if I reworded my instructions in this way and showed a different set of examples, would that give me better agreement, or fewer artifacts, or something like that? I think that'd be pretty cool. But yeah, right now I'm not doing any more such projects.