Welcome. This is the first of our AI Horizons Network seminar series. For our first seminar, we will be having Emma Strubell. Emma is a Ph.D. student at UMass Amherst. She's going to be talking about her EMNLP paper, which won the Best Long Paper Award at EMNLP 2018, titled Linguistically-Informed Self-Attention for Semantic Role Labeling. Emma is a student in Andrew McCallum's group at UMass, with a previous degree from the University of Maine, and has spent time both with the Alexa NLU team and at Google AI in New York City. So broad background, lots of interests, and this should be an outstanding talk. The intent is that everybody will be muted while Emma is talking, and then we'll open it up for questions afterwards. So fingers crossed that this will all work for our first seminar. I have the chat window open here, so if you have concerns or you can't hear or something, there's a text chat facility if you can figure out the icons, and you can send messages to me and we can try to fix things in the background. But without further ado, Emma?

Great. All right. Thank you so much for the introduction. So today I'm excited to present LISA, which is a new method for multitask learning where supervision for transfer between tasks is done through the attention mechanism, yielding new state-of-the-art results for SRL. This is joint work with my labmate Pat Verga, my collaborators Daniel Andor and David Weiss at Google, and my advisor Andrew McCallum. And as Brent said, this was originally presented at last year's EMNLP conference. So over the past 10 years or so, NLP research has made so much progress that it's no longer just an object of study, but something that practitioners want to run at massive scale in order to extract meaning from text. So we need these techniques to be computationally efficient while also obtaining high accuracy, not just on news articles, but across different domains. And the model I'm going to present unifies deep neural network models with linguistic structure in order to perform fast, accurate, and robust NLP. And in this work, we're going to focus on the extraction task of semantic role labeling, which I'm going to briefly introduce now.

Okay. So the task of semantic role labeling is typically cast as who did what to whom. Basic models for this can be built by designing rules over parse trees, but that doesn't tend to scale well across diverse sentence structures. So instead, we typically directly label the spans corresponding to the predicates and their arguments in the sentence. What this looks like is, for example, in this sentence, we would identify the predicates awards and advance, and we can then extract the arguments with respect to each of these predicates. So with respect to the predicate awards, the agent argument, or the one who's doing the awarding, is the committee. Its theme, or the thing being awarded, is the Nobel. And the beneficiary, or the one receiving the award, is Strickland. So that is the labeling with respect to the predicate awards. Similarly, for the predicate advance, the agent is Strickland and the theme is optics. So these are the two semantic role labelings of the sentence with respect to the two predicates. And specifically, we're going to be using the PropBank SRL framework, which assigns more general argument labels shared across all the predicates. So this is what those labels actually look like in practice. So we want to extract semantics, but understandably, this task is deeply intertwined with syntax.
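To make the target output concrete, here is a rough sketch of the kind of structure an SRL system would extract for that example. The sentence wording and the PropBank label names attached to each role are assumptions for illustration, not taken from the slides.

```python
# Hypothetical example sentence, assumed for illustration (not the exact slide text):
sentence = "The committee awards Strickland the Nobel for advancing optics."

# One semantic role labeling per predicate. PropBank-style labels are shown with
# the informal readings from the talk (agent / theme / beneficiary); the exact
# label assignments here are illustrative assumptions.
srl_frames = [
    {
        "predicate": "awards",
        "arguments": {
            "ARG0 (agent)": "The committee",     # the one doing the awarding
            "ARG1 (theme)": "the Nobel",         # the thing being awarded
            "ARG2 (beneficiary)": "Strickland",  # the one receiving the award
        },
    },
    {
        "predicate": "advancing",
        "arguments": {
            "ARG0 (agent)": "Strickland",
            "ARG1 (theme)": "optics",
        },
    },
]
```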
And historically, SRL models have indeed relied heavily on syntax. So looking at the past 10 years of semantic role labeling on the benchmark CoNLL-2005 shared task, early SRL models did rely heavily on syntax, typically using a combination of syntactic features on the input as well as syntax-based constraints during inference. More recently, there's been a trend where end-to-end deep neural networks, which use no syntax, are surpassing their linguistically informed counterparts. And this is a trend throughout NLP, not just for semantic role labeling, but across many basic NLP tasks. And these models have indeed improved substantially over models that use syntax, getting up to a 23% error reduction. But the trends I've just described are mainly when evaluating on text that comes from the same domain the model was trained on. Out of domain, we've seen much less impressive improvements. So in that setting, deep neural models have so far reduced the error rate by only 8%. And as I'll describe later, there have been some attempts to inject syntax into neural models, but the results on out-of-domain evaluation of these models have been mixed. So in this talk, I'm going to show that syntax really can help in the midst of deep learning, with the right modeling. Our LISA model obtains an additional 10% reduction in error over the best neural network model when evaluating out-of-domain, and it also improves in-domain SRL by an additional 6%.

So now I want to overview the key components of the LISA model. The first component is multitask learning across four related tasks: part-of-speech tagging, labeled dependency parsing, predicate detection, and semantic role labeling. But we do more than just simple parameter sharing. The interaction between these tasks is through what we call syntactically informed self-attention. Here we're going to supervise one attention head to attend to each token's syntactic parent, and through this supervision, the model learns to use that attention head as an oracle, providing syntactic information to downstream layers. At test time, we can use either the syntactic structure that's predicted by the LISA model itself, or we can inject syntax from any external parser to improve SRL accuracy without having to retrain the SRL model. And another nice benefit of this multitask learning is that we only have to encode the sequence once to predict all these tasks, unlike most previous work.

Okay, so now I've given a broad overview of what LISA is, and I want to describe in more detail exactly how it works. I'm going to begin by describing multi-head self-attention, since we modify the internals of that mechanism in order to better model information flow between the different tasks. And for those who are familiar, this is going to be exactly the encoder portion of the transformer model that was introduced by Vaswani and others at NeurIPS in 2017. So the way that self-attention works is: given a sentence and some embedded representations of those tokens, say at layer p in the neural network, the version of self-attention that we're going to use works by first projecting each token into three distinct representations corresponding to that token's role as a key, a query, or a value in the self-attention. For each token, its query representation is compared with the key representation of every other token using a scaled dot product. And this gives, for each token, a score with respect to every other token in the sentence.
These scores are normalized by the softmax function, so this gives attention weights, summing to one, between each token and every other token in the sentence. For each token, these attention weights are then used to perform a unique weighted average over the value representations of all the other tokens. And this results in a new attended representation for each token, where that token has observed the representations of all the other tokens, weighted by their importance to that token. So this is the basic self-attention model. An important aspect of the transformer model is what's called multi-head self-attention. All this means is that at each layer in the network, we actually have H distinct sets of self-attention parameters, which are normalized separately, and so they're learning H distinct self-attention functions. This sort of just adds more capacity to the model. Okay, so the H outputs of each attention head for each token are concatenated, and then that concatenated representation is passed through a feed-forward layer. Finally, this representation is added to the initial input of the layer via a residual connection, and this gives us the representation at layer p plus one, which is then the input to the next layer. Okay, so we stack J layers of this self-attention. And on the input, we use pre-trained word representations; in our experiments, we tried both GloVe word representations and ELMo contextualized word embeddings.

Okay, so now that I've described how basic multi-head self-attention works, I want to describe how we modify this to incorporate syntactic information. But first, I want to put our approach into a little bit more context, to help explain how we arrived at our technique for integrating syntax into a neural network SRL model. One straightforward approach is to combine SRL and parsing through multitask learning with hard parameter sharing. But just like a single-task end-to-end model, this will overfit to the training domain, and the model will have to be retrained from scratch in order to leverage improved syntax models or data. More recently, other people have incorporated syntax into neural models for dependency-based SRL, either through dependency path embeddings or a graph CNN over the parse tree. And while these techniques can leverage new syntax without retraining, they only incorporate limited syntactic context, sort of limited to a window around a given token. And it's perhaps for this reason that they have observed mixed results on out-of-domain data. So this brings us to our model, syntactically-informed self-attention. I think this is a really natural way to incorporate syntactic information: essentially, each token just attends to its likely syntactic parent. This allows the model to learn more global features over the parse, since in subsequent layers each token attends to all the other tokens, so it's effectively observing not only its own parents but those of all the other tokens, with sort of increasing distance as the model gets deeper. And it combines the best of both worlds of these previous models. As in typical multi-task learning, at test time the model can use its own predicted parse, or, like methods that use dependency paths or graph CNNs, an externally generated parse from a separately trained parser can be supplied to improve SRL performance without having to retrain the SRL model. Okay. So now I'm going to depict exactly how this works.
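Before getting to that, here is a minimal NumPy sketch of one plain multi-head self-attention layer as just described: per-head query/key/value projections, scaled dot-product scores, softmax attention weights, a weighted average of values, concatenation of the H head outputs, a feed-forward layer, and a residual connection. The shapes and names are illustrative assumptions, not LISA's actual implementation.

```python
# A minimal sketch of one multi-head self-attention layer as described above.
# Shapes, sizes, and names are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_layer(X, Wq, Wk, Wv, W1, W2):
    """X: (n, d_model) token representations at layer p.
    Wq, Wk, Wv: (H, d_model, d_head) per-head projections.
    W1: (H * d_head, d_ff) and W2: (d_ff, d_model) feed-forward weights."""
    H, _, d_head = Wq.shape
    heads = []
    for h in range(H):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]   # query / key / value projections
        scores = Q @ K.T / np.sqrt(d_head)          # scaled dot-product scores
        A = softmax(scores, axis=-1)                # attention weights sum to 1 per token
        heads.append(A @ V)                         # weighted average of value vectors
    concat = np.concatenate(heads, axis=-1)         # concatenate the H head outputs
    ff = np.maximum(0.0, concat @ W1) @ W2          # feed-forward layer (ReLU)
    return X + ff                                   # residual connection -> layer p + 1
```

Stacking J of these layers, with pre-trained GloVe or ELMo vectors as the layer-0 input, gives the kind of encoder described in the talk.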
So returning to our depiction of multi-head self-attention: typically, each attention head is left to learn its own attention function from the data. Instead, what we do is train one attention head to attend to each token's syntactic parent. And we do this by replacing the query and key mechanism in that attention head with the biaffine syntactic parser of Dozat and Manning, which is a really good, simple graph-based syntactic parser. This produces a matrix of edge scores between each pair of tokens, and we can simply drop this matrix in to replace the attention matrix that would typically be generated by the query-key pairs. Otherwise, this layer acts like any other self-attention layer. And we choose one layer in the network as the syntactically informed layer; which layer that is, is essentially a hyperparameter.

Okay. So now that I've described how we incorporate syntax into the attention mechanism, I want to describe the other important aspect of LISA, which is how we do all these tasks, including predicate detection, in just one single pass through the network. So here's the network I've described so far. At an early layer, we predict parts of speech and predicates. And since the two labels are highly related, since predicates are very often verbs, and in order to reduce the complexity of training the multitask model, we predict into the joint cross-product space of part-of-speech and binary predicate labels. Then at the syntactically informed layer, following Dozat and Manning, we share the parameters of the parser that predicts edge dependencies to also predict dependency labels on those edges. Okay, so that's how we get labeled dependency parsing.

All right. So now I'm going to demonstrate how we do semantic role labeling with respect to each predicate. This is in contrast to previous work, where predicates are provided on the input by modifying the predicate word with an indicator embedding, which requires you to feed the sentence through the network independently with respect to each of the predicted predicates. But since we predict predicates in the model, we can't do that, because we don't have that information a priori. So now I'm going to describe how we model this. Okay. We first project each final-layer token representation into an argument-specific representation of that token, representing that token's role as an argument to a predicate. And then, for each token that was predicted to be a predicate, we project that token into a predicate-specific representation. We then combine these representations using a bilinear operator to score the compatibility between each token and each of the predicates. So here I've depicted scoring Nobel with respect to the predicate awards. And what we actually want to score for each token is not just a binary compatibility, it's which SRL label that token has with respect to that predicate. So we actually have a bilinear matrix scoring the predicate with the token for each SRL class label, and this gives us SRL label predictions with respect to that predicate. In the same way, we score each token with the first predicate to get a complete labeling of the sequence with respect to that predicate, and we do the same for each predicted predicate. So in this case, we have one other predicate, advance. Okay.
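To make these two pieces concrete, here is a hedged sketch, in the same NumPy style as before, of (1) an attention head whose query-key scores are replaced by biaffine parser edge scores, so each token attends to its likely syntactic parent, and (2) the bilinear predicate-argument scoring for SRL labels. Function names, variable names, and shape conventions are assumptions for illustration, not LISA's actual code.

```python
# Sketch only: illustrative shapes and names.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def syntactically_informed_head(X, parse_edge_scores, Wv):
    """X: (n, d_model); parse_edge_scores: (n, n), where row i scores each token
    as the syntactic parent of token i, e.g. from a biaffine parser."""
    A = softmax(parse_edge_scores, axis=-1)   # parser scores replace the query-key scores
    return A @ (X @ Wv)                       # the rest of the head is unchanged

def srl_label_scores(X, predicate_idx, Wpred, Warg, U):
    """X: (n, d_model) final-layer representations; predicate_idx: indices of the
    predicted predicates; U: (d_pred, n_labels, d_arg) bilinear operator.
    Returns (n_predicates, n_tokens, n_labels) SRL label scores."""
    preds = X[predicate_idx] @ Wpred          # predicate-specific representations
    args = X @ Warg                           # argument-specific representations
    # Score every (predicate, token) pair for every SRL label; with the label
    # dimension folded in, this is essentially two big matrix multiplies.
    return np.einsum('pd,dle,ne->pnl', preds, U, args)
```

Normalizing over the label dimension then gives, for each predicted predicate, a distribution over SRL labels for every token in the sentence.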
And this gives us a semantic role labeling of the sentence with respect to each of the predicted predicates. And this is a nice model because in practice it's really fast: we can actually score all predicates with respect to all tokens, for an entire batch of sentences, using just two big matrix multiplies. So this is particularly fast on specialized tensor processing hardware like GPUs and TPUs. Okay. And this is the LISA model in its entirety. So we have multitask learning across four related tasks, with the signal for syntax shared in this special way through the attention mechanism. I get a really dry throat, so I need to drink a lot of water; if you're not looking at me, that's what those pauses are. Okay.

Okay. So now that I've described the model, I want to highlight some of our experimental results. We evaluate our model primarily on two benchmark datasets for semantic role labeling: the CoNLL-2005 and CoNLL-2012 shared tasks. In this presentation, for the sake of time, I'm going to focus only on a subset of the CoNLL-2005 experiments, but I encourage you to check out the paper or ask me questions at the end if you're interested in the full experimental analysis and results. All right. So we present in-domain experiments on news data and out-of-domain experiments on novels from the Brown corpus. We experiment with both 100-dimensional GloVe embeddings and ELMo contextualized word representations, which come from multiple layers of a bidirectional LSTM pre-trained with a language modeling objective. And today I'm going to present only experiments in the more challenging setting of using predicted predicates. But in the paper, we also run experiments using gold predicates, which is a pretty standard benchmark, so we can compare to more prior work. Our baselines are the state-of-the-art models on PropBank SRL, none of which use syntax; these are highlighted here. And we look at three versions of our own model: the SA model, which is the LISA model except with no syntax; the LISA model when using its own predicted parses; and the LISA model with parses from an externally trained parser, as well as with gold parses. This allows us to see what the upper bound is for how much SRL would benefit from parsing in our model if the parses were perfect.

So here are some results. These here are the previous state-of-the-art numbers on the CoNLL-2005 shared task; these don't use syntax. Okay, so in domain, our SA model already outperforms the state of the art when using GloVe embeddings, and performs comparably to the state of the art with ELMo representations. And we see a similar trend when evaluating out-of-domain. Okay, so our LISA model with LISA parses performs comparably to the syntax-free SA model in domain. We think this suggests that even without syntactic supervision, the SA model is able to learn information that is sort of equivalent to a reasonably accurate parser. In a bit more detail, on parsing accuracy: with GloVe embeddings, the LISA model achieves an unlabeled attachment score, which is basically parsing accuracy, of around 95, which is really high, and with ELMo it's above 96. Out-of-domain, we see that with GloVe embeddings the LISA parses perform comparably to the SA model, while the LISA model using ELMo performs much better than the syntax-free model. And again, this difference is due to the higher accuracy of the LISA parser when using ELMo embeddings.
The LISA parser with ELMo is about three points higher out of domain. At this higher accuracy, the parse is actually providing information that the model can use to generalize better out-of-domain. So this is a really nice result: we see here that having this explicit representation of syntax does help, even over a really strong end-to-end neural network model. And finally, we also experiment with feeding in the external Dozat and Manning parses at test time, and this gives us the best results. Particularly with the GloVe embeddings, we see an overall increase over the previous state of the art by nearly 2.5 F1 absolute in-domain, and more than 3.5 F1 out-of-domain. With ELMo, the increases are a little bit smaller: our best model improves over the previous state of the art by almost 1 F1 absolute in-domain and by more than 2 F1 out-of-domain. And the fact that the in-domain improvement using ELMo is so small compared to out-of-domain is actually, I think, a really interesting result, because it suggests that the ELMo representations are doing a pretty good job of learning the useful parts of syntax for this task, which is why we don't see a huge increase when using syntax in-domain. But those syntactic representations are sort of overfitting to the data that the model is trained on, so out-of-domain, providing external syntax still does help. This is an interesting thing.

Okay. So now I want to do a little analysis. I'm going to switch to development results, and I want to show you how well the model does when it's provided with gold parses at test time. With GloVe embeddings, we see more than a three-point additional improvement over our best model, and nearly 2.5 additional points over our best ELMo model. This is actually pretty substantial, considering the accuracy of the parser we're using, which is 96.5 UAS; that seems like roughly the upper bound of how accurate a parser is going to be. So we did some analysis to figure out what types of errors the gold parses are actually fixing over the predicted parses. Here we performed the same analysis as was originally presented by He and others in their 2017 paper on SRL. What this does is start from the model predictions and incrementally fix common error types until essentially we've fixed all the errors. So one error type example is fix labels, which corresponds to correcting the labels on spans whose boundaries are predicted correctly. And another example is merging or splitting spans: this corresponds to either splitting a predicted span into two gold spans, or, on the other hand, merging two adjacent predicted spans into one gold span. First we see that the most substantial source of errors across all the models is from the labels on the spans. If you just fix the labels on the spans you predicted correctly, you gain about five F1 points absolute. And next we see that using the gold parse helps the model mainly to identify span boundaries much better. All the error types that close the gap between the gold parse and the predicted parse are basically fixing these span boundaries. So this suggests that although the predicted parse's attachment score is really high, its representation of spans is somehow really hurting the SRL model's ability to identify spans. Or, otherwise, attachment score is maybe not really a good metric for evaluating parse accuracy.
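For reference, here is a rough sketch of the incremental oracle-correction analysis just described. The correction functions and the span F1 scorer below are placeholders assumed for illustration; the real span-matching and correction logic is more involved, and this is not the released analysis code.

```python
# Sketch of the incremental error-fixing analysis: apply one oracle correction
# at a time to the dev-set predictions and re-score, so the gaps between model
# curves show which error types separate the models. Placeholder functions only.
def oracle_correction_curve(predictions, gold, corrections, span_f1):
    """corrections: ordered list of (error_type_name, fix_fn); each fix_fn
    returns the predictions with that error type repaired using the gold labels."""
    curve = [("original", span_f1(predictions, gold))]
    for name, fix_fn in corrections:
        # e.g. first fix labels on correctly-bounded spans, then merge/split
        # spans, move arguments, and so on, in a fixed order.
        predictions = fix_fn(predictions, gold)
        curve.append((name, span_f1(predictions, gold)))
    return curve  # the final point approaches 100 F1 once every error type is fixed
```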
So I think this could be some interesting suggestions for future work. So, I've presented LISA. This is a model for multi-task learning where transfer between the tasks is done through the attention mechanism. In this work we demonstrated that this approach works well for SRL, but I also think it could help with other tasks across NLP, and maybe even in computer vision. We've shown that injecting syntax helps even a really strong neural network model, particularly when evaluating out of domain. And we observed that in practice this model is more efficient than alternatives, but part of future work is going to involve quantifying this difference between multi-task learning and a pipeline of models more precisely. So in conclusion, we believe this approach is the foundation for an NLP pipeline that will provide everyone with the accuracy, computational efficiency, and robustness they need to run NLP on the entire web. And the models and code are available online. So this concludes my talk. Thanks for coming, thanks for listening, and I'm very happy to take any questions that you have.

Thank you very much, Emma. Everybody is giving you a standing ovation, but audio is all on mute. First off, let me just compliment you: it is so hard to present without an audience giving you feedback, and you paced it beautifully, so thank you very much. Everybody is on mute. You should be able to get off mute by clicking the little microphone at the bottom of the screen if you have a question. Otherwise, if you're confounded by that, there is a chat window, which you get by clicking the little speech bubble at the bottom, and you can type your question in, and either Drew or I will try to read your questions. So, are there any questions for Emma?

Here's a question from Amada: on the page where you show that fixing various errors improved the model, exactly what do you do to fix the errors? Fix the training data, or something else?

Oh yeah, okay. So this is all on the development set. We take the output of various models. I can go back to that slide if that's helpful. Yeah, so this is the slide that you're talking about. Each of these colored lines corresponds to the output, the labeling on the development set, of a different model, and the models differ in how much syntax, and how correct that syntax is, that they have access to. Then, essentially, we incrementally fix errors: each of the ticks on the x-axis corresponds to an error type. That means we went through those predictions on the development set and we identified all the errors of that type. For example, fix labels means we identified all the correctly predicted spans for each model, we modified the predictions such that the labels on those spans are correct, and then we re-evaluated with respect to this correction. And then the next one is, like, moving arguments around to the correct place; so we again apply that correction and then evaluate the accuracy. So you expect the final accuracy to be essentially 100 F1, because you've artificially fixed all these errors. And when you see the gap between the lines on here, and when that gap closes, it's showing the types of errors that make up the difference between the outputs of these models. Does that make sense?

Yes, thank you. And just to follow up on that: so sometimes if you get the spans wrong, that could also mean the labels are wrong, right?
So how do you separate these two separate errors?

Yeah, I mean, I think this is not a perfect system. It's sort of a heuristic to see what types of errors the model is making. I'm sure you could modify it, do something more complex, basically just have more buckets of error types to help with this. But yeah, it's a good point.

CSD, can you hear me? Yeah. So first of all, great talk, it was interesting. My question was, what kind of effects did you notice depending on what layer you added the syntactically informed attention head and the supervision for part-of-speech tagging and dependency parsing? What kind of effects did you see?

Yeah, this is a great question, and basically we didn't really do that analysis. We treated the layers where we added the supervision for the different tasks essentially as a hyperparameter, subject to some constraints, some sensible things, like we want part-of-speech tags to come before parsing, which comes before semantic role labeling, and things like this. And then we evaluated just using accuracy, or F1, on the held-out dev set. So I've gotten this question a lot, and I think it's a really interesting question, but basically we haven't done a deeper analysis of how that changes the error types and so on. I think it would be really interesting to see, especially given all the interesting model-probing analysis that people have done, especially on these pre-trained language models, looking at the different layers and finding that some of them are better at part-of-speech tagging and later layers are better at semantic tasks and so on. So I think it'd be really interesting, but I don't really have insights on that. Where you put the multitask objectives does change the accuracy by up to a few points. And also we optimized only for SRL performance, unfortunately.

Okay. So when you were doing it, were you using deeper layers or earlier layers in the network for your best models? I'm just curious.

Oh, yeah, yeah. So I can give you some concrete numbers, to get a sense of what the setup was like.

I'm wondering if it would be better to apply more general supervision earlier on, because I think a lot of the multitask papers that use general objectives say that more general objectives should go earlier in your architecture.

Oh, interesting. And more general in this case, how would you rank generality?

Like, NER would be more general than normalization or disambiguation, or part-of-speech tagging would be more general than NER.

Sure. Okay, yeah.

So using those more general objectives early in the network would lead to better results, I think, is what I've seen in other papers.

Yeah, totally. So we were familiar with some of that work, and the constraints that we applied to narrow the hyperparameter space of different possible objectives on different layers did make that assumption. So we definitely had part-of-speech tagging on earlier layers, and then parsing on later layers.
And something that we did find is: we experimented with the CoNLL-2005 dataset and also the CoNLL-2012 dataset, which is a bigger dataset with messier data, a little bit harder, I think. And we found that on CoNLL-2012, for parsing performance to be high, we needed more layers before asking the model to predict parses, compared to the CoNLL-2005 dataset, which is only news and pretty well formed.

All right. Thank you. Thanks. Yeah, that's a good question; I don't have a good answer to it, sorry. Do we have other questions?

I was trying to ask a question, but I spoke on mute. Okay. This is Luis. That's better. All right. So this is my question. If you really think about this in the abstract, what you've done is you've taken, in a sense, low-level information, which is the syntax, and then you're producing higher-level information, which in this case is the semantic role labeling. And we know that as we assemble systems, we tend to go to higher and higher levels. For example, next might be Q&A, and next might be dialogue over data and that sort of thing, right? So my question is: isn't what you're really uncovering here a more general direction of using lower-level structure to guide how to do attention on higher-level tasks?

Yeah, absolutely. I think, yeah, we see this as kind of a different way of doing multitask learning that's a little bit more informed, by injecting more concrete structure from lower-level tasks to assist with higher-level tasks.

Right. So the natural progression would be, well, maybe we can do Q&A better if we can inject this into the layer where attention is used for matching, you know, context and queries, for example, and so on and so forth. Is that kind of the idea here?

Yeah, absolutely. Right. Okay. Cool. Yeah, that's what I thought. Yeah, I would love to collaborate with people who want to use it for higher-level tasks. We're working here on expanding this to coreference right now, which is not quite as high-level as Q&A, but yeah. It makes sense. Thank you.

Other questions? Yeah, I guess my question is: how do you deliver a presentation like that in so little time? What's the trick? Sorry, no, it's just a compliment. It was a very clear presentation in a very little amount of time. That was very well done.

Yeah, thank you so much, I appreciate it. It definitely was not the first time I've given this presentation, and I spent a decent amount of time on it because I knew I'd be presenting it to a lot of people, so I was pretty anxious. But thanks a lot. Yeah, I worked hard on it.

Any more questions? While you're composing them, or still trying to figure out how to get off mute, I'll remind everybody that our next seminar is on Monday, April 22nd at 10 a.m. Pacific, 1 p.m. Eastern. That will be Recent Progress in Adversarial Robustness of AI Models by Pin-Yu Chen from IBM Research. And these should come apace, once a week, if all goes well. Okay, any other questions for Emma?

Also, feel free to email me if you think of something later. Like, I do try to respond to email.

And I'll repeat one thing from the chat.
Our goal is, I mean, we are recording this, and our goal is to post these to the YouTube channel as soon as we figure out how to do that. Posting it to YouTube is trivial; posting it to an IBM YouTube channel probably requires at least one more step. I don't know. Yeah, some kind of approval mechanism. Or just figuring out the magic incantation. So, any other questions for Emma?

Okay, then again, one more time: I want to thank you very much for being our leadoff speaker and doing such an amazing job. Thank you very, very much.

Yeah, no problem. Thanks for inviting me. All right. Thanks, everyone. Thank you. Thank you.