Good morning, good afternoon. I'm Brent Halpern, the scientific director of the AI Horizons Network, and this is our weekly seminar series. Today we have a talk on sentence embedding alignment for lifelong relation extraction by Hong Wang of UC Santa Barbara. Hong is a first-year PhD student in the CS department there. He is interested in lifelong learning and few-shot learning for fundamental problems such as relation extraction. I'll let everybody know that you're coming into the WebEx muted, and we generally suggest holding questions until the end. If you do need to interrupt, you'll need to unmute yourself, or you can post a question in the chat if you want to make sure we remember it. The unmute button is the little microphone in the menu at the bottom, and the chat is the speech bubble in the same menu. So without further ado, Hong Wang.

Hi, thank you for the introduction. It's my pleasure to share with you our NAACL paper on sentence embedding alignment for lifelong relation extraction. This is joint work with Wenhan Xiong, Mo Yu, Xiaoxiao Guo, Shiyu Chang, and William Wang.

First, let's talk about the task of relation extraction. This task aims to automatically extract the relation expressed in a given sentence, and it has been widely applied in many downstream tasks, such as question answering. Here is an example. Given the question "Where was Obama born?", the system can extract the relation mentioned in the sentence, which is born-in. We can then use the head entity, Obama, and the extracted relation, born-in, to query a knowledge graph and get the answer to the question.

Although relation extraction has been widely applied in real-world applications, conventional approaches to this problem assume a fixed set of training data and apply a once-and-for-all training pipeline. This becomes a problem when new relations are emerging very quickly. So what can we do when there are new relations? Conventional approaches have to retrain a new model to fit the new data: they mix the data from the new relations with the existing training data and retrain the whole model on all of it. This approach is doable, but it is far from ideal. If you have a lot of training data, training a new model can take a long time, and if new relations arrive every hour or every day, repeatedly retraining the model is also very costly.

We think one possible solution to this problem is to apply lifelong learning. So what is lifelong learning? Lifelong learning addresses the setting where new relations or tasks keep coming in. It trains a model on a set of tasks in sequence without forgetting the knowledge learned on previous tasks. For example, here is a model F and a set of tasks: task one, task two, up to task N. The model F first learns task one, then task two, and so on. At each step, the model F only needs to learn from the data of the new task. Compared with retraining a new model on all the data, this needs much less computation and saves a lot of time.
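To make that protocol concrete, here is a minimal sketch of the sequential train-then-evaluate loop, assuming hypothetical train_on_task and evaluate helpers rather than code from the paper:

```python
def lifelong_train(model, tasks, train_on_task, evaluate):
    """Train on tasks in sequence; track average accuracy over tasks seen so far."""
    history = []
    for t, task in enumerate(tasks):
        # At each step the model only learns from the new task's data.
        train_on_task(model, task["train"])
        # Evaluate on every task observed so far; forgetting shows up as a drop here.
        seen = tasks[: t + 1]
        history.append(sum(evaluate(model, s["test"]) for s in seen) / len(seen))
    return history
```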
Here is another example to show how this works for relation extraction. Consider three relations and a randomly initialized model F. The model F is first trained on the data of the first relation, and we get an updated model. When we have a new relation, say nationality, we train the model on the data from that new relation and get another updated model, and when yet another new relation arrives, we train on that one as well. After we have trained the model on all the relations, we can test it on all the relations to measure its performance.

Lifelong learning is actually natural for human beings. For example, when you learn a new language such as HTML, you do not forget the knowledge or skills you learned in your C++ class. But for machines, and especially for the widely used neural network models, lifelong learning is hard. We know that a neural network stores knowledge in its weights. When we adapt the network to a new task, its weights are changed to fit the new task, and the changed parameters may no longer fit previous tasks. When we learn task two, the model may no longer fit task one, and when we learn task three, the adapted parameters may no longer fit either of the previous tasks. This phenomenon is called catastrophic forgetting. It is severe for neural network models because we use gradient descent updates of the parameters to fit the new task, and those updates can easily make the parameters unsuitable for previous tasks.

So what can we do to make neural network models work in the lifelong learning setting? There is some previous work. One idea is to adjust the learning rate per parameter, decreasing the learning rate of the parameters that are important for previous tasks. This approach is called EWC. It uses the Fisher information matrix to measure the importance of each parameter to a previous task, and it adds an extra term to the loss function: if a parameter moves far from its original value and its Fisher information is large, it incurs a large penalty. By introducing this term, EWC effectively decreases the learning rate on the important parameters, since the loss becomes large if those parameters change.

You may notice that this approach does not use any memory of samples from previous tasks. But if we have a memory and can save some samples from previous tasks, we can do better. This is a reasonable assumption, since we usually do have memory to store some samples, and for humans it is also natural to use previous experience to help with a new task. One such approach, published in 2017, is called GEM (Gradient Episodic Memory). The key idea is to save some samples from each previous task, and when we learn a new task, we want the updated gradient, g-tilde, to benefit not only the new task but also the previous tasks. The assumption they use is that if the angle between the updated gradient and the gradient on a previous task is within 90 degrees, the update will not hurt that task. So they write their optimization problem in this form: after computing the gradient on the new task and the gradients on the previous tasks, they look for an updated gradient that is as close as possible to the gradient on the new task, subject to the constraint that its dot product with the gradient on each previous task is non-negative, so that the update also does not hurt the previous tasks.
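Written out, the projection step they describe solves a small quadratic program of roughly this form, where g is the gradient on the new task t and g_k is the gradient on the stored samples of previous task k (a reconstruction from the description above, not a formula taken from the slides):

```latex
\min_{\tilde{g}} \ \tfrac{1}{2}\,\lVert g - \tilde{g} \rVert_2^2
\qquad \text{s.t.} \quad \langle \tilde{g},\, g_k \rangle \ge 0 \quad \forall\, k < t
```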
This kind of approach has a problem: it is computationally expensive, because at every update we not only need to compute the gradient on the new task, but also the gradient on each previous task. As time goes on, we may have many more previous tasks, hundreds or thousands of them, and then each update becomes slow, since there are many tasks whose gradients need to be computed.

So can we do faster than GEM? The answer is yes. A-GEM was proposed to solve exactly this problem. Instead of computing the gradient on each previous task, it mixes the data from all previous tasks together, regards them as one combined task, and samples data from that mixture. So there is only one constraint: the updated gradient should stay within a right angle of the gradient of this combined previous task, and with a single constraint the problem has a closed-form solution. Compared with GEM, A-GEM only needs to compute one additional gradient, on the data from all previous tasks, so it is much faster, and it can achieve performance similar to GEM.

But we wondered whether the projection operation is really necessary. If we look at the projection, what it really does is find the gradient that is most similar to the current one while not violating the constraints on previous tasks too much. In other words, the updated gradient mostly benefits the current task. We think such a constraint on previous tasks may not be strict enough, since it only requires the angle to be within 90 degrees, so performance on previous tasks may still drop. Here we propose another, simpler approach: instead of projecting the gradient, we use the average of the gradient on the new task and the gradient on previous tasks to update the model.

It is worth mentioning that there are two ways to sample from previous tasks. One way is to sample from all previous tasks by mixing their data together. The other is to sample one previous task and use data from that particular task only. The performance of the two approaches is similar, but the first one may not always be applicable. For example, some tasks may have different input dimensions, in which case you cannot mix data from different tasks into a single batch. Then you have to choose the second approach, so that each batch only contains data from a single task.

We also propose two new benchmarks for the lifelong relation extraction task. Based on the FewRel and SimpleQuestions datasets, we use k-means to cluster the relation names into several groups. For the FewRel dataset, we cluster the 80 relations into 10 groups, and we cluster the relations in SimpleQuestions into 20 groups. Each group of relations is regarded as a task in our setting.

So how good is our simple baseline, EMR? We compare this baseline with GEM and A-GEM on our proposed lifelong relation extraction benchmarks. It turns out that this simple baseline outperforms both GEM and A-GEM.
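As a minimal sketch of the averaged update just described, assuming a PyTorch-style classifier (the helper names and batch handling here are illustrative, not the paper's code):

```python
import torch

def emr_step(model, optimizer, loss_fn, new_batch, memory_batch):
    """One EMR update: average the new-task gradient and the memory gradient."""
    x_new, y_new = new_batch      # batch from the current task
    x_mem, y_mem = memory_batch   # batch replayed from the stored samples
    optimizer.zero_grad()
    # Averaging the two losses averages the two gradients (by linearity),
    # so no projection step is needed, unlike GEM / A-GEM.
    loss = 0.5 * (loss_fn(model(x_new), y_new) + loss_fn(model(x_mem), y_mem))
    loss.backward()
    optimizer.step()
```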
Even on the popular benchmarks in computer vision, like MNIST rotations, MNIST permutations, and CIFAR-100, our approach achieves fully competitive performance with the state-of-the-art method GEM. So we can see that the projection is not really necessary: on our lifelong relation extraction task, A-GEM actually performs much worse than our simple baseline that does not use the projection.

The question then is, can we do better than this simple baseline? Let's have a closer look at catastrophic forgetting. Here, the dots on the plot are the embeddings of sentences. When we learn a new task, the embeddings of sentences from previous tasks are moved to a new region, shown as the pink dots here. Since the embeddings for the previous tasks change, their performance drops as well, and this causes catastrophic forgetting. So we asked: can we explicitly model the change of the embeddings, so that we can move them back to their original positions, where the model performed best?

Here we propose to add another operation, called alignment. When the representation of a sentence from a previous task changes, we use the alignment operation to map the changed representation back toward its original position. Here, the red dot is the aligned position, which is much closer to the original position than the position reached after training on the new task. In this way, since we align the embeddings of previous tasks back to their original positions, we hope their performance will not drop much.

Specifically, how do we achieve this? Here is our alignment objective. This objective function has mainly two parts. The first part is the classification error of the problem. Then we add an additional loss, which we call the distortion loss. This loss aims to minimize the distance between the current embedding of a sentence and its previous embedding; its purpose is to align the embedding back to its original position. To train this objective, we propose a two-step training procedure. In the first step, we train the model on the new task and the saved samples to minimize the classification error; after this, the embeddings of the stored sentences may have been altered. In the second step, we perform the explicit alignment, aligning the embeddings of the samples from previous tasks back to their original positions.

Here is the model we use for the relation extraction task. The lower part is the model that produces the embedding for a sentence, and on top of that we add the alignment model to perform the explicit alignment. In the first step, we train the basic model to minimize the classification error, and after that, we train the alignment model to minimize the distortion loss, mapping the embeddings of the sentences from previous tasks back to their original embeddings.

It is also worth mentioning a key component of our algorithm: we need to select samples to store for each task. Here we propose to use k-means to choose the samples. For example, if we need to choose 50 samples for a task, we first cluster all the samples of that task into 50 groups, and the sample nearest to the center of each group is chosen and saved in the memory.
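A minimal sketch of this k-means selection, assuming scikit-learn and an (N, d) array of sentence embeddings from the encoder (the function and variable names are illustrative, not the paper's code):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_memory_samples(embeddings, n_select=50):
    """Cluster sentence embeddings; keep the sample nearest each cluster center."""
    km = KMeans(n_clusters=n_select, n_init=10).fit(embeddings)
    chosen = []
    for c in range(n_select):
        members = np.where(km.labels_ == c)[0]  # samples assigned to cluster c
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        chosen.append(int(members[np.argmin(dists)]))  # closest to the center
    return chosen  # indices of the samples to store in memory
```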
For the experiments, we compare our method with several baselines mentioned before, like GEM and EWC. We also compare with the original baseline, which does not use any additional operation. We can see that our method performs much better than the other baselines, and on the SimpleQuestions benchmark our method shows little drop, maintaining high accuracy as it sees more tasks. Here the x-axis is the number of tasks seen so far, and the y-axis is the average accuracy over all the observed tasks. This table lists the concrete accuracy numbers for these methods at the last step. As you can see, the two key components of our method, the sample selection and the alignment model, each bring some improvement over the basic EMR model, and using the whole model gives the best performance in most cases.

We also conducted experiments to compare different methods for selecting samples. We compared our k-means selection with random selection, which is the plain EMR, and with another approach, iCaRL, which selects the samples that best approximate the distribution of the original data. We see that our k-means selection has the best performance in our experiments.

The main conclusions: in this paper, we introduce lifelong learning into the relation extraction task. We think this is more practical, since new relations keep emerging in many practical problems. We propose a baseline called EMR, which is quite simple: it uses the average of the gradient on the new task and the gradient on previous tasks to update the model. Surprisingly, this simple baseline outperforms the current state-of-the-art methods GEM and EWC. On top of EMR, we propose sentence embedding alignment, which explicitly performs alignment in the embedding space to recover the embeddings for previous tasks. We show that this further alleviates the catastrophic forgetting problem. So, thank you for your attention, and any questions?

Thank you very much, Hong. I appreciate you giving that very clear talk. A reminder to questioners that you'll need to unmute yourself by clicking the little red microphone in the menu at the bottom of the screen, and that the talk is being recorded and will be posted on YouTube, so if you're from IBM, please keep your questions or comments non-confidential. I see Ho Chi-Al is unmuted. Do you want to start?

Yeah, this is a question from IBM Almaden. We are getting a couple of people together here. Okay, this is Sanjana. For selecting samples from previous tasks, is there a limit or threshold on how many samples you take from previous tasks versus how many samples you keep from the current task, so that it's still task relevant?

Do you mean how many samples we keep for each previous task?

Yeah, each previous task versus the current task itself, so that it's still task relevant.

Okay, so in our experiments, we keep 50 samples for each previous task.

Can you repeat that?

In our experiments, we keep 50 samples for each previous task.

50 samples?

Yeah, 50 samples for each previous task. It's very small compared to the whole data from that task.

Okay, in terms of percentage, how would that be split?

Sorry, could you repeat that?

In terms of percentage, how would it be split across tasks? Is it the same for every task compared to the current one?

Yeah, so for the current task?

How much weight do you give to each task while you pick samples from it?
Is it evenly distributed? I'll send you an email later, but the next question is: how do you decide on the K in k-means for each task?

Okay, so for the k-means, we run k-means on the embeddings, like here. We use the embedding our model produces for each sentence to run k-means. If we want to choose 50 samples for a previous task, we cluster all its samples into 50 clusters and choose the sample closest to the center of each cluster to store in the memory.

Okay, thank you. So was that K size arbitrary, or was it somehow tuned to the dataset? The K being how many clusters.

Yeah, I think it depends on your memory, on how much data you decide to store in it. For example, in our experiments we use 50 samples for each task, so we cluster into 50 clusters.

Okay, any other questions? Going once, going twice? All right, again, thank you very much for giving the presentation, and thank you to the folks who stayed on. We don't have a seminar scheduled yet for next week, so my guess is we probably won't have one, but if we do, we'll send it out via the usual mechanisms. Our next scheduled seminar is June 24, but I'm hoping we'll have at least a couple more between now and then, realizing that summer schedules are a little harder for everybody. So again, everyone, thank you very much, and thank you very much for doing the presentation.