Right. Hello. I'm Brett Halpern, the Scientific Director of AI Horizons Network, and this is our weekly AIHN Seminar series. Today we have Jungo Kasai from the University of Washington talking about low-resource deep entity resolution with transfer and active learning. Jungo is a PhD student in CSE at the University of Washington, advised by Noah Smith, and he works in natural language processing and machine learning, and this is from ACL 2019. So Jungo, take it away.

Okay. Thank you for the introduction. All right. So I'm presenting low-resource deep entity resolution with transfer and active learning today. I originally presented this work at ACL 2019 in Florence, and I conducted this research while I was an intern at IBM Research Almaden last summer. This is joint work with Kun, Sairam, Yunyao, and Lucian from IBM Research Almaden.

Okay. So here are the contributions of our work. We propose a deep learning-based entity resolution model that allows for easy transfer learning from high-resource to low-resource scenarios, and we also establish an active learning algorithm to further adapt the transferred model to the target scenario. And lastly, we provide extensive empirical evaluations over benchmark datasets and demonstrate that our system achieves state-of-the-art performance while using only an order of magnitude fewer labels.

Okay. So first, some background and introduction about entity resolution. Roughly speaking, entity resolution, or ER, is the task of identifying different representations of the same real-world entities across different databases. So for example, we have citation records in the ACM database, and we also have citation records in the Semantic Scholar database, and we want to find which record in ACM corresponds to which record in Semantic Scholar. Here's another example. We have Twitter users and Facebook accounts in the world, and we want to know which Twitter user corresponds to which Facebook user. And why do we want to do this? Applications of entity resolution include knowledge base creation, text mining, and social media analysis. Incidentally, entity resolution, or ER, goes by many different names in the literature, such as record linkage, merge/purge, and entity matching, but the fundamental problem behind them is pretty much the same.

Okay. So you might think ER can be done simply by looking at string similarities. However, it turns out the world is very complicated, and there are several challenges in entity resolution. First, we have name ambiguity, such as Michael Jordan the basketball player and Michael Jordan the machine learning researcher. This is very challenging, for example, when you disambiguate Chinese names in scientific papers, because there is a lot of overlap. There are other problems like data entry errors, missing values, and abbreviations. We have to know, for example, that ACL stands for the Association for Computational Linguistics when we deal with scientific papers, not Austin City Limits, etc. And in fact, if you're an NLP researcher, you know the pain of Googling ACL or looking up ACL in any search engine: we always get these different entities, like the music festival and the soccer league, and we have yet to find a solution to this problem when we do entity lookups.

Okay, so in the literature, there are two major strands of entity resolution models. One is rule-based systems, and the other is learning-based models.
In rule-based systems, we define similarity functions between strings to make matching decisions over entity representations. Learning-based methods, on the other hand, train statistical models, such as SVMs and decision trees, on training data and apply them to whatever evaluation data you want to use your entity resolution model on.

Okay, so in this work, we focus on the machine learning-based approaches, and in particular on recent deep learning-based solutions. These models have the benefit of avoiding the need to define features for every single entity resolution scenario, unlike other machine learning-based methods, because we use distributed representations instead of feature engineering. However, deep learning-based models require many labeled examples to outperform other learning algorithms, such as SVMs and naive Bayes. As you can see in this plot, only after we get enough labels does the deep learning-based method, plotted in blue here, outperform the others. So to address this issue of data hungriness, this work establishes a novel framework for low-resource deep entity resolution.

Now, before discussing approaches, we formalize the problem of entity resolution. Given two sets of record collections from two databases, D1 and D2, we classify each pair (t1, t2) in the Cartesian product D1 × D2 into matches and non-matches. Each of t1 and t2 is represented as a tuple with attributes, as you can see in this table. In the case of citations, we have several attributes, such as author, title, and year, and using these inputs we want to identify the same entities across databases. The two databases D1 and D2 can even be the same single database, in which case we have a special case of entity resolution called duplicate detection within a single database.

Conventionally, we solve this entity resolution problem in two steps. The first step is called blocking, where we reduce the Cartesian product by filtering out obvious mismatches, because the Cartesian product is too huge and we want to make it tractable. For example, we can eliminate tuple pairs of papers that have different year attributes. The second step is called matching, where we classify the remaining pairs in the candidate set after blocking to identify actual matches. Our work focuses on this matching step, and we fix the blocking strategy in each ER scenario regardless of the matching method we use. This ensures that we make fair comparisons across different matching methods, because we use the same blocking strategy regardless of the method.
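To make the blocking step concrete, here is a minimal sketch in Python that prunes the Cartesian product with the cheap year-agreement heuristic just mentioned, before any matching model runs. The function and field names are illustrative assumptions, not the exact blocking strategies used in the paper's benchmarks.

```python
# Minimal blocking sketch: prune the Cartesian product with a cheap heuristic
# before running the (expensive) matching model. The year-equality rule and
# the field names here are illustrative, not the paper's exact blocking setup.

def block_by_year(records_d1, records_d2):
    """Keep only pairs whose 'year' attributes agree (or are missing)."""
    candidates = []
    for t1 in records_d1:
        for t2 in records_d2:
            y1, y2 = t1.get("year"), t2.get("year")
            if y1 is None or y2 is None or y1 == y2:
                candidates.append((t1, t2))
    return candidates

d1 = [{"title": "Attention Is All You Need", "year": 2017}]
d2 = [{"title": "Attention is all you need.", "year": 2017},
      {"title": "Neural Machine Translation", "year": 2015}]

print(len(block_by_year(d1, d2)))  # 1 candidate pair survives blocking
```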
Now, our framework for low-resource deep entity resolution consists of two overall approaches. The first one is transfer learning, where we develop a deep learning model that allows for transfer learning from a source scenario to a target scenario that has no labeled data. We can simply train a deep learning entity resolution model on the source dataset and then use the same network parameters for the target scenario. The second is active learning, where we iteratively label a small number of informative samples from the target scenario.

We first present our transfer learning approach. We take a deep learning architecture similar to Mudgal et al. (2018) to compare a pair in the candidate set after blocking. So for example, tuple A comes from DBLP and tuple B comes from Google Scholar, and we want to compare those tuples across these two databases, DBLP and Google Scholar.

For each attribute, such as the author attribute, we construct vector representations by using bidirectional GRUs on top of fastText pre-trained vectors. The fastText vectors are obtained from subword representations, which is particularly important to address the out-of-vocabulary problem that appears frequently in databases, such as proper nouns and technical terms. So for instance, we get a vector representation for "Alan Turing" in tuple A, and we also get a vector representation for the corresponding author string in tuple B using the same bidirectional GRUs. These representations reflect subword information, so even if the words are out of vocabulary, we still get proper vector representations.

Now, once we get attribute vectors, we compare the two tuples by taking the absolute difference between these two vectors. We make these comparisons across all attributes, not just the author, such as title and year in the citation genre. And once we get those vectors, we add the similarity vectors together to represent the overall similarity between the two tuples. This addition operation is critical because it ensures that the final dimension is the same regardless of the number of attributes we have in the scenario, and this allows us to transfer the network parameters between scenarios that have different numbers of attributes. So for example, Google Scholar has four attributes while the Cora scenario has additional attributes, such as the publisher name, in addition to author names and year, but we can still transfer all network parameters between these two scenarios without any modifications to the architecture. Finally, we pass this similarity vector to the feedforward matching classifier and predict a match or non-match, and we can train this network by simply maximizing the log likelihood of the labeled training data.

Now, we can apply the domain adaptation technique for transfer learning from the computer vision literature on top of these simple building blocks. In particular, in addition to the matching classifier, we pass this vector to a feedforward domain classifier as well. And importantly, we insert a gradient reversal layer between the domain classifier and the rest of the network. This ensures that the domain classifier is trying to distinguish the source and the target scenarios while the rest of the network is trying to fool the classifier, meaning that it is encouraged to develop domain-independent representations, which are useful for transfer purposes because they don't distinguish between different scenarios.

Can I ask a question? Yeah, sure. In the previous slide, you assume that the attributes in the two databases have already been matched beforehand? Yes, that's right. Yeah. Okay, so in every scenario you have to do some sort of schema matching, right? You're right. But the target scenario could have a different schema, right? Yeah, like I said. Okay. So, right. So, good.
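Before moving to the results, here is a minimal PyTorch-style sketch of the architecture just described: BiGRU encoding over subword-aware word vectors, absolute-difference comparison per attribute, summation across attributes, a feedforward matching classifier, and a feedforward domain classifier behind a gradient reversal layer. This is an illustrative sketch under assumptions, not the authors' implementation; the dimensions, the single shared encoder, and all names are made up for the example.

```python
# Hedged sketch of a transferable ER model: per-attribute BiGRU encoders over
# (assumed pre-computed) fastText-style subword embeddings, |difference|
# comparison, summation across attributes, and a matcher plus a domain
# classifier behind a gradient reversal layer. Dimensions are illustrative.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back to the shared encoder.
        return -ctx.lamb * grad_output, None

class ERModel(nn.Module):
    def __init__(self, emb_dim=300, hid=150):
        super().__init__()
        # One shared BiGRU encoder for brevity; a per-attribute setup is also possible.
        self.gru = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)
        self.matcher = nn.Sequential(nn.Linear(2 * hid, hid), nn.ReLU(),
                                     nn.Linear(hid, 2))
        self.domain_clf = nn.Sequential(nn.Linear(2 * hid, hid), nn.ReLU(),
                                        nn.Linear(hid, 2))

    def encode(self, emb_seq):
        # emb_seq: (batch, seq_len, emb_dim) subword-aware word vectors
        _, h = self.gru(emb_seq)                 # h: (2, batch, hid)
        return torch.cat([h[0], h[1]], dim=-1)   # (batch, 2*hid)

    def forward(self, attrs_a, attrs_b, grl_lambda=1.0):
        # attrs_a / attrs_b: lists of embedded attribute sequences for tuples A and B.
        # Summing the attribute-wise |difference| vectors keeps the similarity
        # vector the same size no matter how many attributes a scenario has.
        sim = sum(torch.abs(self.encode(a) - self.encode(b))
                  for a, b in zip(attrs_a, attrs_b))
        match_logits = self.matcher(sim)
        domain_logits = self.domain_clf(GradReverse.apply(sim, grl_lambda))
        return match_logits, domain_logits

# Toy usage: two attributes (e.g. author, title), a batch of 4 pairs.
model = ERModel()
attrs_a = [torch.randn(4, 6, 300), torch.randn(4, 9, 300)]
attrs_b = [torch.randn(4, 5, 300), torch.randn(4, 9, 300)]
match_logits, domain_logits = model(attrs_a, attrs_b)
print(match_logits.shape, domain_logits.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```

The matching logits would be trained with a cross-entropy (negative log likelihood) loss on labeled source pairs, while the domain classifier sees both source and target pairs; the gradient reversal makes the shared representation less scenario-specific.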
So now we present our transfer learning results. Here, we assume we have no target labels. We can see that by adding the domain adaptation technique, we generally improve transfer learning performance on the target databases: DBLP-ACM, DBLP-Scholar, and Cora. However, there is still a huge gap between the transfer learning results and the results we get by training the network with all target labels. Remember that for the transfer learning results, we don't assume any target labels.

Now, this motivates us to explore active learning as a method to further refine the transferred model and close the gap between the high-resource and low-resource scenarios. So we present active learning. Again, active learning further refines the transferred model by iteratively labeling informative examples from the target. But there are two major issues in achieving this goal. The first problem is that we need substantial labeled examples to tune the transferred deep learning model without overfitting to a small subset of labels. The second problem is specific to entity resolution: since the distribution of entity resolution labels is skewed toward the negatives, or non-matches, it is hard to find false negatives and improve the recall of the system. Of course, we have way more non-matches than matches when we compare different databases, even after blocking.

In order to address these issues, we propose an entropy-based sampling method. Let H(x) be the entropy of the prediction on input x given by the current model. Then the uncertain examples of size K can be defined as the set of pairs with top-K entropy. We can manually label those uncertain, and therefore informative, examples, tune the current model on these new examples, and repeat this process iteratively. This is sort of the most naive way of doing active learning for entity resolution. However, the two problems we just mentioned still remain: again, we need substantial labeled examples to tune the transferred network parameters, but manual labeling does not scale, and this tuning method will suffer in recall because of the nature of entity resolution labels.

Now, in contrast to the uncertain examples, we can also define high-confidence samples as the set of pairs with bottom-K entropy. We can manually label the uncertain, and therefore informative, examples while automatically labeling those high-confidence samples with the model predictions as a proxy. We can then tune the model on these examples and repeat this process iteratively. This solves the first problem we just discussed: by adding high-confidence examples with automatic labels as a proxy, we can avoid overfitting only to the manually labeled examples. But problem two persists: such tuning will still lead to a model with low recall, because entity resolution examples are skewed toward the negatives, or non-matches.

In order to solve this problem, we further partition the high-confidence and uncertain examples by the current model predictions. In particular, we define high-confidence positives, high-confidence negatives, uncertain positives, and uncertain negatives, using the entropy and the current model predictions. These four classes correspond to likely true positives, likely true negatives, likely false positives, and likely false negatives. For example, pairs that have high entropy and are classified as positive by the current model are likely false positives. So we sample equally from these four classes of pairs we just defined and tune our model on the sampled examples. We later show that this sampling strategy, with partitioning with respect to the four classes, actually helps our active learning substantially.
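To make one iteration of this sampling procedure concrete, here is a small hedged Python sketch of the four-way partitioning: high-entropy pairs go to a human annotator, low-entropy pairs are auto-labeled with the current model's prediction as a proxy, and both predicted positives and predicted negatives are sampled equally. The scoring function, the 0.5 threshold, and K are illustrative assumptions, not the exact settings from the paper.

```python
# Sketch of one active learning iteration with entropy-based, four-way
# partitioned sampling. `predict_proba` stands in for the current model's
# match probability on a candidate pair; all names here are illustrative.
import math
import random

def entropy(p):
    """Binary entropy of a match probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def partitioned_sample(pairs, predict_proba, k):
    scored = [(pair, predict_proba(pair)) for pair in pairs]
    scored = [(pair, p, entropy(p)) for pair, p in scored]

    positives = [s for s in scored if s[1] >= 0.5]   # predicted matches
    negatives = [s for s in scored if s[1] < 0.5]    # predicted non-matches

    by_entropy = lambda s: s[2]
    # High entropy + predicted positive -> likely false positives, and so on.
    likely_fp = sorted(positives, key=by_entropy, reverse=True)[:k]
    likely_fn = sorted(negatives, key=by_entropy, reverse=True)[:k]
    likely_tp = sorted(positives, key=by_entropy)[:k]
    likely_tn = sorted(negatives, key=by_entropy)[:k]

    # Uncertain pairs go to the human annotator; high-confidence pairs are
    # auto-labeled with the model's own prediction as a proxy label.
    to_label_manually = [s[0] for s in likely_fp + likely_fn]
    auto_labeled = [(s[0], int(s[1] >= 0.5)) for s in likely_tp + likely_tn]
    return to_label_manually, auto_labeled

# Toy usage with a stand-in scorer.
random.seed(0)
pairs = [f"pair_{i}" for i in range(100)]
manual, auto = partitioned_sample(pairs, lambda _: random.random(), k=5)
print(len(manual), len(auto))  # 10 10
```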
So now we present our results. We focus on the citation genre here because we have extensive benchmark datasets. The x-axis here indicates the number of examples we manually label, and the y-axis indicates the F1 score of the system on the same evaluation data. We see that the deep entity resolution model with transfer and active learning, which is plotted in blue here, achieves the best performance with faster convergence compared to the other machine learning-based methods, and also compared to the vanilla deep learning method plotted in red here. Again, we see that the deep learning model is data hungry and underperforms SVM in this low-resource setting where we only have 1,000 labeled examples for the target. However, by adding active learning, we significantly outperform SVM, and in particular with the combination of transfer and active learning. Active learning alone already yields performance better than SVM and the other machine learning algorithms we tried in the paper, but when we combine transfer and active learning together, we get the best performance.

Now, you might say that SVM could also perform much better than this by incorporating some sort of active learning; there are many active learning methods for SVM in the literature. But our results show that deep entity resolution with transfer and active learning even outperforms SVM with full training data. This suggests that the deep entity resolution model with transfer and active learning is strictly better than the SVM performance, since the performance with the full labeled training data can serve as sort of an upper bound.

The same patterns hold for other scenarios, such as the Cora citation dataset, and we see a significant performance gain from transfer and active learning. We also test our models on genres other than citations. In the restaurant genre, we again see that transfer learning alone doesn't suffice to yield a reliable system, but when combined with active learning we get better performance. In the software genre, we again see that deep learning with active learning, plotted in blue here, achieves the best performance in a low-resource setting. You can see more results for other genres in our paper.

So now we want to give an analysis of our approach. We presented several techniques, and we want to know how effective each of them was. The first question is how important our sampling strategy was in active learning, so we compare different sampling strategies here. The first strategy is to just take the top-K entropy pairs, annotate them manually each iteration, and tune the model on these manually labeled examples. If we partition these manually labeled uncertain examples into likely false positives and likely false negatives, and sample equally from these two classes instead of sampling them as one pool, we get a performance boost from this partitioning. Specifically, we see that recall improves dramatically while keeping the high precision. By combining both the partitioning and the high-confidence sampling, or high-confidence automatic labeling, we just discussed, we get the best performance both in terms of precision and recall. This means that, in addition to the partitioning mechanism we just introduced, the automatic labeling also helps the deep learning model tune better without overfitting.

We can also look at the actual breakdown of the manually labeled samples using the gold labels. When we run the experiments we don't assume any gold labels, but for this analysis we can use them. The partitioning indeed helps us find more balanced samples here.
But interestingly, we sample more true positives. Ideally, we want to find the false negatives and false positives here, but by doing this partitioning we sample more true positives, because we aggressively choose likely false negatives. And again, these false negatives are generally challenging to find because of the nature of entity resolution, where the labels are skewed toward the negatives, or non-matches.

So in conclusion, our deep entity resolution model yields competitive performance to the state of the art while using an order of magnitude fewer labels. We also saw that transfer learning alone doesn't suffice to construct a reliable entity resolution system, but when combined with active learning we get stable and high performance. This work provides further support for the claim that deep learning can provide a unified integration method even in low-resource settings, without the need for feature engineering for every single ER scenario we have. For future work, we are interested in applying our methods to more complicated scenarios and genres, and also to other languages, and we are also interested in applying the transfer and active learning framework to more problems beyond entity resolution. Lastly, we thank Sid Mudgal, Vamsi Meduri, and others for their help with this work. And I'm happy to take questions. Thank you.

Thank you, Jungo. Thank you very much. If you want to ask a question, please unmute yourself; there's a little red microphone icon at the bottom of your screen. And I see Hamid's already unmuted, so Hamid, do you want to start?

Yeah, yeah. So can you also use this active learning method for fusing the entities? So far you connected them, matched them, but now I have to fuse them, right? You know, take attributes from different places that are similar and combine them. And that's a manual process that's very, very expensive. So can you use the active learning with your sampling to reduce the cost there? Can I use sampling or active learning to do sort of schema matching, is your question? No, no, the fusion part. So all you did was the linking, right? But at the end, I have to combine the attribute values from different places, which are similar as well, and combine them into one result. Yes, yes. They're connected. Got it. So in this work, we haven't tried it. We used benchmark datasets that have already done the fusion process you just mentioned. But in principle, yeah, we should be able to do active learning not just for the linking part but also for the fusion part. But in reality, it's sometimes challenging because there are different types, so you have to tune the model well. But yeah, that's an interesting question. We haven't done it yet.

Any other questions? Please unmute yourself. Okay, I have one. So you were matching across two publication databases. Is there a sensitivity to those being relatively similar data sources? I mean, you talked about schema mapping and you've got to do some of that. If you did publications and resumes, the same information is there. Is it going to work just like two publication databases, or is the heterogeneity going to matter? Yeah, that's a good question. So one of the actual advantages of the deep learning methods is that you don't have to do feature engineering. Well, you do have to tune the model, you know, tune the hyperparameters.
But you don't have to design different features for different scenarios. So we hope that this method will work better when it comes to more textual data. Actually, there is some prior work showing that deep learning methods outperform other learning-based methods substantially when they work on textual data. So instead of having relatively structured data like titles, years, and authors, we can have more textual data like product descriptions, which can be messier, and we think that this method would scale better to that kind of highly textual, relatively unstructured data. But that's future work, and we are lacking nice benchmark datasets for it right now; it would be great if we could try it even on private data.

So I've got another question for you. Yep. Okay, so did you consider any kind of confidence scores from any of the sources? So for example, say Google Scholar's confidence in a data record. Did you use confidence scores for active learning or, I don't know, for the matching part? For the matching part, we didn't. I see. So we only, yeah, we train this model on the negative log likelihood, so that's sort of the only place where a probability comes in. In the active learning part, we also looked at the confidence levels and uncertain examples, but we are not making use of those source scores.

Yeah, so in the active learning part, does that influence your sampling, or does an expert look at, you know, high confidence versus low confidence, or is it automated? Right, right, it's automated. So we have this transfer model, and the transfer model gives you the scores on the pairs that we don't have any labels for. So it's automatic, it's completely automatic. Now, we could also look at the correlation with actual human annotations. Humans can also give confidence scores, and we could compare those with the model predictions; that would be interesting. Maybe in some cases it's very easy for the deep learning model but very hard for humans, or something like that. But we haven't done that.

Okay, any other questions? If not, Jungo, thank you very much for the very cool presentation and the good slides. There will not be a seminar next week because the AI Horizons Network is part of a research week in Cambridge. We hope to continue the week after that. We're going to try to also include talks about datasets and tools that can be shared with our colleagues at the universities. So stay tuned, and we'll have an announcement out hopefully soon for the next seminar. Thank you. Thank you.