And we're live. Hello, good evening or good morning, depending on where you are. Welcome to the March research showcase. My name is Miriam. I'm a research scientist on the Wikimedia Foundation research team. Today at the research showcase we have two amazing presentations from Thomas and Besnik, and in these hangouts we have a bunch of other members of the research team: Jonathan, Isaac, Leila, and Diego. So here's what's going to happen: we're going to have two presentations, and after each presentation we are going to have a few minutes for questions. And then maybe at the end we can have a discussion if we have some other questions for the speakers. First up today is Thomas, a PhD student from Télécom ParisTech, who's going to present his recent work about learning how to correct Wikidata. This is work which he is going to present at the Web Conference in May. And so with that, Thomas, the stage is yours.

Thank you. Let me set up my slides. There we go. So I'm going to present our paper, which we called Learning How to Correct a Knowledge Base from the Edit History. We did most of the experiments of this paper on Wikidata, so I'm going to focus the presentation mostly on Wikidata. But in the actual paper, we present a formalism that could apply to most knowledge bases, so if you are interested in something more general than Wikidata, please have a look at it. In this presentation, and this is joint work with Camille and Fabian, I'm going to focus more on Wikidata and how we could fix the constraint violations on Wikidata.

As you know, Wikidata is kind of messy. You have a lot of violations of Wikidata constraints. Wikidata constraints mostly say things like: you should have only one birthday, or the set of possible values for the sex or gender property should be male, female, transgender male, transgender female, and so on. But there are still a lot of violations of these constraints. According to statistics from July 2018, there were still more than 3 million violations. So in this work, we wanted to see if we could find good automated ways to help reduce these kinds of violations.

If you have already worked on fixing these violations, there are often some patterns that emerge. For example, here we take a look at an item. You see that there is a statement for the sex or gender property with the value "woman", and you see that there is a problem, because "woman" is not supposed to be one of the possible values for the one-of constraint on the sex or gender property. And you see that, probably, if the value is "woman", it should be replaced by "female". And so you could write it as a rule, saying that if there is a violation of the one-of constraint for the sex or gender property where the value is "woman", you should replace the value "woman" by the value "female". If we take another example, here we have an entity about a person. They have a place of birth, and you see there is a violation of the value type constraint on place of birth, saying that the value should have a type that is a geographical object, or a fictional location, and so on. And then you see that the value is in a country. So you could probably infer that it makes sense to add the type geographical object to the value.
And so you could also write it as a rule, saying that if there is a violation of this constraint for the place of birth relation, and the object of this relation has a country property, then you should add the type geographical object to this value. And so in this work, we want to mine such rules to suggest to the people that curate the knowledge base and edit the data simple edits to solve these violations.

And why are we focusing on rules and not other things, like, for example, more statistical machine learning approaches? It's because rules are explainable. The idea is that if you suggest an edit to the user, you could say: we have applied this rule. And so the user could know what the algorithm actually did to suggest this edit. Another nice thing is that rules work well with new entities. For example, if you add a new entity to Wikidata, the rules can easily apply to it. You don't rely on things like embeddings that you may need to learn over all entities, and so on.

And so for mining such rules, we need data. What is very nice with a project like Wikidata is that you have the full edit history of everything Wikidata contributors have done on Wikidata. So you can have a look at it and see, for example, that before a revision you have a violation (here a value type violation), then there has been an edit, and then the violation disappeared. Then you know that the revision fixed this violation, and hopefully the revision is the proper way to solve it. And so what we did is extract that from the Wikidata history. To be more specific, what we found is that when you have a look at the constraints, there are two atomic ways to solve a violation of a Wikidata constraint: remove a fact, or add a fact. For example, you see that here, for this value type constraint on place of birth, the two possible ways to solve the violation are to remove the place of birth fact or to add a type. And this generalizes to all constraint types in Wikidata. So what we actually do to build the data set is, for each constraint type, look at the edits that do one of these two things, for example here remove a place of birth or add the geographical object type (and obviously there are subclasses, and also fictional locations and so on). And then we check whether these edits were actually fixing a constraint violation, and if that is the case, we just add them to the data set.

For that, what we did is write SPARQL queries, not against the official Wikidata Query Service, but against a Wikidata history query service that basically holds the Wikidata main-value relationships, the WDT relations you have in the Wikidata Query Service, but for all Wikidata revisions, with all the revision metadata. It is hosted on Wikimedia Cloud Services, and I'm going to leave the link at the end of this presentation if you want to use it for doing research on Wikidata; it has been very helpful for us. And so with that, we have been able to extract more than 75 million past corrections on Wikidata. We created a big data set, which is also available online (the link is at the end of the presentation and in the paper), and it basically states which violation, at which revision, led to which correcting edit.
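A minimal sketch of this extraction idea, purely illustrative (the real pipeline runs SPARQL queries against the history query service; the data structures, property and item identifiers, and helper names below are hypothetical), could look like this:

```python
def is_correcting_edit(violation, violations_before, violations_after,
                       added_triples, removed_triples, relevant_patterns):
    """Return True if this revision plausibly fixed `violation`.

    violations_before / violations_after: sets of violation identifiers at the
        revisions just before and just after the edit.
    added_triples / removed_triples: sets of (subject, predicate, object) tuples.
    relevant_patterns: predicates selecting triples that could fix this constraint
        type (the two atomic fixes: remove the offending fact, or add a fact).
    """
    if violation not in violations_before or violation in violations_after:
        return False  # the violation did not disappear with this revision
    touched = added_triples | removed_triples
    return any(pred(t) for t in touched for pred in relevant_patterns)


# Hypothetical example: a place-of-birth value-type violation fixed by adding
# "instance of geographical object" to the value (identifiers are illustrative).
fixed = is_correcting_edit(
    violation="value-type:P19:Q123",
    violations_before={"value-type:P19:Q123"},
    violations_after=set(),
    added_triples={("Q456", "P31", "Q618123")},   # add a type to the value
    removed_triples=set(),
    relevant_patterns=[lambda t: t[1] == "P31",   # add a type
                       lambda t: t[1] == "P19"],  # or remove the place of birth
)
```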
So here, for example, in the first line of my example, the value type constraint for place of birth for a person called Matsubaccio, with a given specific value, actually led to this edit. And we have that for all our samples. And from that, we were able to actually learn rules.

So how do we do it? As I just said, the goal is to generalize from this data set. Here, for example, we have three examples of solving violations of our example constraint, the value type constraint for place of birth, and the idea is to generalize these examples to get a rule. We do it in two steps. The first one is to generate rules that may be interesting. For that, we take each instantiated rule in our data set that corresponds to a past correction (so what you see here is not a rule yet, it's just one correction of a past violation), and then we generalize it by replacing constants with variables. Here, we replaced most of the constants with variables, but obviously we also generate rules where more constants are still there. So this creates a lot of possible rules. And then we also want to expand them to take care of the context, because we have often seen that the context is useful. For example, if the value of place of birth is in a country, then it is likely a geographical object. But if the value of place of birth has a birth date, then it's probably a mistake and we should remove the place of birth statement. So having the context is interesting, and we wanted to keep it in the rules.

And then, once we have generated these rules, we want to find which are the good ones to keep. To explain that, I need to introduce some vocabulary. So here we have a rule, and we define what a rule body and a rule head are. The rule body is basically the "if" part of the rule. Here, it says: if there is a violation of the value type constraint on place of birth for variable s and variable p, and in the Wikidata graph p has a statement with the country relation. That's the rule body. And then you have the rule head, which is what we are going to do: add a statement, remove a statement, and so on. And so basically, the rule body and the rule head can be seen as queries on our past correction data set plus the Wikidata graph at the revision where the violation solving happened. And so we can define the support of the rule body and the support of the rule head as the number of times the corresponding query matches. For example, the support of the rule body in this example is the number of times we have in our data set a violation of the value type constraint for place of birth where the place has a country statement. And with that, we can define how we rank the rules. Our intuition is basically that a rule is good if it predicts the correct changes: if the rule predicts the good corrections, it's a good rule; if it does not predict the good corrections, it's a bad one. And so we define the confidence of a rule as the number of times the rule applies and gives the correct output, the correct edit, divided by the number of times the rule applies in the data set.
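Written out, as I read the definitions from the talk, the two quantities are roughly:

```latex
\mathrm{supp}(B) \;=\; \#\{\text{past corrections matching the body } B\},
\qquad
\mathrm{conf}(B \Rightarrow H) \;=\;
\frac{\#\{\text{past corrections matching } B \text{ whose edit matches the head } H\}}{\mathrm{supp}(B)}
```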
So if I take an example: the rule says that if you have a violation of the one-of constraint for the gender relation with the value "woman", you should replace the value "woman" with the value "female". And we have a data set of three example corrections for this constraint, two that replace "woman" by "female" and one that replaces "woman" by "male". The body of rule R here, so the violation, matches three times in this example data set, but the rule head, so the edit we should do, is only correct twice. And so the confidence of the rule here is only two thirds.

And so with these two pieces, we can do our rule mining algorithm. We first take our data set of past corrections and generalize it by introducing variables. Then we filter out all of the basic rules that have a too low body support. The idea is that we don't want to keep rules that only apply to one or two or three cases in the data set, because they are not general enough, let's say not interesting enough, to predict corrections for new violations. Then, iteratively, we add context atoms, and we only keep a context atom if the rule keeps a good enough body support, so that we keep having general rules, and also only if it increases the rule confidence, so it adds value, because we do not want context that does not improve the rule and does not give new information that helps to find a good correction. And then, iterating on this algorithm, we keep only the rules that have a good enough confidence. In our experiments, we keep, for example, a threshold of 0.5 on it, but we could pick other thresholds and it also gives interesting results.

So we applied this on 80% of the past corrections we extracted, and we found nearly 200,000 rules. If I pick some examples of top rules: for each constraint type (one-of, single value, value type, and so on), we sorted the rules first by decreasing confidence, and then, if the confidence was the same, by decreasing support. For example, the top rule for the single value constraint was about the gender relation, and it was stating that if the gender relation has multiple values, one of which is "male organism", and the subject is also a member of a sports team, then you should remove "male organism". It's quite funny that we have this rule at the top. It's probably because some bug or automated system introduced a lot of these values for sports team members, and then somebody fixed the problem, and so here we actually learned this fix. If I take another example: the one-of constraint here targets the manner of death relation, which has only a very specific set of allowed values, and the rule says that if the value of this relation is "traffic accident", then it should be moved from the manner of death relation to the cause of death relation. This rule is also quite nice because you see that it is not only removing the bad value, it's moving it to something else, and so actually improving the knowledge base.
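A rough sketch of this mining loop, assuming that `generalize`, `expansions`, `body_support`, and `confidence` stand in for the real operations (and with illustrative thresholds, not necessarily the paper's exact values), might look like:

```python
MIN_BODY_SUPPORT = 10    # illustrative; the talk only says "not one or two or three cases"
MIN_CONFIDENCE = 0.5     # the talk mentions 0.5 as one possible cut-off

def mine_rules(corrections, generalize, expansions, body_support, confidence):
    """Hypothetical sketch of the iterative rule-mining loop described in the talk.

    corrections: the data set of past corrections.
    generalize(c): candidate rules obtained from one correction by replacing constants
        with variables. Rules are assumed hashable.
    expansions(rule): candidate rules obtained by adding one context atom to the body.
    body_support / confidence: evaluate a rule against `corrections`.
    """
    # 1. generate basic rules and drop the ones that are too specific
    candidates = {r for c in corrections for r in generalize(c)
                  if body_support(r) >= MIN_BODY_SUPPORT}

    # 2. iteratively add context atoms, keeping an expansion only if it stays
    #    general enough AND strictly increases the confidence of the rule
    rules = set(candidates)
    frontier = set(candidates)
    while frontier:
        new_frontier = set()
        for rule in frontier:
            for extended in expansions(rule):
                if (body_support(extended) >= MIN_BODY_SUPPORT
                        and confidence(extended) > confidence(rule)):
                    new_frontier.add(extended)
        rules |= new_frontier
        frontier = new_frontier

    # 3. keep only the rules that are confident enough
    return {r for r in rules if confidence(r) >= MIN_CONFIDENCE}
```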
So with that, we did a first experiment with our system. What we did is apply the mined rules to the past corrections that we did not use to mine the rules, and then we computed two metrics, precision and recall. The recall metric is used to know how well we are able to actually predict some correction: we compute the fraction of past corrections for which we are able to find a suggested correction, good or not. For precision, we take only the possible corrections we found and compute the ratio of correct ones. And then we compare the approach to two, let's say, simple baselines. The first one removes the violating main statement, so basically the statement that has the violation warning in Wikidata. The second one, which doesn't work for all constraint types, adds a missing fact that could fix the violation. For example, for the value type constraint, it adds a new type to the subject of the statement.

And so here are the results. We do not plot precision and recall separately; to keep a simple graph, we plot the F1, which is basically a combination of precision and recall. You have here our system, CorHist, in yellow, and we compare it to the two baselines, delete in brown and add in blue. You can see that, except for the single value and missing value constraints, our approach significantly outperforms the baselines. So mining rules actually works quite well compared to suggesting these simple fixes. Two interesting points. For the inverse and symmetric constraints, the add baseline is very good; it's probably mostly because for these constraints the good correction is in nearly all cases just adding the reverse triple (for example, if A crosses B, then B crosses A), and that works very well. And for the single value and missing value constraints, we actually have better precision, but a very small recall. It's mostly because these constraints apply in nearly all cases to external identifiers, and we do not have any description in Wikidata of the external identifiers themselves, so we are not able to mine rules that know something about the external identifiers. And so we have no rules for most of the relations that have these constraints, and so very small recall. It's still an open problem to know how to suggest good corrections for violations on these external identifiers.

And we did another experiment, which is also a way to give back the results of our work to the Wikidata community: we created a Wikidata game that suggests corrections for constraint violations. We ran a trial, and the statistics I'm going to give are for the first three months of the life of the Wikidata game. We had nearly 50 participants, and we loaded into the tool database nearly 50,000 suggested corrections. And we have very mixed results with respect to constraint types; I'm going to describe them here. First of all, the inverse and symmetric constraint types had 22,000 actions (approve or reject a suggested correction), with a very high approval rate of 92%. It's mostly because there were still some easy fixes to do that applied to a lot of items; in our sample it was mostly roads in the Japanese road network, so some users approved a lot of these statements. So it gave us a very good ratio, but it's probably stuff that could have been done by bots. And then for the value requires statement constraint and a couple of other constraint types, we also had a significant number of actions for each one, more than 1,000, and very good approval rates. So we also found good rules there that applied in a lot of cases and had not already been done by Wikidata users and bots.
And then for the other ones, we often have a much lower number of actions and much lower approval rates. Having played with it, my impression, and I have nothing hard to support it, is that in a lot of cases we had bad rules that gave us bad suggestions. But we also had a huge number of, let's say, hard violations that were often edge cases and so on, and for those it's not easy at all to just guess what the good way to fix the violation is. So what remains actually seems quite challenging and interesting. In the violations that remain, we have a lot of bias: we suspect that what could have been done by bots has mostly already been done, and a lot of hard stuff remains, for which we need to find good ways to suggest corrections.

So to conclude. In this work, we first provided what I believe is an interesting data set of past corrections in Wikidata. It's probably quite an interesting learning task to figure out how to find a good correction for constraint violations. And then we provided a first approach to this problem based on rule mining. This approach is significantly better than the baselines in a lot of cases, and it has the advantage of being explainable: because we mined a rule, we can say to the user, we are suggesting this correction because we applied this rule, and because these specific past cases happened. So we have, let's say, an explainable AI system. And my last point in the conclusion is that, with bots on Wikidata, the easy stuff is mostly already done, so it's mostly, let's say, hard violations that remain, and they may need more elaborate techniques like rule mining to be solved.

And that leaves some quite interesting problems open. A good example is a problem like having two birth dates, or a missing birth date. For this, we cannot work with Wikidata only, because from Wikidata alone you have no way to know which of the two birth dates, for example, is the good one. Often, when you have to deal with these violations, you need to check the sources, the Wikipedias for example, to see whether the violation is okay because there are two attested birth dates, so we should keep them both in Wikidata, or whether we should keep only one because the second one is not attested at all, and so on. So it needs external information, and it would be interesting to have systems that are able to use this external, non-Wikidata, non-knowledge-base knowledge to solve this problem. If you are interested in working on it, I would be very interested in collaborating.

And the second point for the future is that we looked here at fixing violations, but similar approaches could maybe be used for other things, for example to suggest edits to users. Let's say I am doing a specific correction in Wikidata on all instances of some class and have already done it for 20 of them; then I could have a system that suggests to me: hey, I have seen you have done exactly the same edit on these 20 items, there are 30 other similar items, do you want to apply the same edit? It would probably allow some users to save a lot of time by automating a part of the work without having, let's say, to write bots or scripts and so on. And the other thing is that we mined rules to solve constraint violations, but what the rules actually give us is what a likely edit is.
And so we could, and I'm not sure if it's a good idea, but it could give us, for some edits, an idea of whether the edit is likely or not. And if the edit is not likely, then it may have a higher chance of being vandalism, and so it may be more interesting for a patrolling contributor to have a look at this edit first. So, thank you for listening to my presentation. There are links for the different things I discussed here, so the paper, the Wikidata game and so on, and the slides are on the showcase wiki page if you want to read them or use the links. Thank you.

Okay, thank you very much, Thomas, for this amazing presentation. So it's time for a Q&A session; we have a few minutes for that. I want to hear from the IRC host, who I think is Isaac or Diego, whether we have some questions from IRC or YouTube. No questions from YouTube. Nothing from IRC now. Okay. Do we have any questions from the room here, from the hangouts? Leila?

Thank you. Can you hear me fine? Yes. Great. Thomas, thanks a lot for the presentation. So my question for you is: outside of the Wikimedia world, if we want to apply the method that you developed to external knowledge bases, how would the edit history component work? Is the idea that you will have an intermediate stage where you provide basically these labels, and then based on those labels the rest of the pipeline can be developed? If you can speak to that, that would be great.

Yes, indeed. So we need some data set to learn from, some labels to learn the edits from. You basically need some kind of revision or change system where you have a before state with a violation and an after state without it. Then, inside the change, we can look at what the triple additions and deletions were that were local enough to actually fix the violation. If you have a look at one of my previous slides, this one, you see that even if the change had multiple thousands of triples added and removed, we could look at what the actual triples were that solved the violation, because from the constraint we know that only some triples could actually solve it, so we have specific triple patterns for them. And as soon as you have that, you can apply the approach. So let's say I have a knowledge base derived from Wikipedia and I run my conversion pipeline, let's say, every month, and I have a constraint system. Then, from the diff between two monthly pipeline executions, I can look at what the changes were and extract from them the interesting additions and deletions that solved violations. Does that answer your question? Yes, thank you.

Thank you, Leila and Thomas. We do have two questions from the YouTube stream. One is about whether you can share the presentation. I think it's a Google Slides deck, so it's a shareable link; I don't know if you can share the presentation. Actually, the PDF is on Commons, and the link is on the Wikimedia Research Showcase wiki page. Okay, perfect. And the other question is about related work, and essentially existing research around this problem of correcting knowledge bases. If you can share a bit of knowledge about existing work in this space, thank you.

Actually, my impression is that most of the work on improving knowledge bases has taken a completely different approach. Here we basically look at what changes the users have made to the knowledge base.
But before Wikidata, knowledge bases (except maybe Freebase) did not have, let's say, a big edit history with a lot of users collaborating and fixing problems and so on, because they were mostly knowledge bases like DBpedia or YAGO, which were completely extracted from Wikipedia, or Freebase, which was probably mostly improved by automated systems and so on. So previous work did not focus on the edit history; it has mostly been, let's say, about using the constraints themselves. For example, there has been work showing that if you have constraints that are, for example, expressed in description logic, then you can find the smallest subset of facts to remove to repair the knowledge base. In our paper, we have a related work section that talks about these things, and you may want to have a look at it. I don't have the names of the other papers in my head, so I'm not able to give them to you, but you should definitely have a look at the related work section.

Okay, these were the two questions from YouTube. I think we can keep other questions, if any, for the end of the other presentation, and maybe we can move to the second presentation of today, from Besnik. If Thomas can stop sharing his screen, I think Besnik can share his. Yes. Okay, hello, Besnik. So the next presentation is from Besnik, who is a postdoc at L3S in Hannover and a former collaborator of the research team. He's going to present some work about Wikipedia tables, and this is another paper which is going to be presented at the Web Conference in San Francisco. Besnik, just to mention, we can see your screen.

Yeah, that was my question. Maybe I'll just open the PDF, that's probably easier.

Whatever works for you.

So the presentation doesn't work; I think this is it. If you can zoom a bit, maybe this can work, no problem. Okay, so the floor is yours, Besnik.

Thank you. Are the slides visible now?

Yes. Okay. Yes, if you can maybe zoom a bit on the central slide, I think it would be easier to see the text.

I did, I put it in full screen now.

Besnik, sorry, we don't see that full screen. I'm not sure, maybe you have not shared your entire screen; that can sometimes happen if you just share the window rather than the screen.

Oh yeah, I see, okay, let me redo it again. Is it better now?

Great, thank you, yes.

Okay, yeah, sorry for this. So yeah, thank you, Miriam, thanks for the invitation. I'm a postdoc at the L3S Research Center here at the University of Hannover. I'm going to talk about some recent work which we named TableNet. What we tried to do here is basically come up with an approach which, given a pair of tables, in this case Wikipedia tables, can determine the relations between those tables, and the relations have to be fine-grained. I will mention later what types of relations we are interested in and what relations are possible in our setting. You can find the paper on the GitHub page; I published the data and some of the code there, and the rest of the code is going to be published soon, before the conference venue, which is sometime in May. So the first question, if you're considering tables, is: why do we actually bother about tables? What is so important about them?
So tables in general contain factual information, and especially the Wikipedia tables are very rich in this type of information. For example, here you have the 100 meters running race: this is the Wikipedia page, which contains a table about continental records, and this table, for example, contains different categories for men and women, groups these individuals into the different continents, and has information about the time in which they finished this 100 meters race, and so on and so forth. So this is quite interesting. And then you have another table in the same article, which is about the season's best results in this type of race. This table is about the category of women, and you see them chronologically ordered from the year 1972 until nowadays.

Now, this information exists in large numbers in Wikipedia; however, it is there mostly, of course, to show these facts, and the facts are just sort of in isolation. This links a bit to the previous presentation about knowledge bases, where you have this factual information and it is usually used in question answering: you have some question, you query the knowledge base, and you get a factual answer. But if you consider more complex questions, there is almost no resource that can answer them. For example, if you want to answer what the time difference is between the best time in the women's 100 meters race in 1974 and in 2018, you would not be able to get this answer from one single source. We saw before that we have multiple tables, and the information is scattered and isolated in different tables and different articles, so basically you cannot really get easy access to such answers. Now, if I put these tables side by side, we can clearly see that if I pick the years 1974 and 2018 (in 2018 we have one person from the United States), then the answer would be that the difference is less than a second. But obviously, you would need to know that these two tables are related in order to come to the idea of joining them to provide one single answer. So this is why tables are important to use for question answering tools or for any other type of research.

In Wikipedia, there are more than 530,000 articles which contain tables, and we've extracted all of these tables. Right now there are more than 3 million extracted tables, and each table has on average roughly 10 rows and, if I'm not mistaken, around six columns. So if you just sum up these numbers, you get more than 32 million rows, and if you convert this type of information into facts, you can get hundreds of millions of additional triples which can go into a knowledge base like Wikidata or any other knowledge base. However, these facts are much richer than, let's say, the infoboxes, which are mostly what is used for knowledge bases, because infoboxes have a very strict template and only a specific, narrow type of information, whereas in tables you have really rich information spanning multiple topics, multiple time points, and so on and so forth. So actually harnessing this information is very, very important.
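As a hypothetical illustration of why alignment matters for question answering (the athlete names and times below are made up; only the "less than a second" conclusion comes from the talk), once two season's-best tables are known to describe the same kind of information, the question becomes a trivial join and lookup:

```python
import pandas as pd

# Two fragments standing in for two related Wikipedia tables (values are placeholders).
old_seasons = pd.DataFrame({"year": [1974], "athlete": ["Athlete A"], "time_s": [11.1]})
recent_seasons = pd.DataFrame({"year": [2018], "athlete": ["Athlete B"], "time_s": [10.9]})

# Once the tables are known to be semantically related, union them and compare rows.
seasons = pd.concat([old_seasons, recent_seasons], ignore_index=True)
best_1974 = seasons.loc[seasons.year == 1974, "time_s"].iloc[0]
best_2018 = seasons.loc[seasons.year == 2018, "time_s"].iloc[0]
print(f"time difference: {best_1974 - best_2018:.2f} s")  # less than a second
```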
What we had to do for the Wikipedia tables (you can find the extractor code in the link below in the slide) is account for the different types of tables that exist in Wikipedia. You have to remember that in Wikipedia, tables are optimized for human readability. They have to make sense when a human reads them, but if you try to process them automatically, there are very different representations, and it's not trivial to infer a structure that you can then use to extract the rows from the table. For example, here you find that you have three columns; however, in the first row you have a cell which spans all three columns, and then you have column headers which span multiple rows, as is the case here with "season", and then you have multiple layers of column headers, for example here "originally released", and then "first released", "last released". So inferring the table structure is not trivial, and this is the first challenge that you encounter when you try to extract these tables from Wikipedia. We did that, and I won't go into the details of how, but we published the tables in a JSON format, with a schema that we infer for these tables. We distinguish between values which are hyperlinked to some other article, which we call instance values, and other, primitive values, which are just, let's say, strings, numbers, dates and so on, and do not link to any Wikipedia article. So this is how you get the table data: you can already download all these three million tables, and you have this JSON structure which you can then use for further processing.

Okay, so this was just an intro into the tables problem and tables in Wikipedia. So what are the challenges, and what are some of the opportunities that we can use? As I said, there is no explicit schema which defines a table. What does a column mean? You might have a column called "name", but this name might refer to various different things, for example names of actors, scientists, race types; it can be anything. So the challenge is how to infer such a representation for a column. Then, of course, as I said, the representation is meant for human readability: you might get the content of the table in HTML or wiki markup, but in both cases this is meant to increase human readability. And in each of these tables there are several columns which are seen as more important, without which the information in the other columns does not make any sense. For example, if I remove the athlete's name in the previous example, then those numbers about the race time and so on are meaningless. Now, if I want to align these tables, I need to take all this information into account.

So that's what we propose. We have this TableNet approach, and what we want to do is basically automatically extract tables from Wikipedia and efficiently align these tables with high accuracy, which is highly important, but we also want to retain high coverage: for each table that I consider as my source table, I want to have a coverage guarantee that I am able to find, with high coverage, the related tables and define such relations.
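To make the extracted-table format mentioned a moment ago concrete, a hypothetical example (the published data set's actual field names and values may differ) might look roughly like this:

```python
# Illustrative sketch of one extracted Wikipedia table in a JSON-like structure.
example_table = {
    "article": "100 metres",
    "table_caption": "Season's best (women)",
    "columns": [
        {"header": "Year", "type": "primitive"},     # plain strings / numbers / dates
        {"header": "Athlete", "type": "instance"},   # cell values link to Wikipedia articles
        {"header": "Time", "type": "primitive"},
    ],
    "rows": [
        {"Year": "1974",
         "Athlete": {"link": "Some_Athlete", "text": "Some Athlete"},  # instance value
         "Time": "11.1"},                                              # primitive value
    ],
}
```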
One example is again the 100 meters race article. Here, in the leftmost table, you have the continental records, which are the best records in this race and can cover any type of athlete that runs this race. And on the right-hand side you have the more specialized tables, which are all-time top 25 women, men, and under-20, so for the juniors. So here we have different types of relations between the tables. Between the leftmost table and the rightmost tables you have a subsumption relation: the leftmost table semantically contains the three other tables on the right. Whereas the tables on the right, for example all-time top 25 women and all-time top 25 men, are kind of equivalent, because they talk about semantically similar information, so I can align them saying that they are equivalent in the sense that they represent semantically similar information. And in the other case there is this subsumption, so the leftmost table can theoretically contain all the information coming from the three other rightmost tables. So this is the alignment objective.

To do this, as I said, there are quite some challenges. The first part is extracting these tables and inferring and generating a schema for them. I'm not going to talk about that, because I think there would be no time, but you can read the paper or just look up the code to see how we do it. I'm going to focus more on this: for any given Wikipedia article that contains tables, how can I generate candidate articles that also contain tables, such that with high coverage I find the tables which can have a relation? And once I have this candidate set, I have to decide which of these table pairs, from my source article and the candidate articles, can be in an alignment. We will see later what these types of relations are: as I said, equivalent and subPartOf relations between tables.

For the first step, the candidate pair generation: as I said, we have more than 530,000 Wikipedia articles, so if you want to ensure coverage, comparing everything with everything leads to a combinatorial explosion, so it's not doable. What you need to do is come up with efficient ways to generate these candidate pairs such that you reduce the irrelevant pairs, but at the same time keep high coverage of relevant article pairs, and for that we propose a set of features which leverage the Wikipedia articles themselves. The first set of features: because we define equivalent or subPartOf to mean that the information in the two tables has to be semantically similar, basically they have to be on the same topic, or the instances in those rows should be semantically related. Therefore the first entry point to check for this consistency is to look at the article abstracts. You have two articles here, for example Game of Thrones and The Handmaid's Tale, and you see that both are TV series, and there should be some lexical cues that give away that these two articles are talking about the same kind of information, namely TV series.
So for that, what we did is compute representations of the articles; we use doc2vec for that, which is similar to word2vec but captures more contextual information. Then we compare their representations, and if the similarity is above some threshold, we say, okay, these two can generate a candidate pair. Similarly, we also use more traditional IR features like TF-IDF. The reason for that is that with word2vec or doc2vec you are not really capturing the importance of the different terms, and this is what the TF-IDF term scoring captures: you try to see if the two Wikipedia articles have these discriminative terms in both of their abstracts, and if they do, it means that they contain similar information, or at least that is what these discriminative terms, captured by the TF-IDF score, suggest.

Then, obviously, since we're talking here about Wikipedia articles, they all belong to a certain set of categories, and categories try to capture this, let's say, topicality. But of course they are assigned by editors, and not in all cases do you have elaborate category associations; sometimes these category associations are temporal, basically grouping different articles into different time spans. So what we did first is compute representations of these Wikipedia categories: we extracted the Wikipedia anchor graph and computed category representations using graph embeddings, and then, for each article pair, we compute the similarity, in this embedding space, of the categories the articles are associated with. We do this for the direct categories and for the parent categories. And again, since most Wikipedia articles have a corresponding entry in one of the knowledge bases, we also check the more elaborate, or more consistent, type taxonomies that exist in knowledge bases and see if the types are similar. Through this we try to capture this coarse-grained topical similarity.

And obviously, since our main goal is to align tables, we also want a very lightweight set of features that say, okay, if you have two tables, the first thing you should check is whether the column headers actually match, and in what positions they match. Because if I have "season" in the first position in one table and "season" in the tenth column in the next table, it obviously means that the tables have completely different semantics, and therefore maybe I shouldn't consider them as a candidate pair. So these are some of the features, and I see that I don't have too much time. What we do then is use these features for filtering, and we also train a supervised model to classify these pairs as relevant or not. And once we are left with a candidate set, we go and decide which actual table pairs can be aligned and which cannot.
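As an illustration (not the authors' code) of two of these candidate-generation signals, abstract similarity via doc2vec and via TF-IDF, a sketch with placeholder texts, parameters, and thresholds could look like:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Abbreviated stand-ins for the article abstracts.
abstracts = {
    "Game of Thrones": "Game of Thrones is an American fantasy drama television series ...",
    "The Handmaid's Tale": "The Handmaid's Tale is an American dystopian television series ...",
}

# doc2vec representations of the abstracts (tiny toy training, illustrative parameters).
docs = [TaggedDocument(words=text.lower().split(), tags=[title])
        for title, text in abstracts.items()]
d2v = Doc2Vec(docs, vector_size=100, min_count=1, epochs=50)
d2v_sim = cosine_similarity(
    [d2v.infer_vector(abstracts["Game of Thrones"].lower().split())],
    [d2v.infer_vector(abstracts["The Handmaid's Tale"].lower().split())],
)[0, 0]

# TF-IDF similarity, which weighs discriminative terms (e.g. "television series") more.
tfidf = TfidfVectorizer().fit_transform(abstracts.values())
tfidf_sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# A pair becomes a candidate only if the similarities pass some tuned thresholds.
is_candidate = d2v_sim > 0.5 and tfidf_sim > 0.1   # placeholder thresholds
```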
This second step is a bit more sophisticated. Each table encodes horizontal and vertical structure; by that I mean the horizontal structure is the schema and the vertical structure is the table rows. So what we do is represent a table through its columns: each table is a sequence of columns, which we encode here with, in gray, the column type, which we already have for columns that contain instance values (for the other columns there is basically no type). In the case of instance values, we average the representations of these instances, obtained through some graph representation, and then we have the textual representation, which is just the column header. Then we encode these tables using a bi-directional LSTM. What is interesting here is that we have this delimiter, which is initialized with the last column representation of the source table, and then you basically generate the representation of the target table which you want to align to. On top of all these hidden representations, you compute sub-alignments between columns in the target table and the source table, which tells you which column in the source table each column in the target table matches, and you also have this position information. The representation of the last column in the target table is then computed based on this column-by-column attention, as we call it, and you use that representation to classify whether a pair of tables is in a subPartOf relation, an equivalent relation, or no relation at all. The column descriptions I might just skip, because otherwise I might not have time to go into the evaluation, but basically, to represent the tables, we have the column description, which is the column header; the instance values, which are the cell values in the column; and the type of the column in the case of instance-value cells, where we just use the least common ancestor type of all these instance values. You can find more information in the paper.

So, the setup. Again, there is no ground truth; this is kind of a novel task. Since we want to ensure coverage, we cannot go for a very large-scale evaluation, so we took a random sample of 50 Wikipedia articles, and then we had to ensure that, for each of the tables in these 50 articles, we find the relevant tables from all the other articles, in this case 530,000 articles, with full coverage. We have to do this in an efficient manner, so we construct manual filters to remove articles which are not relevant; we do this in multiple iterations, and in the last step we crowdsource the remaining pairs. And since the task is quite complex (you're given two tables and then you have to decide between subPartOf and equivalent), we provide comprehensive worker training on the crowdsourcing platform, so that the workers have detailed instructions on how to actually conduct this labeling task. From all these different steps, we crowdsourced 17,000 table pairs, out of which 52% had no alignment, and the remainder had equivalent or subPartOf alignments. The ground truth is available for download as well and can be used as a benchmark for similar tasks.

Okay, so for the first step, as I said, we had this feature set which we computed. We first use the features for pre-filtering, and then we train a random forest classifier which we optimize to increase recall, since we want to have high coverage of these table pairs. What you see in this plot is, for varying classification confidence, how many of the actually relevant table pairs you retain and how many you drop. For us, at a confidence score of roughly 0.5, we have a reasonable coverage of 80%, and so we use this threshold to filter out the candidate pairs.
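A sketch of this recall-oriented filtering step, with synthetic features and labels standing in for the real pair features (and evaluated on the training data purely for brevity), might look like:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
X = rng.random((1000, 6))                                    # 6 hypothetical pair features
y = (X[:, 0] + 0.2 * rng.random(1000) > 0.6).astype(int)     # synthetic relevance labels

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X, y)

# Pick the largest probability threshold whose recall is still at least 0.8,
# so that roughly 80% of the relevant table pairs survive the filtering.
scores = clf.predict_proba(X)[:, 1]
precision, recall, thresholds = precision_recall_curve(y, scores)
ok = recall[:-1] >= 0.8
threshold = thresholds[ok][-1] if ok.any() else 0.5
keep = scores >= threshold      # candidate pairs passed on to the table alignment step
```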
The remainder we then use for the second step, which is the table alignment. Again, don't worry too much about all the numbers in the table, but in the bottom rows you have this TableNet model, the bi-directional LSTM with column-by-column attention, which can distinguish with high accuracy between the three different types of relations (or two in this case, equivalent and subPartOf), with roughly an accuracy of 83%. There are some baselines, for example from the Google Fusion Tables project, which is unsupervised; we modified it so that we could compare against it, and its score is quite low, which is understandable, because they only want to find related tables and do not have a crisp definition of equivalent or subPartOf relations in their setup.

To conclude: TableNet in itself has its methodology for how we conduct this work, but in the end it is the knowledge graph of these aligned tables which I think is quite interesting and important to have. As we saw from the motivation, it can answer really complex questions once you have tables aligned with subPartOf and equivalent relations. And it improves over existing work, because we are the first to define these fine-grained table relations, and we also provide an extensive ground truth which gives you coverage guarantees, which has not been done before. The resources for the paper are accessible on the GitHub page; as I said, a lot of it is already public, and there is some code that needs some polishing and will be published before the Web Conference in 2019. I hope I was not too quick, and I would be happy to answer any questions. Thank you.

Thank you so much, Besnik, for this presentation. Let me ask the IRC and YouTube hosts if we have any questions from the audience.

No questions on YouTube yet.

Nothing in IRC now, but let's see.

Okay, does the room have questions? Not at the moment. Besnik, I do have a question for you. I imagine this work is very much focused on English Wikipedia, because it's a task-based work. As you know, in general we tend to work multi-language. Looking at the type of features you're using in your models, I imagine this work could be extended to other languages, but I wanted to hear your thoughts.

Yeah, correctly noted. I put too many details into the slides, so I could not really elaborate on the individual features, but yes, all the features are, let's say, not language dependent. All these embeddings, graph embeddings and word embeddings, can be trained on different languages, and then they can be operationalized in other languages. The only problem is that you might not have quality guarantees, because you would have to generate ground truth for that; the ground truth is only for English Wikipedia right now. But I assume that, especially for the more popular localized versions of Wikipedia, like German or Spanish, this might work comparably well.

Okay, thank you very much. Do we have any updates from the channels? Leila, I... sorry, we have one question from YouTube. Finn Nielsen asks: has the graph embedding model been published?
Hi, so I'm trying to understand the question. The graph embeddings are basically embeddings from existing work: we use node2vec, just applied on the Wikipedia anchor graph. I think you can apply any graph embedding approach and you will probably get comparable performance. I know that there is quite some variation in the embeddings you get from the different graph embedding approaches, but that was not the main focus of the paper. The reason we computed graph embeddings, especially for the categories, was that the association of the categories follows this taxonomy and is done manually, so we wanted to lift this into a different representation space, so that we can have more reliable comparisons. But the embeddings are basically computed using node2vec. I did not publish them, since they are quite a few gigabytes, but if Finn wants to have the embeddings for the Wikipedia anchor graph, I would be happy to share them with him.

Excellent, thank you. If I can, I'm going to ask another question. I'm not sure if this is something possible, but do you think you can expand this to align tables outside of Wikipedia as well, from other websites?

Yeah. The only step that would kind of fall away from the current approach would be the category features that we use in the candidate generation. The rest would basically not have to change much to be applied to web tables. That is something that I have in mind to extend to and see, because on the web it's harder to give coverage guarantees; it's simply impossible to do that. In the case of Wikipedia the problem was more constrained, and giving these coverage guarantees was still quite challenging, because it took quite some time and effort from crowdsourcing and from myself as well. But yeah, I think it can easily be done. The only thing where Wikipedia has an advantage with respect to, let's say, web tables is that the hyperlinking of the surface forms to actual Wikipedia articles is more consistent, and this gives you additional information. For web tables you would have to do this through an entity linking step that is geared for web tables, so that would be an additional step you might have to take in order to have this richer column representation. Other than that, I think it wouldn't change anything at all.

Okay, that's very exciting work. Do we have any other questions from the room for Thomas or Besnik, or from the other channels? I don't hear anything, so if there are no further questions for our speakers, I think we can thank them very much for joining us today. It was an amazing set of presentations. Thank you both very much, and I guess we will see you all next month for the April research showcase. Thank you. Thank you very much. Thank you, Thomas. Thank you. Thank you all. Bye.