Hi, my name is Uma Sawant. I currently work at LinkedIn, but I'm going to talk about something I worked on while doing my PhD. The title of the talk is "Needle in a Haystack: Entity Search on Text and Knowledge Graphs."

All of us use web search engines to find answers to our queries. The search engine gives us a set of documents, and we are expected to go through those documents to find the answers. What if the answer was presented directly? Doesn't that just upgrade your search experience? Entity search builds on this philosophy: it tries to find the needle for you out of the haystack, so to speak. After this talk, I am hoping to leave you with an awareness of this different flavour of search, and if you are a search practitioner, I hope you can carry back some insights to upgrade your own search engines. These are the names of my collaborators; this is an ongoing research project at IIT Bombay. I'll first talk about some background: what are entities, what is entity search, and why is it relevant? That is followed by a deep dive into the technical details, and then we'll go over some of the experiments we did and the insights that came out of them.

So the first question is, what are these entities I'm talking about? Simply put, entities are objects of interest. Those could be people like Sachin Tendulkar, or locations, movies, monuments, phones, cars, universities, and so on: any object of interest. Entity search is a search engine that takes in a text query and presents results in terms of these objects. For example, if my query is "Brad Pitt movies", I don't want to see documents; I want to see a carousel of movies. If my query is "universities India", again I expect to see a carousel of colleges and universities. So in this case, my entities are movies and universities. Research shows that at least a quarter of today's web search queries are actually seeking entities, so if we can do entity search better, we can really upgrade the search experience. What do these queries look like? Some examples are "Bahubali lead roles", "fastest ODI double century", "main oncology universities India", and "which phones have unsafe batteries".

At this point, somebody may ask: aren't today's search engines already doing some of this? Yes, they are. But there are two main problems as we see them today. The first is lack of coverage. Although at least a quarter of web search queries are seeking entities, they are still being answered in terms of documents. If I query for "main oncology universities India": document results. If I query for "which phones have unsafe batteries": again, document results. The second main problem is lack of robustness. For very similar formulations of a query, I see entity results versus document results. If I query for "murder books Agatha Christie", I get a nice carousel of books, which is what I want. But if I query for "murder mysteries Agatha Christie", I fall back to document answers. And if you notice, the actual entities I'm looking for are mentioned in the top results, yet they are not extracted out for me. So one question is, why does that happen today? One main reason these kinds of problems come up is that today's search engines depend on what is known as a knowledge graph for question answering. If you attended Professor Partha's talk yesterday, you would have heard something about knowledge graphs. So what are knowledge graphs?
Very simply, these are collections of facts about entities. They contain entities, types and relations, presented in the format of a graph. Here is a very simple example. It has two entities, Prabhas and Bahubali. These entities are of type Actor and Film, and they are connected via the relation "acted in". To give you some examples of public and proprietary knowledge graphs: Wikipedia is a kind of knowledge graph; DBpedia, which is derived from Wikipedia, is another; then there are Freebase, YAGO, and NELL, the never-ending learning system that Professor Partha mentioned yesterday. These are all public knowledge graphs you can use. In terms of proprietary knowledge graphs, LinkedIn has its own knowledge graph of people, their skills, their jobs and so on; Google has a very nice knowledge graph; Microsoft has Satori; and Facebook has its own graph of people and their connections.

So let's get back to why today's search engines do not perform well when it comes to entity search. As I said, most of them depend on these knowledge graphs for question answering. However, to answer a question from a knowledge graph, you have to translate the text query, such as "Bahubali lead roles", into a query that can be executed on the graph, say something like "select an entity E where E is connected to the entity Bahubali via the relation acted in". This approach works well when your translation is right: you get fast and precise answers.
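To make that idea concrete, here is a toy sketch, purely my own illustration rather than anything from the actual system: a tiny knowledge graph stored as triples, and the structured translation of "Bahubali lead roles" executed against it. The entity and relation names are made up for the example.

```python
# Toy knowledge graph as (subject, relation, object) triples.
triples = {
    ("Prabhas", "acted_in", "Bahubali"),
    ("Rana_Daggubati", "acted_in", "Bahubali"),
    ("Prabhas", "type", "Actor"),
    ("Bahubali", "type", "Film"),
}

def select_entities(relation, target):
    """Structured query: select E where (E, relation, target) holds in the graph."""
    return {s for (s, r, o) in triples if r == relation and o == target}

# Translation of "Bahubali lead roles": find E such that E acted_in Bahubali.
print(select_entities("acted_in", "Bahubali"))  # {'Prabhas', 'Rana_Daggubati'}
```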
The bad part about this approach is that "when" clause: it is not always easy to come up with the right translation for the input query. Let me give you an example. Suppose my query is "fastest ODI double century". It's a very simple query if you think about it. But if I were to execute this on a knowledge graph, I am assuming there is a relation linking cricketers to their double centuries, and then I have to sort over it. For a search engine to do this at large scale is difficult. There are many tail queries like "fastest ODI double century" where you will not be able to produce the right translation, and that is where the problems of low coverage and lack of robustness come from. The second problem, which was also mentioned yesterday, is that building knowledge graphs is very hard. Very talented people have been working on building open knowledge graphs for many years. Intuitively, it is not easy to cram all the knowledge in the world into one knowledge graph, which is why knowledge graphs will always have gaps in coverage: there may be missing entities, nodes or relations. And that is why, when you do question answering on top of them, you get coverage issues. So our goal here is to create a dedicated entity search engine that does better than this: it will have good coverage, it will be able to answer the kinds of queries I mentioned, and at the same time it will have good robustness and accuracy.

All right, so that was the background; I hope the problem statement is clear to everyone. Let's do a deep dive now into the technical details. We said that knowledge-graph-based question answering works, yet it has problems such as coverage and robustness. How are we planning to do better than that? One obvious solution that may have come to your mind is to go and append some of the missing information to the knowledge graph. Suppose relation extraction is hard; maybe we keep the relation in phrase form instead. Sure. There are many such approaches, where you keep increasing the coverage of the knowledge graph while the accuracy of the facts inside it goes down, and various methods work at different points on this spectrum. We thought, why not go all the way? We will use all the information that is out there in conjunction with the knowledge graph, and that source is going to be the corpus of all web documents. That's the main idea behind this talk: we are going to use web documents in conjunction with the knowledge graph.

How do we connect web documents to the knowledge graph? To give you an example, at the top I have the knowledge graph I mentioned before, with the film Bahubali and the actors Prabhas and Rana Daggubati. Below it I have some web documents; for example, one document says "Bahubali leads: Prabhas as Shivudu, Rana Daggubati as Bhallala" and so on. If you notice, this document contains mentions of the same entities that appear in the graph: Bahubali is mentioned here, Prabhas here, Rana Daggubati there. We are going to connect the mentions of these entities with the actual nodes in the graph, resulting in what is known as entity-annotated text. How do we create these links? Using entity annotators; more about that later. So at this point we have a query coming in from the left, "Bahubali lead roles", we want an answer in terms of entities, and in between we have these connected information sources, the knowledge graph and the web documents. That's our entity search engine.

All right, so how do we use these two information sources effectively? The first part I'll talk about is how to match the query with the structured knowledge graph. Very simply, the query is a collection of words. If I want to answer that query in terms of entities, the hints to what I am looking for are present in the query, in terms of entities, types or relations. My job is to extract those hints and map them to nodes on the knowledge graph. How do we do this? We first tag entities in the input query text. Given the query "Bahubali lead roles", I use an entity tagger to identify an entity, in this case Bahubali, and I go and pin it on the knowledge graph. As a second step, I look at all paths on the graph starting from this entity. Here is one example path; there may be another path here, another one there, and so on. All these paths contain entities, types and relations, and they are the current candidate translations of the input query. My next job is to take all these candidate translations and match them with the input text query, and the way I am going to match them is by identifying the type matches and the relation matches. Let's see that one by one.
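As a rough sketch of that step, using hypothetical helper structures of my own rather than the project's actual code, the candidate interpretations can be thought of as the one-hop paths out of the pinned entity, each giving an (entity, relation, answer type) triple to be scored against the query words.

```python
from collections import namedtuple

Edge = namedtuple("Edge", ["src", "rel", "dst"])

# Tiny illustrative graph around the pinned entity "Bahubali".
edges = [
    Edge("Bahubali", "acted_in_by", "Prabhas"),
    Edge("Bahubali", "acted_in_by", "Rana_Daggubati"),
    Edge("Bahubali", "directed_by", "S.S._Rajamouli"),
]
entity_type = {"Prabhas": "Actor", "Rana_Daggubati": "Actor",
               "S.S._Rajamouli": "Director"}

def candidate_translations(pinned):
    """Each path leaving the pinned entity yields a candidate interpretation:
    (pinned entity, relation, type of the answer entity)."""
    return {(pinned, e.rel, entity_type[e.dst]) for e in edges if e.src == pinned}

# For "Bahubali lead roles" the tagger pins "Bahubali"; the type and relation
# matchers then score each candidate against the remaining query words.
print(candidate_translations("Bahubali"))
```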
OK. So as I said, we first identify the entities mentioned in the input query. Here is one open-source tool we use, called TagMe. TagMe is trained using Wikipedia: since Wikipedia already has words marked with links to their source entities, it uses that as training data and trains a machine-learned model. Now if you give it any input text, it is able to identify entities for you. Here is a snapshot of TagMe run on the input text "Bahubali cast: Prabhas, ..." and so on: it is able to identify that the word "Prabhas" is actually a link to the entity Prabhas.

So using TagMe, we first identify the entities. The next question is how we are going to match the input query text to the types and relations on the knowledge graph. For that we built our own convolutional neural network. If you are not familiar with them, do not worry; just think of this as multi-class, multi-label classification. On the left-hand side we have the query coming in as words; on the right-hand side, the classifier outputs a score for each type or each relation. So if one of the candidate types is "acted in" or Actor, it gets one score here. What happens in between is: first, each query word is represented as a vector; then there are some convolutional and pooling layers, which give you a fixed-size representation of the input query. These are the feature representations of the query. We also appended manual query and type word-overlap features. We found that rather than outsourcing everything to the CNN, feeding in our own simple TF-IDF style features actually works better, so we did that. At this layer we have a feature representation of the query, and then a classification layer sits on top. So at this point we have entities marked in the query text, and we are able to score the match between the types and relations mentioned in the query and those on the graph. That was the query-to-knowledge-graph match. Just to make sure you're all with me: so far I am only presenting a way of computing features, the match between a query and its candidate translation. How the actual machine-learned model behaves, I haven't talked about yet; that is still to come, so hold that thought.
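To make the architecture concrete, here is a minimal PyTorch sketch of that classifier as I understood it from the talk; the layer sizes, kernel width and number of overlap features are my own assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn

class QueryTypeScorer(nn.Module):
    """Embed query words, convolve and max-pool to a fixed-size vector,
    append hand-crafted query/type word-overlap features, and emit one
    score per candidate type (the relation scorer is analogous)."""
    def __init__(self, vocab_size, emb_dim=50, n_filters=100,
                 n_overlap_feats=4, n_types=14000):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # warm-started with GloVe
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.out = nn.Linear(n_filters + n_overlap_feats, n_types)

    def forward(self, query_word_ids, overlap_feats):
        x = self.emb(query_word_ids).transpose(1, 2)    # (batch, emb_dim, n_words)
        x = torch.relu(self.conv(x)).max(dim=2).values  # fixed-size query vector
        x = torch.cat([x, overlap_feats], dim=1)        # append TF-IDF overlap features
        return self.out(x)                              # one score per type

model = QueryTypeScorer(vocab_size=50000)
scores = model(torch.randint(0, 50000, (1, 3)),  # e.g. "bahubali lead roles"
               torch.rand(1, 4))
print(scores.shape)  # torch.Size([1, 14000])
```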
All right. So given a query, we are able to match it with the knowledge graph. At the same time, we said we would also use web documents for question answering, so how is that going to happen? As I said, we have entity-annotated web text: mentions of entities in this text are tagged with the original entity, so the word "Prabhas" is actually tagged with the node Prabhas in my knowledge graph. The way I do question answering over this entity-annotated web text is that I first identify text snippets in which query words are mentioned. If my query is "Bahubali lead roles" or "Bahubali lead cast", I identify all the snippets in which one or more of these query words appear. Here is one example. At the same time, I am also looking for the entities mentioned in those snippets. One snippet where it all comes together is "Bahubali cast ...": these are query words, and Prabhas is a tagged entity, so this is a potential candidate snippet. Since there will be thousands or tens of thousands of such snippets, my next task is to score each of them; some of them may be false positives, where the words just came together by chance and do not indicate an answer. Here we again use a CNN to score how well each snippet supports an answer. Internally, the features it uses are the distances between the query words and the entity mention, and the IDF scores of those words. This is the CNN we ended up using; it was taken from the work of Severyn et al. in 2015. Here is the text snippet and here is the query text, so "Bahubali lead roles" comes in here and the snippet goes there. Internally it again uses convolutional and pooling layers to come up with a fixed-size feature representation, plus some additional features, giving a total feature vector followed by its own classification layer. So this CNN gives us a match score between one particular snippet from the web documents and the query text. There are going to be tens of thousands of snippets, each possibly mentioning the same entity or a different one; we sum up the scores over all of them to get one final score for a candidate answer such as Prabhas.

To put it all together: at this stage I have the query coming in from the left. For each query we found some translations on the knowledge graph; maybe "type = Actor" is one, maybe "entity = Bahubali, relation = acted in" is another. In parallel there are text snippets coming in from the web documents. Each of them connects to one or more candidate answer entities, and my job is to look at all this evidence and come up with the final answer. If you recall, at this point we are able to compute the match between the query text and the types, entities and relations, as well as the web text, so we have feature vectors sitting on these edges: the type match score, the entity match score, the relation match score. How do we use these to produce the final ranking?

If I had access to a lot of wealth, I could hire expert labellers who would take a query and identify its best translation; I would do this for thousands or millions of queries and build a machine learning model on top. That doesn't work, of course: I do not have access to all that wealth, and secondly, it is genuinely difficult to come up with the right translation of a query. Sometimes you would think Actor is the right type, but maybe it is the type Person that is connected to my answer entity; how is the labeller going to know that? Because of all these questions, we decided not to create that kind of fine-grained labelled data. Instead we went with weak supervision. By weak supervision I mean that I am not labelling query translations; I am only labelling the right answers for a query. So given the query "Bahubali lead roles", the labellers tell me that Prabhas and Rana Daggubati are some of the correct answers; there may be more. I am not asking for a fine-grained translation such as "entity = Bahubali, type = Actor"; that is hard. To do question answering with this kind of supervision, we use what is called a latent-variable discriminative model. Again, do not worry about the math. What this slide says is: given a query and its feature vectors, so my entity and type scores are sitting here, choose the best translation at that point. As the model gets trained, at every stage you have all the candidate translations laid out, you assign a score to each, you pick the max-scoring query translation, and you ask for a weight vector such that the right answer, the positive answer entity, scores higher than the negative answer entities.
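Written out roughly, and in my own notation rather than the exact formulation from the papers, the setup looks something like this:

```latex
% Text evidence for a candidate entity e: sum of snippet-CNN scores over
% all snippets S(q, e) that mention e near the query words.
s_{\mathrm{text}}(q, e) \;=\; \sum_{t \in S(q, e)} \mathrm{CNN}_{\mathrm{snippet}}(q, t)

% Overall score: maximise over the latent interpretation z (the choice of
% entity, type and relation on the graph); phi collects the type, relation,
% entity and text match features under weight vector w.
s_w(q, e) \;=\; \max_{z}\; w^{\top} \phi(q, z, e)

% Weak supervision: only answer labels are available, so require every
% labelled correct entity e+ to outscore sampled incorrect entities e-.
\min_{w}\;\; \lambda \lVert w \rVert^{2} \;+\;
  \sum_{q} \sum_{e^{+},\, e^{-}} \max\!\bigl(0,\; 1 - s_w(q, e^{+}) + s_w(q, e^{-})\bigr)
```

Here phi and lambda are my shorthand for the feature map and the regulariser; the papers listed at the end give the precise objective.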
In pictures: find me a weight vector, which is my classifier, such that the right answer, Prabhas, scores higher than one of the wrong answers, say Prabhakar, with the help of the best-scoring interpretation at that point. This process goes on iteratively, and since this is a max-greater-than-max kind of constraint, the problem is non-convex; it terminates at a local minimum, and that's the best we could do. This is the total objective function, a translation of what I just said into mathematical terms.

All right, so that was about how we solve the problem. Before we move on to the experimental section, are there any questions?

Yes. So the question is: we did question answering assuming that Prabhas already exists as an entity in my catalogue; is that the right reading? Yes. So how do we handle the case where Prabhas, or some other entity, does not exist in the catalogue; how do we do question answering then? In this work we are assuming that at least the entities are correctly present in the knowledge graph; they may be noisily marked in the web text, but we are making that assumption. This work will help you if the relations or the types are not marked well; that is the assumption. To answer the second part of the question, what to do when the entities are not marked: this work doesn't handle that. If you attended Professor Partha's talk, he was talking about never-ending learning; what they do is rely on the surface forms of entities, types and relations, so you get approximate answers in terms of surface forms. You do not have to know that Prabhas is an entity, only that it is a word which potentially indicates your entity, and there are techniques that will find those words and phrases for you. Maybe you want to look up Oren Etzioni's work, or just look up NELL. OK, maybe we will take one or two more questions and then move on to the experiments.

OK, so I had a question: what are the embeddings you are using? Do you generate them yourself, or do you use some sort of database of embeddings? The question is whether I am using a database of embeddings or generating my own. We actually used GloVe vectors to seed our embeddings, but they were also tuned on our training data; so they were warm-started with GloVe. Just one last question, then maybe we will take more questions at the end. Are these knowledge graphs specific to one particular domain, or do you build knowledge graphs for each domain? Right, so I gave examples of knowledge graphs which are proprietary: LinkedIn's, for example, is going to be a professional knowledge graph, or maybe you are building a product catalogue, which is a domain-specific KG related to e-commerce. What we are working with are open-domain knowledge graphs, which try to capture information from all domains rather than being specific to one.

All right, I am going to move on to the experimental section; let's finish this and then we can take more questions. I gave examples of knowledge graphs earlier; we are going to use one of those, Freebase. It has 29 million entities, around 14,000 types and around 4,600 relations. We used ClueWeb09 as the web page corpus; this is a corpus released by CMU with 50 million pages. Google took this corpus and annotated it with entities, releasing what is known as the FACC1 corpus.
There are, on average, about 13 entity annotations per page. For measuring ranking performance we use mean average precision, and sometimes we slice the ranking, take the top k and convert it to a set; in that case we use F1 as the performance measure. The code, and the project in general, is available at my advisor's website; you can take a look at the link. We used four different query sets, some in keyword format and some in natural language, about 8,000 queries in total.

The first question we wanted to answer was: sure, we went through all this trouble to add an additional source of knowledge, the web text, but are we actually improving performance? So we ran a set of experiments where we used only the web text for question answering, only the knowledge graph, and finally both. Thankfully, as we surmised, we do better when we use both knowledge sources. Here the blue bar indicates experiments using only the knowledge graph, the green bar only the web corpus, and the red bar both. Under different conditions and different models, we always found better performance when using both knowledge sources. Some examples of why this works better: if the query is "Bahubali lead roles", suppose my model erroneously chose the type "lead alloys", and it also chose web text where an entity Solder, of type lead alloys, is mentioned along with the query word "roles". However, there are not going to be many snippets in which this happens, and an important query word, "Bahubali", does not occur anywhere near that snippet. So overall, when you run the model over these features, this kind of entity gets a very low score. On the other hand, with the right translation, where "roles" maps to the type Actor, Prabhas is connected to the type Actor and is also mentioned in web text along with the other query words, and there are a lot of such snippets; this happens quite often, which is why Prabhas gets a much higher score than, say, Solder.

We also compared with related work; there are many research groups working on similar problems. These are some examples of recent works that used only the knowledge graph for question answering. Much of this work is about KG-based question answering, and it also depends on the syntax of the question, the natural-language grammar, to come up with the right translation. Those approaches work well for queries with nice formulations, where you can depend on the grammar, but they fail on the keyword queries that we usually see in web search. It turns out that on three of the four query sets our model did best, and on the fourth it was still doing better than most of the approaches.
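As a quick aside on the two evaluation measures mentioned above, here is roughly what they compute; this is my own illustration, not the project's evaluation code.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked entity list; MAP is its mean over queries."""
    hits, total = 0, 0.0
    for i, ent in enumerate(ranked, start=1):
        if ent in relevant:
            hits += 1
            total += hits / i          # precision at each relevant rank
    return total / max(len(relevant), 1)

def f1_at_k(ranked, relevant, k):
    """Slice the ranking at k, treat it as a set, and compute F1."""
    retrieved = set(ranked[:k])
    tp = len(retrieved & relevant)
    if tp == 0:
        return 0.0
    p, r = tp / len(retrieved), tp / len(relevant)
    return 2 * p * r / (p + r)

ranked = ["Prabhas", "Solder", "Rana_Daggubati"]
relevant = {"Prabhas", "Rana_Daggubati"}
print(average_precision(ranked, relevant))  # ~0.83
print(f1_at_k(ranked, relevant, k=2))       # 0.5
```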
Now, some examples of wins and losses. As I mentioned before, some queries are easier to answer with web text; "fastest ODI double century" is one example, and so are "presidents sworn in on an airplane" or "who was the first US president ever to resign". For these kinds of queries, sometimes the answers are not present in the knowledge graph, or landing on the right translation of the query that would give us the answer on the knowledge graph is very difficult, which is why using web text we were able to do better. On the other hand, sometimes we found that using web text actually led us to wrong answers. For "creator of The Daily Show", it turns out there are many more snippets in the web documents supporting the wrong answers than the right one. This is one of the cases where we felt we have to strike the right balance between the knowledge graph and the web text; just using it blindly does not always work.

We now come to the final two slides, the insights I got from all these experiments. One is that question answering is a hard business. We have made some progress, but there are still many complex, multi-step questions that current models cannot answer. To give some examples: "companies that make fanless motherboards", or "cities where Einstein taught". Here you have to go through multiple steps, such as X is a motherboard, X is fanless, and company Y makes X; you have to chain all these steps to land on the right answer, and achieving that translation is difficult either on the graph or on the web text. So these are questions that remain for the future. We also found through our experiments that web text and the knowledge graph make very nice complementary information sources: the knowledge graph is structured and accurate but has low coverage, while web text has high coverage, since we are using all the data that is out there, but it introduces noise. Striking the balance between the two is critical.

One last slide. This was all about the open domain; what if you have a closed-domain application? This still works for you. Suppose you are doing corporate or enterprise search: you have corporate hierarchies, which teams, which managers, which peers for each person, maybe some keywords on what they are working on, and in parallel you have their wikis and internal documents, which act as plain text. You can use both of these together to do better enterprise search. Similarly, if you are doing product or e-commerce search, you have a nice product catalogue, but at the same time you have user reviews as well; again, you can combine the two to do better search. That brings me to the end of the talk; thank you very much. These are some of the papers that have come out of this research; feel free to look at them if you are interested in more math or more details. We can take questions.

Thank you, Uma. Can we have a round of applause please? To the people who are standing at the back, I am really sorry that you had to stand through the talk, but I am really happy for you, Uma, that you had an amazing full house today. OK, so we can take some questions now. Also, if anybody is feeling reticent to ask questions, you can feel free to tweet it and I am happy to read it out on your behalf; we can just use the Fifth Elephant hashtag. OK, questions.

Hello, Uma. So, does the order of the words matter in the current research, say "lead role in Bahubali" instead of "Bahubali lead role"? Right, so to a large extent. I started this work using only keyword queries; at that time our position was that order should not matter at all. But then I came to the conclusion, and in more recent years figured out, that keyword queries are not the only queries we want to handle; we also want to take the query grammar into account when it is present. So in summary, the system should be able to make use of the grammar when it is present, but not depend on it, because
we also have keyword queries. That is why we sort of outsource this work to the deep neural network, where we make sure we provide nicely balanced query data coming from both keyword and natural-language queries.

Hi, Uma. In using the combination of the two methods, is there a significant performance overhead that you face for search queries like this? Not significant. Yes, there is some performance overhead, but thankfully IR, that is document search, has come a long way; we can make use of index caching and maybe pre-compute some queries, tricks like that, to make this practical.

Hello, Uma, I have one question. When you talk about entities, I think ontology also plays a big role there, so you would usually have a preferred term and associated terms to build it. One of the challenges we often face when adapting public tools for building a knowledge graph is missing terms along with the ontologies. Do you think any of these public sources give us the agility to add more and more entities? Is there any faster way, because it is always a very time-consuming process? Sorry, is the question how publicly available graphs give you the freedom to add more entities? Yes; each of these public sources, and we use Semaphore, by the way, which is one of them, has the same issue: it is very time-consuming when you want to add more entities, so is there any quicker way you think that can be handled? I have seen that mostly it goes through a crowdsourcing plus supervision kind of process: you make your suggestion, but it goes through a review before it makes it into the knowledge graph. So yes, there is some flexibility, but it is not very straightforward.

Yes, Uma, my question is: when we work with such knowledge graphs, sometimes there is a temporal aspect, say something that happened in 2011, or a theory that was proved wrong in 2015, or "who is the president of the US", where we need the time when searching. How can such queries be handled? Right, so we are assuming that that kind of information is available in the knowledge graph. To give you an example, YAGO, one of the knowledge graphs we used earlier, has actually come up with these kinds of time-based facts. So yes, if it is in the graph, or if it is mentioned in the web text, we will be able to make use of it, but we are not doing any specific checks in our model to make sure that temporal facts are covered; we are assuming the knowledge graphs take care of that.

Just a second, how many people have questions? Can you please raise your hand? One, two, three; three questions, we can do that. Right, the one at the top first. OK, am I audible? Can you elaborate a little on the weak supervision part? I have come across this for the first time, so I want to know what it is, maybe in a couple of sentences, and what was the relevance, what boost did we get out of it? Right, I'll give you a quick answer because this is a long question and we can take it offline later. By weak supervision I meant that you do not have very fine-grained supervision of the form "this is the right translation, this is the type, this is the entity, this is the relation"; that kind of translation is very hard to come by. What we could do is: given this query, this is the answer. So by weak supervision I meant that we are missing the intermediate step; we are going directly from query to answer,
and that is why, in our model, we do not take that intermediate query translation as given; we keep it as unknown and take a max over it. But more details, I think, we should discuss later.

Hi, thanks. A two-part question. One is: you used web text to augment the knowledge graph's answering capabilities, right? How would you compare that with an approach where you use web text to augment the knowledge graph itself and then ask questions over it? The reason I'm asking is that having a huge corpus of documents, particularly in domain-specific cases, is hard; the web is kind of a happy case where you have lots of documents. That's one part of the question. The related part is: how do we make the best use of both, that is, what kinds of queries are best answered with the knowledge graph versus with documents? We'll take the second part offline. For the first part: yes, there are approaches that do this. You have the knowledge graph, and you take some text or some description and append it to the knowledge graph, so it is not like you have the whole web text with you while doing question answering, just some high-quality snippets. The advantage you get is that you are only doing question answering on the knowledge graph, so it is going to be faster, and you have a little more flexibility than before, because maybe you added some semantics or some new facts you could answer; but you lose the extra flexibility you get when you have the whole web text with you. If you think about this in terms of head and tail queries, there is a long tail of queries where you are losing flexibility, while you do better in the intermediate range. Does that make sense? Thank you. Do more people have questions? We have one question here, but I'm afraid we are running out of time; do you mind taking it offline? Is that okay?