Hello everyone, this is Gokul. I'm going to present our work on a web-centric, entity-salience-based system for determining the notability of entities for Wikipedia. This work has been carried out under the guidance of Professor Vasudeva Verma as part of the Information Retrieval and Extraction Lab at IIIT Hyderabad.

Coming to the motivation for the problem statement: the rate of creation of new Wikipedia articles is very high, at roughly 16.5k articles per month, and manually vetting them is a cumbersome task that does not scale. We should ensure that only notable entities, that is, entities which actually warrant their own article, get added to Wikipedia. Notability is a test applied by the editors of Wikipedia that decides whether an entity's article should be included. It is based on several factors such as significant coverage of the entity on the web, verifiability of the information present, and so on. Streamlining this article addition process by automatically verifying the notability of entities would greatly help the platform function more efficiently.

There hasn't been much work done in this domain, and essentially only one prior work focuses on this exact problem. That work was implemented for the category of Indian film actors and relies on two major notability criteria: the availability of reliable content on the web, and the coverage of the named entity in that content. Top reliable web domains for a category are identified, such as imdb.com for film actors, and the entity's presence in such a web domain is checked. This corresponds to the reliable web domain features you can see in the picture on the left, where a one means there is an entry for that entity and a zero means there is none. The text related to the entity is also analyzed using handcrafted entity salience measures such as the position of the first mention, the number of times the entity occurs in the first three sentences, and the number of times it occurs in the whole article.

The previous approach failed to address Wikipedia categories such as biological concepts, theories, and so on. We defined two settings to overcome this drawback of the baseline approach. First, we have the generic setting. This setting consists of categories that have a large number of entities, such as actors, films, cricketers, and software companies, for which we can identify a sufficient number of reliable web domains. We have constructed a data set completely from scratch, including entities for many of these categories such as film actors, birds, cities, cricketers, and films. Note that we can identify web domains like Cricbuzz for cricketers and IMDb for film actors.

For performing the final classification we focus on five aspects in total. First, we look at the domain-specific platforms: for categories like film actors we look at domains like imdb.com and filmibeat.com, which have been identified as reliable web domains, and we quantify the entity's salience on these pages using handcrafted entity salience metrics similar to the ones defined in the baseline.
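To make this concrete, here is a minimal sketch of how such salience measures could be computed over a document; the function and feature names are illustrative, not our exact implementation.

```python
import re

def entity_salience_features(text: str, entity: str) -> dict:
    """Hand-crafted salience features for one entity in one document.

    Mirrors the measures mentioned in the talk (first-mention position,
    mentions in the first three sentences, total mentions); the exact
    definitions used in the system may differ slightly.
    """
    # Naive sentence split; a proper sentence tokenizer could be swapped in.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    pattern = re.compile(re.escape(entity), re.IGNORECASE)

    mentions_per_sentence = [len(pattern.findall(s)) for s in sentences]
    total_mentions = sum(mentions_per_sentence)

    # Index of the first sentence that mentions the entity (-1 if absent).
    first_location = next(
        (i for i, count in enumerate(mentions_per_sentence) if count > 0), -1
    )

    return {
        "first_location": first_location,
        "mentions_first_three_sentences": sum(mentions_per_sentence[:3]),
        "total_mentions": total_mentions,
        "mentioned_at_all": int(total_mentions > 0),
    }
```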
Second, we have the Wikipedia ecosystem, where we look at several counts: the number of Wikidata items that mention or are related to the entity, the number of Wikipedia articles that are relevant to or mention the entity, and the number of images and videos on the Wikimedia Commons platform pertaining to the entity. Third, we have the social media platforms, with follower counts on platforms like Twitter and Instagram, which tell us about the popularity of an entity at a given time. Fourth, we have query log analysis, performed via Google Trends, which tells us how interested people are in an entity at a point in time. Finally, we have the online news web domains, where we capture the number of news articles in these domains corresponding to a given entity.

Next, we have the abstract setting. This setting consists of categories that do not fit the generic definition and for which the content is spread across the web; we cannot pinpoint reliable web domains the way we did for film actors, cricketers, and so on. We construct this data set with categories such as concepts in biology, chemistry, and psychology, Carnatic ragas, and computer software. Since we cannot identify reliable web domains in this case, we have to look for alternative ways to gather information about an entity on the web. To this end we define information distribution features. The intuition is that if an entity is notable, its coverage on the web is high, and more documents centered around it will be retrieved when we perform a web search. The idea is to formulate an elaborate search query that includes the entity name, the category name, and some keywords like "information", "profile", and so on, and issue it to a search engine like Google. Of the retrieved documents, the top 15 are considered and partitioned into five sets, so that the first three results form the first set and the last three form the fifth set. These numbers were chosen based on empirical experiments. From each set we pick the document with the largest amount of text; this prioritizes documents that are more likely to give good coverage of the entity. Entity salience features are then extracted for each of these five text-rich documents, which are presumably related to the entity. We also enable controls on the search engine to ensure that only reliable results are retrieved. In this manner we obtain a significant number of relevant documents for a given entity, similar to the generic setting.

For performing the classification in the abstract setting, we rely on three aspects. First, the Wikipedia ecosystem features, which are essentially the same as in the generic setting. Second, the information distribution features, which act as a replacement for the domain-specific features and involve identifying relevant documents and performing entity salience analysis on them. Finally, the query log analysis given by Google Trends, again similar to the generic setting.

Coming to the final classification architecture pipeline: first, we have the numerical feature encoding, which consists of all the web-based features discussed so far. Features like the Wikipedia ecosystem counts and the query trends are common to the abstract and generic settings, whereas in the generic setting we also have the social media and online news counts. The domain-specific features are used in the generic setting, while the web information distribution features are used in the abstract setting. All of these count features are captured in a single numerical feature encoding. Further, the text we have extracted from all the relevant documents, from Wikipedia, from the information distribution documents on the web, and from the reliable web domain profile pages, is encoded using BERT so that we can pay attention to the key parts of the content. We also capture what we call the description of every entity, which is the Wikipedia page description if it is available, or otherwise a web profile snippet taken from the most relevant Wikipedia article or the most relevant web document. We perform max pooling to reduce the dimensionality while capturing the key signals, and pass the result through a feed-forward neural network. We also add a categorical embedding by passing the category ID, to differentiate across the various categories. All of these encoded inputs are passed through a feed-forward network, and we finally arrive at a single number, which gives the final notability label.
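To give a rough idea of this pipeline, here is a minimal PyTorch-style sketch of such a classification head; the input names, dimensions, and layer sizes are placeholders, not the exact configuration of our system.

```python
import torch
import torch.nn as nn

class NotabilityClassifier(nn.Module):
    """Sketch of the classification head described in the talk.

    Assumed inputs (shapes are placeholders):
      num_feats   : (batch, num_feat_dim) web-based count features
      text_emb    : (batch, tokens, bert_dim) BERT token embeddings of the text
      desc_emb    : (batch, tokens, bert_dim) BERT token embeddings of the description
      category_id : (batch,) integer id of the Wikipedia category
    """

    def __init__(self, num_feat_dim=32, bert_dim=768, n_categories=10, hidden=256):
        super().__init__()
        self.cat_emb = nn.Embedding(n_categories, 32)
        self.ffn = nn.Sequential(
            nn.Linear(num_feat_dim + 2 * bert_dim + 32, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, num_feats, text_emb, desc_emb, category_id):
        # Max-pool over the token dimension to keep the strongest signals.
        text_vec = text_emb.max(dim=1).values
        desc_vec = desc_emb.max(dim=1).values
        x = torch.cat(
            [num_feats, text_vec, desc_vec, self.cat_emb(category_id)], dim=-1
        )
        # Single score in [0, 1]; above 0.5 would be read as "notable".
        return torch.sigmoid(self.ffn(x)).squeeze(-1)
```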
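And, going back to the abstract setting for a moment, here is a small sketch of the document selection step I described, partitioning the top search results and picking the most text-rich page from each partition; the query keywords and the `search_results` structure are assumptions, not our exact pipeline.

```python
def build_search_query(entity: str, category: str) -> str:
    # Elaborate query with the entity name, category name, and keywords;
    # the exact keywords are illustrative.
    return f'"{entity}" {category} information profile'

def select_text_rich_documents(search_results, docs_per_set=3, n_sets=5):
    """Pick one text-rich document from each partition of the top results.

    `search_results` is assumed to be a list of (url, page_text) pairs,
    already ordered by search-engine rank and filtered for reliability.
    The 15-into-5 split follows the numbers mentioned in the talk.
    """
    top = search_results[: docs_per_set * n_sets]
    selected = []
    for i in range(0, len(top), docs_per_set):
        chunk = top[i : i + docs_per_set]
        if chunk:
            # Prefer the document with the most text, since longer pages
            # are more likely to give good coverage of the entity.
            selected.append(max(chunk, key=lambda doc: len(doc[1])))
    return selected
```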
Coming to the results of our system: we use precision, recall, F1 score, and accuracy, which are the standard classification metrics, and we see that our system significantly outperforms the baseline we defined. There is an improvement of about 9% in the generic setting, shown in the first table, and an improvement of nearly 14% in the abstract setting.

Coming to the future work: a drawback of our system is that it is primarily designed for single entities. These include actors like Shah Rukh Khan and Salman Khan, cricketers like Virat Kohli, and so on. There are articles in Wikipedia with complex titles that can involve multiple entities or concepts, such as the list of collaborations of Mani Ratnam and A. R. Rahman. There can also be a non-trivial relationship between a title and the category it belongs to, such as an island's article appearing in the birds category. It is important to identify whether a given title is simple or complex, and also to identify which entities are involved in the title. We are working on incorporating these aspects into the system using a graph-based approach for correlating the entities and concepts within a title, which would help in handling complex titles as well. Thank you.