Hi everyone, thank you for joining our session today. I will be presenting "Identifying the Gaps in the Coverage of Web Domains in Wikipedia and Wikidata for Credibility Assessment Purposes." My name is Melinda Lu, and this is work that I did last summer with my peers at the Wellesley Cred Lab, supervised by my advisor, Eni Mustafaraj. One of the goals of our research at the Wellesley Cred Lab is to assess the state of the current online information environment and the role that actors like Google and Wikipedia play in shaping this environment through their content and organization of information. The research I'm presenting today continues to explore the interplay between these two platforms.

The presentation is composed of four parts. First, I will briefly introduce Google's About this result feature. Then I will discuss our data collection process, which starts with auditing search results on Google, then extracting links to Wikipedia pages and domain properties from Wikidata. I will then share the results of our analysis and finally discuss what our research means for future work.

As you can see in this example, all results in the Google search engine result pages, or SERPs, are now accompanied by a vertical ellipsis menu item. If we click on the vertical ellipsis, Google reveals a pop-up window, About this result. If we examine the content in this window, we notice that the most prominent element is the source, referring to the web domain of this result. Typically there are two types of information: either a short description of the site, usually lifted from Wikipedia, or the time span for which the domain has been indexed in Google Search.

So why did Google implement this feature? According to the feature announcement on February 1, 2021, the About this result pop-up window was designed to help users make more informed decisions about which sites to visit without having to do another Google search on the site's background. But there are also other uses for this feature.
With the Wikipedia page linked in each About this result window, we can also access the corresponding Wikidata entry, which contains structured data that helps researchers automate the process of identifying whether a site meets certain credibility criteria, for example, the awards that the organization has received. Furthermore, the characteristics of a domain can be useful in auditing the nature of Google search results. For example, we can tackle questions such as how many domains belong to commercial entities versus nonprofit organizations and, overall, how diverse the domains that appear in Google search results are.

These applications of the About this result content motivate our two research questions. First, how often does Google's About this result link to a corresponding Wikipedia entry? Second, does the associated Wikidata entry contain structured data that can be mapped to W3C credibility signals? For some background, W3C has a committee working on credibility signals, and they have proposed a large number of them. Although the list is quite long, we selected a few examples most relevant to our research, as you can see on this slide.

Now let's turn our attention to the data collection process. When we audit Google Search, we try to focus on topics whose results have real-world impact, such as political elections or gender identity. Then, for each topic, we gather possible search queries by interviewing people or by other means. Some examples of the queries that we use are shown. To collect the search result pages, we use Selenium and ChromeDriver to automate the process: opening the Chrome browser, setting the geolocation, searching the queries, and finally saving the search pages as HTML files.
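The collection loop just described could be sketched roughly as follows. This is a minimal illustration, not the lab's actual script: the helper names (`build_search_url`, `output_path`, `collect_serps`), the file naming scheme, and the use of the Chrome DevTools geolocation override are all assumptions made for the example.

```python
from pathlib import Path
from urllib.parse import urlencode


def build_search_url(query: str) -> str:
    """Return a Google search URL for one query (hypothetical helper)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


def output_path(out_dir: str, index: int) -> Path:
    """Return the file path for the index-th saved SERP (assumed naming scheme)."""
    return Path(out_dir) / f"serp_{index:04d}.html"


def collect_serps(queries, out_dir, latitude, longitude):
    """Open Chrome, set a geolocation, run each query, and save the page HTML."""
    # Imported here so the helpers above can be used without Selenium installed.
    from selenium import webdriver

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    driver = webdriver.Chrome()  # assumes a local Chrome and ChromeDriver setup
    try:
        # Assumption: geolocation is set through the Chrome DevTools Protocol;
        # the talk does not say which mechanism was actually used.
        driver.execute_cdp_cmd(
            "Emulation.setGeolocationOverride",
            {"latitude": latitude, "longitude": longitude, "accuracy": 100},
        )
        for i, query in enumerate(queries):
            driver.get(build_search_url(query))
            # page_source holds the rendered HTML of the current page.
            output_path(out_dir, i).write_text(driver.page_source, encoding="utf-8")
    finally:
        driver.quit()
```

A call such as `collect_serps(["2022 US elections"], "serps", 42.29, -71.31)` would save one HTML file per query; the query and coordinates here are placeholders.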
Since the About this result window only appears after clicking on the vertical ellipsis and is not contained in the static HTML code, we also use Selenium to automate this step. From the collected About this result content, we extracted the embedded Wikipedia link whenever it was present. From each Wikipedia page, we then extracted the link to the corresponding Wikidata entry by clicking on the Wikidata item using Selenium, and from there we extracted the item ID, or QID, for the entry.

To provide some background on Wikidata entries, here's what a typical page looks like. In this example, the entry is for Wellesley College, which has the unique QID Q490205. If you go down the page, you can see two columns of information: properties, which are categories of information, and their associated values. In our example, we can see that Wellesley College is an instance of a university, a private not-for-profit institution, and a women's college, and it's also part of the Seven Sisters. Once we had the Wikidata entry IDs, we extracted all properties and property values using Wikidata's own SPARQL query service, which takes in a QID and returns a complete list of properties and values for that entry.

With all the Wikipedia and Wikidata information on the domains appearing in Google Search, we can now answer our original research questions. Recall that the first question we asked was: how often does Google's About this result link to a corresponding Wikipedia entry? It turns out that Wikipedia's coverage of web domains appears to be low. In particular, for queries related to the 2022 US elections, only 39% of search results had Wikipedia links, and for queries related to gender expression and relationships, only 37% had Wikipedia links.

Our second research question was: does a domain's associated Wikidata entry contain structured information that could be mapped to W3C credibility signals?
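As a brief aside on the property extraction step mentioned earlier, a query against the public Wikidata SPARQL endpoint might look like the sketch below. The function names, the exact query text, and the User-Agent string are assumptions made for illustration; the query actually used in the study is not given in the talk.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(qid: str) -> str:
    """Build a SPARQL query listing direct properties and values of one item."""
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata QID: {qid!r}")
    return f"""
SELECT ?propLabel ?valueLabel WHERE {{
  wd:{qid} ?claim ?value .
  ?prop wikibase:directClaim ?claim .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


def parse_bindings(payload: dict) -> list:
    """Turn SPARQL JSON results into (property label, value label) pairs."""
    return [
        (row["propLabel"]["value"], row["valueLabel"]["value"])
        for row in payload["results"]["bindings"]
    ]


def fetch_properties(qid: str) -> list:
    """Send the query to the Wikidata endpoint and return the parsed pairs."""
    url = WDQS_ENDPOINT + "?" + urlencode({"query": build_query(qid), "format": "json"})
    # Hypothetical User-Agent; replace with one that identifies your project.
    request = Request(url, headers={"User-Agent": "serp-audit-sketch/0.1"})
    with urlopen(request) as response:
        return parse_bindings(json.load(response))
```

Calling `fetch_properties` with a QID returns pairs such as a property label and its value label, which can then be tallied across all collected entries.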
To answer this question, we examined the common properties that appeared on the Wikidata entries we collected. As you can see from the first table, instance of, official website, and inception are often present on Wikidata entries, with occurrence rates over 80%. They all provide useful background information on the domain's organization, potentially relating to W3C credibility criteria such as publication type. Some other useful properties, like founded by, owned by, and awards received, correspond more directly to W3C credibility signals and are also present. However, they are more sparsely populated among Wikidata entries, as you can see in the second table on the right.

If we look at the top values for the property instance of, we also see that a variety of domain organization types exist. However, some categories are more useful than others. For example, being an instance of website, business, or organization doesn't really tell us much about the nature of the domain. But being an instance of a public educational institution, open access publisher, or magazine may indicate something about the domain's credibility.

In general, we conclude that for the audited topics, the coverage of domains in Wikipedia is less than 40%, which means that more research should be done to identify what is missing and why. In addition, we discovered that Wikidata has a number of properties, but it is difficult to map them directly to W3C credibility signals, so future collaboration between the platforms is needed. And last but not least, for existing properties such as instance of, the values are not always useful. Therefore, identifying more useful property values to include on Wikidata should be considered.

Thank you for your attention. I will be happy to answer any questions you might have.