Hello and welcome, everyone. My name is LaLina. I am currently an outreach intern with the Wikimedia Foundation. This talk is pre-recorded, but if you are watching it live I will be in the chat during the whole presentation, so if at any point you have a question, feel free to ask it there.

Today I want to talk to you about Wikipedia and big data. Why big data? Today Wikipedia is one of the largest, if not the largest, knowledge bases that has ever existed. But the typical user will only ever access a very small proportion of the millions of articles that exist in hundreds of languages. Reading all the articles from just the English Wikipedia would take you over 17 years, without any breaks. On the other hand, with just an average computer and some programming ability you can analyze not only all the article content, but also a lot of the other data that exists, such as editor activity, page views, and how users navigate between pages by clicking links, in just a few hours.

So today I want to bring your attention to a different kind of user: one who is not navigating to individual Wikipedia pages through their browser, but accessing and analyzing the underlying data itself, and who can in this way gather insights that are otherwise hidden, a little like needles in a very big haystack. Why should you care? People who use Wikipedia in this way may represent a small proportion of all users, but their work can have a big impact in areas such as social science, business decision-making, and even global policy.

So let's head over to Google Scholar, a search engine for scientific papers. If we type in "Wikipedia" we get over two million results, which is quite impressive. But if you had done the same search in 2012, you would not even have reached 200,000 results, which means this has grown by a factor of ten or so in about ten years.

Let's dive in and take a look at some of these articles. In this study from 2017, researchers from Stanford and Facebook AI used Wikipedia to build a question answering system. A question answering system is more or less what virtual assistants such as Siri or Alexa are: you ask a question and, hopefully, they give you the right answer in a reasonable amount of time. What I found interesting about this particular study is that the QA system is not just able to answer questions from Wikipedia, which of course is valuable in itself; it is also easily transferable to other contexts. Say you have a huge corpus of medical documents. This QA system could be adapted to answer questions from that new knowledge base and, for instance, help doctors diagnose patients. So in this case, Wikipedia is useful not only in itself, but also as a tool for building a system with much wider utility.

Moving on to another study: here researchers used Wikipedia to gather large amounts of social and economic data, which is otherwise very time-consuming and expensive to collect, especially in poor or inaccessible regions of the world. This is possible because many of the articles are geo-located. Here, for instance, we have the Wikipedia page about Kampala, the capital of Uganda, and the highlighted sentence says that it is the 13th fastest growing city on the planet.
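As a quick aside, the location data itself is straightforward to get at programmatically. Below is a minimal sketch, in Python with the requests library, of how you might ask the MediaWiki Action API for an article's coordinates (they are exposed through the GeoData extension's prop=coordinates query). The function name and the Kampala example are just illustrative choices of mine.

```python
import requests

# MediaWiki Action API endpoint for English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"

def get_coordinates(title):
    """Return (lat, lon) for a geo-located article, or None if it has no coordinates."""
    params = {
        "action": "query",
        "prop": "coordinates",   # served by the GeoData extension
        "titles": title,
        "format": "json",
    }
    data = requests.get(API_URL, params=params, timeout=10).json()
    # The result is keyed by page id, so iterate over the returned pages.
    for page in data["query"]["pages"].values():
        coords = page.get("coordinates")
        if coords:
            return coords[0]["lat"], coords[0]["lon"]
    return None

print(get_coordinates("Kampala"))  # roughly (0.31, 32.58)
```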
Coming back to the study: what it found is that when you combine the analysis of millions of statements like this one with other geo-located data, such as satellite imagery and survey data, you get a dataset of much higher overall quality than any of the sources provide individually, and it also contains data that would have been very difficult to obtain otherwise. Having the right data to convince stakeholders can mean the difference between a humanitarian project getting funded or not, and it can influence which regions organizations such as the United Nations focus their poverty relief and education initiatives on.

Yet another use case for Wikipedia data is studying trends and making predictions. In this study, scientists tried to predict the box office success of around 300 movies before they were released, by studying page views and editing activity on the movies' Wikipedia pages. A typical way to do something like this is sentiment analysis, which is often applied to Twitter. If someone tweets "I can't wait to watch this movie, I'm a huge fan of this actor", that counts as a positive statement; if someone else says "everyone is talking about this movie, but I just don't get what all the hype is about", that counts as negative. In this study, however, the authors found that they got just as good results without sentiment analysis, using purely statistical methods. That makes the approach both language- and context-independent, so it could be used for many other purposes. Here we see the correlation between predicted revenue and actual revenue, and most of the points are quite close to the line. Being on the line would mean a perfect prediction, but this is good enough.

But something as large and complex as Wikipedia is also inherently challenging to work with, because the data is spread over many different tables and many different databases. Most use cases require combining several data sources and parsing and filtering the data before it can be transformed into a form that can be analyzed. So it can be challenging just to understand where to find the relevant data and how to extract it, especially for users who do not have a technical background or are not familiar with Wikimedia. Some studies even mention this explicitly, like this one: "While being a scientific treasure, the large size of the data set hinders pre-processing and may be a challenging obstacle for potential new studies. This issue is particularly acute in scientific domains where researchers may not be technically and data-processing savvy."

To deal with this complexity, some researchers have come up with their own scripts and frameworks for working with the data and are sharing them on GitHub. Here, for instance, we see one called WikiExtractor, which has about 2,700 stars and around 800 forks. That may not seem like a lot, but for such a specialized tool it is actually not bad.

Just to end this, I want to say that there are many ways to contribute to Wikimedia, and this could be one of them. So if you are interested in helping build better tools, improving the technical documentation, or even just sharing the challenges that you yourself struggle with when working with Wikimedia data, any contribution is encouraged and welcome. Here are the links to the papers and some additional info.
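And if you would like to try something hands-on, the page-view data that the box office study relied on is openly available through the Wikimedia Pageviews REST API, with no special setup. Here is a minimal sketch, again in Python with the requests library; the article title, date range, and User-Agent string are placeholders you would replace with your own.

```python
import requests

# Wikimedia REST API endpoint for per-article page view counts.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def daily_views(article, start="20240101", end="20240131",
                project="en.wikipedia", access="all-access", agent="user"):
    """Return a list of (date, views) pairs for one article."""
    url = f"{BASE}/{project}/{access}/{agent}/{article}/daily/{start}/{end}"
    # The API asks clients to send a descriptive User-Agent.
    headers = {"User-Agent": "wiki-bigdata-demo (example@example.org)"}
    items = requests.get(url, headers=headers, timeout=10).json()["items"]
    return [(item["timestamp"][:8], item["views"]) for item in items]

for date, views in daily_views("Kampala")[:5]:
    print(date, views)
```

From a list of daily counts like this, it is a short step to the kind of time series that the prediction study was built on.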
And if you have any questions, you are very welcome to reach out on my talk page. Thank you.